Date of Completion

12-13-2013

Embargo Period

12-13-2013

Advisors

Sanguthevar Rajasekaran, Yufeng Wu

Field of Study

Computer Science and Engineering

Degree

Master of Science

Open Access

Open Access

Abstract

Metatranscriptomics is the study of gene expression levels in uncultured microbial communities sampled directly from their environment. In cultured microbes, the sequencing data comes from a single clone, making assembly and annotation tractable. However, in metatranscriptomics, the data comes from communities with diverse microbial compositions, possibly containing thousands of species. The assembly of transcripts from these complex samples represents a serious computational challenge. The development of genome-independent methods is essential because of the lack of trusted reference genomes that can be used to guide the assembly process. The predominant assembly formalism used for short-read datasets is the de Bruijn graph. In a de Bruijn graph, reads are split into consecutive overlapping words, or k-mers, which are then used to build a connectivity graph. In order to reduce the complexity and improve the results of the assembly process, genome-independent methods that cluster the reads into bins can be employed. Previous methods for read clustering process single samples only. In this thesis we propose a novel method for binning reads from N > 1 metatranscriptomics samples. In the first stage we run a sample-by-sample error removal algorithm to remove as many of the incorrect k-mers as possible from the de Bruijn graph. In the second stage we use the structure of the de Bruijn graph and the N-dimensional count vectors of each k-mer to identify paths in the de Bruijn graph that correspond to maximal substrings appearing in the same transcripts with the same multiplicity (pseudo-exons). To validate our algorithm, we simulated RNA-Seq data (both error-free and with errors) using real expression levels of human genes from the GNFAtlas2 project [8]. We created several different workflows for processing the data and obtained promising results. One of the biggest challenges we faced was dealing with errors: since our algorithms are based on k-mers, a single error in a read results in multiple erroneous k-mers. For an error rate of 0.1%, over two thirds of the k-mers are incorrect; nevertheless, we achieved a very good balance between the number of correct k-mers retained (47,708,616 out of 49,540,693) and the number of erroneous k-mers retained (29,583 out of 108,528,982).
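
To make the graph construction described above concrete, the sketch below (not the thesis implementation) shows one way reads from N samples could be decomposed into overlapping k-mers and organized into a de Bruijn graph in which each k-mer node carries an N-dimensional per-sample count vector. The function name `build_debruijn`, the choice of k, and the toy reads are illustrative assumptions only.

```python
from collections import defaultdict

def build_debruijn(samples, k=25):
    """Minimal de Bruijn graph sketch (illustrative, not the thesis code).

    Nodes are k-mers, each annotated with an N-dimensional count vector
    (one entry per sample); edges connect k-mers that overlap by k-1 bases
    within a read. `samples` is a list of N read lists.
    """
    n = len(samples)
    counts = defaultdict(lambda: [0] * n)   # k-mer -> per-sample count vector
    edges = defaultdict(set)                # k-mer -> successor k-mers

    for sample_idx, reads in enumerate(samples):
        for read in reads:
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            for kmer in kmers:
                counts[kmer][sample_idx] += 1
            for a, b in zip(kmers, kmers[1:]):
                edges[a].add(b)             # consecutive k-mers overlap by k-1 bases
    return counts, edges


# Toy usage with two samples and very short reads (k reduced for illustration).
samples = [["ACGTACGT", "CGTACGTT"], ["ACGTACGA"]]
counts, edges = build_debruijn(samples, k=4)
print(counts["ACGT"])          # per-sample count vector for 'ACGT', here [3, 1]
print(sorted(edges["ACGT"]))   # successor k-mers, here ['CGTA', 'CGTT']
```

In this toy setting, a path along nodes whose count vectors agree across samples corresponds to the kind of maximal shared substring (pseudo-exon) the thesis method aims to extract.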

Major Advisor

Ion Mandoiu
