Title

Scalable algorithms for analysis of genomic diversity data

Date of Completion

January 2008

Keywords

Biology, Bioinformatics; Computer Science

Degree

Ph.D.

Abstract

Now that the complete genome sequences of several species, including human, have been determined, genomics research is focusing on the study of DNA variation, with the goal of answering fundamental questions ranging from the genetic basis of disease susceptibility to the pattern of historical population migrations and DNA-based species identification. These large-scale genomic studies are facilitated by recent advances in high-throughput genomic technologies such as sequencing and SNP genotyping. Computationally, the huge amount of data to be processed raises the need to integrate recently developed statistical models of the structure of genomic variability with efficient combinatorial methods that deliver predictable solution quality.

In this thesis we propose efficient algorithms for several problems arising in the study of genomic diversity within human populations and among species. First, we introduce a highly scalable method for reconstructing haplotypes from SNP genotype data based on the entropy minimization principle. We present extensive empirical results showing that the proposed method achieves accuracy close to that of the best existing methods while being several orders of magnitude faster. Second, we give improved haplotype reconstruction algorithms based on a Hidden Markov Model (HMM) of haplotype diversity in a population. Third, we use the proposed HMM to develop efficient and accurate methods for other problems in the analysis of whole-genome SNP genotype data, including imputation of genotypes at untyped SNP loci based on higher-density reference haplotypes. Finally, we propose new methods for species identification based on short DNA sequences called barcodes, and present a comprehensive assessment of the effect of barcode repository size (number of samples per species, barcode length, etc.) on identification accuracy.