Efficient algorithms for SNP genotype data analysis using hidden Markov models of haplotype diversity

Date of Completion

January 2009


Engineering, Computer|Biology, Bioinformatics|Computer Science




Advances in SNP genotyping technologies have played a key role in the proliferation of large-scale genomic studies, leading to the discovery of hundreds of genes associated with complex human diseases. Currently, such studies involve genotyping thousands of cases and controls at up to millions of single nucleotide polymorphism (SNP) loci, generating very large datasets that require scalable analysis algorithms. Continued success requires efficient algorithms that use accurate statistical models and can process massive amounts of data.

This thesis presents several highly scalable algorithms that use hidden Markov models (HMMs) of haplotype diversity to solve SNP genotype data analysis problems. First, we propose novel likelihood functions based on these HMMs for the problems of genotype error detection, imputation of untyped SNPs, and missing data recovery. Empirical results on real and simulated genotype datasets show significant improvements over other methods. Next, we contribute a novel method for imputation-based local ancestry inference that effectively exploits linkage disequilibrium (LD) information. Experiments on simulated admixed populations show that imputation-based ancestry inference is significantly more accurate than the best current methods for closely related ancestral populations. Finally, we introduce a hierarchical-factorial HMM that integrates sequencing data with haplotype frequency information and is used by efficient decoding algorithms for genotype calling. We demonstrate that highly accurate SNP genotypes can be inferred from very low-coverage shotgun sequencing data using this HMM.