Title

Modelling genetic data using Bayesian hierarchical models

Date of Completion

January 2007

Keywords

Biology, Genetics|Statistics

Degree

Ph.D.

Abstract

Populations diverge from each other as a result of evolutionary forces such as genetic drift, natural selection, mutation, and migration. For certain types of genetic markers, and for single-nucleotide polymorphisms (SNPs) in particular, it is reasonable to presume that genotypes at most loci are selectively neutral. Because demographic parameters (e.g., population size and migration rates) are common across all loci, locus-specific variation, which can be measured by Wright's FST, will depart from a common mean only for loci with unusually high or low rates of mutation or for loci closely associated with genomic regions having a substantial effect on fitness. We propose two alternative Bayesian hierarchical-beta models to estimate locus-specific effects on FST. To detect loci whose locus-specific effects are not well explained by the common FST, we use the Kullback-Leibler divergence (KLD) to measure the divergence between the posterior distributions of the locus-specific effects and the common FST, and we calibrate the KLD using a coin-flip experiment. We conduct a simulation study to illustrate the performance of our approach for detecting loci subject to stabilizing or divergent selection. We apply the hierarchical-beta models to a subset of single-nucleotide polymorphism data from the HapMap project. With this method, we identify 15 SNP loci with unusual among-population variation; 9 of the 15 are located within or near identified genes.

Because loci that are physically associated with one another are likely to show similar patterns of variation, we introduce conditional autoregressive (CAR) models to incorporate the local correlations among loci. We use two levels of KLD, global and local, to identify loci that depart from the overall mean FST or from neighboring loci. We apply the CAR models and the hierarchical-beta models to a high-resolution SNP data set from the HapMap project.
Model comparison using several criteria, including the deviance information criterion (DIC) and the logarithm of the pseudo-marginal likelihood (LPML), reveals that CAR models are superior to alternative models for the data used in the analysis. Using global and local KLDs, we identify several clusters of loci with unusual patterns of variation. We find that one cluster is located around the leptin gene (LEP), which several independent studies have confirmed to be related to human obesity.

In another application of modelling genetic data, we propose a hierarchical Bayesian model to estimate the proportional contribution of source populations to a newly founded colony. Samples are derived from the first-generation offspring in the new colony, but mating may occur preferentially among migrants from the same source population. Genotypes of the newly founded colony and allele counts of source populations are used to estimate the mixture proportions, and the mixture proportions are related to environmental and demographic factors that might affect the colonizing process. We estimate an assortative mating coefficient, mixture proportions, and regression relationships between environmental factors and the mixture proportions in a single hierarchical model. The first-stage likelihood for genotypes in the newly founded colony is a mixture multinomial distribution reflecting the colonizing process. The environmental and demographic factors are incorporated into the model through a hierarchical prior structure. We conduct a simulation study to investigate the performance of the models under different levels of population divergence and different numbers of loci included in the analysis. We use Markov chain Monte Carlo (MCMC) simulation to conduct inference for the posterior distribution of the model parameters. We apply the model to a data set derived from gray seals in the Orkney Islands. We compare our model with a similar model previously used to analyze these data.
Both the simulation and the application to real data confirm that our model provides better estimates of the covariate effects.
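
The abstract mentions calibrating the KLD with a coin-flip experiment but does not give the formula. The sketch below illustrates one common calibration of this kind, and it is an assumption that the dissertation uses this exact form: a KLD value k is mapped to the heads probability q of a biased coin such that the divergence between a fair coin and that biased coin equals k. The function names `coin_flip_calibration` and `bernoulli_kld` are hypothetical and chosen for illustration only.

```python
import math


def bernoulli_kld(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats, for 0 < p < 1 and 0 < q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def coin_flip_calibration(kld):
    """Map a KLD value to a biased-coin probability q in [0.5, 1).

    Assumed calibration (not taken from the abstract): solve
    KL(Bernoulli(0.5) || Bernoulli(q)) = kld for q >= 0.5.
    Since that divergence equals -0.5 * log(4 * q * (1 - q)),
    the solution is q = 0.5 * (1 + sqrt(1 - exp(-2 * kld))).
    """
    if kld < 0:
        raise ValueError("KLD must be non-negative")
    return 0.5 * (1.0 + math.sqrt(1.0 - math.exp(-2.0 * kld)))


if __name__ == "__main__":
    # Illustrative KLD values only; these are not results from the dissertation.
    for k in (0.0, 0.05, 0.5, 2.0):
        q = coin_flip_calibration(k)
        print(f"KLD = {k:.2f} -> calibrated q = {q:.3f}")
```

Under this assumed calibration, q near 0.5 means a locus's posterior is about as hard to tell apart from the common FST as a fair coin is from a nearly fair one, and q near 1 indicates a large departure. Any cutoff on q for flagging a locus would be a separate modelling choice that the abstract does not specify.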