Probabilistic structure and statistical inference for nonexplicit population models of allele frequency

Date of Completion

January 2003


Biology, Genetics|Statistics




This work is an interdisciplinary effort combining statistics and population genetics. First, using properties of moment stationarity, we develop exact expressions for the mean and covariance of allele frequencies at a single locus for a set of populations subject to drift, mutation, and migration. Some general results are obtained for arbitrary mutation and migration matrices. The drift-mutation-migration process is not ergodic when any finite number of populations is exchanging genes. In addition, we provide closed-form expressions for the mean and covariance of allele frequencies in Wright's finite-island model of migration under several simple models of mutation. The traditional diffusion approximation provides a poor approximation of the stationary distribution of allele frequencies among populations when correlation among populations is large.

To incorporate the correlation among populations, we propose a mixture beta model to describe allele frequencies. Using simulated data, we show that the mixture model provides a good approximation of the allele frequency distribution in general, and that the correlation can be recovered well from the model unless a large proportion of the allele frequencies in the data are 0 or 1. The mixture model is illustrated with a data set from human populations, and we also extend the approach to data with different clusters. Inference is performed in a Bayesian framework.

An important fraction of recently generated molecular data are dominant markers. They contain substantial information about genetic variation, but dominance makes it impossible to apply standard techniques to calculate measures of genetic differentiation, such as F-statistics. We present a Bayesian approach to inference of genetic structure from these markers. Instead of assuming a common FST across all loci, we assume nonhomogeneous genetic differentiation among loci, and multiple FST values are estimated directly from the sample.
Loci with similar genetic differentiation are detected using a clustering method, and the number of FST values is determined using the deviance information criterion (DIC) and the L measure. The estimates of FST incorporate uncertainty about the magnitude of within-population inbreeding. We also propose measures for identifying outlying populations. We illustrate the method with RAPD data from 14 populations of an endangered North American orchid, Platanthera leucophaea.
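
As a rough illustration of the drift, mutation, and migration process referred to in this abstract, the following minimal sketch simulates allele frequencies at one biallelic locus in a finite-island model. It is not the method developed in the dissertation. The function names, parameter values, symmetric two-allele mutation scheme, and the order of migration, mutation, and drift within a generation are all assumptions chosen for illustration.

```python
import random


def step(freqs, n_genes, m, u, rng):
    """Advance every population by one generation (illustrative assumptions)."""
    k = len(freqs)
    total = sum(freqs)
    new_freqs = []
    for p in freqs:
        # Migration: a fraction m of each population is replaced by migrants
        # drawn equally from the other k - 1 populations (finite-island model).
        others = (total - p) / (k - 1)
        p_mig = (1 - m) * p + m * others
        # Mutation: symmetric two-allele mutation at rate u (an assumption).
        p_mut = p_mig * (1 - u) + (1 - p_mig) * u
        # Drift: binomial sampling of n_genes gene copies.
        count = sum(rng.random() < p_mut for _ in range(n_genes))
        new_freqs.append(count / n_genes)
    return new_freqs


def simulate(k=5, n_genes=100, m=0.05, u=0.001, generations=200, p0=0.5, seed=1):
    """Return the allele frequency in each of k populations after the run."""
    rng = random.Random(seed)
    freqs = [p0] * k
    for _ in range(generations):
        freqs = step(freqs, n_genes, m, u, rng)
    return freqs


def mean_and_variance(freqs):
    """Mean and (population) variance of allele frequencies across populations."""
    mean = sum(freqs) / len(freqs)
    var = sum((p - mean) ** 2 for p in freqs) / len(freqs)
    return mean, var


if __name__ == "__main__":
    final = simulate()
    print(final, mean_and_variance(final))
```

Repeating the simulation over many seeds and comparing the empirical mean and covariance of the frequencies with analytical expressions is one way such formulas could be checked numerically.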