Bayesian Phylogenetics Model Selection, and Methods of Detecting Non-independent and Heterotachous Molecular Evolution

Date of Completion

January 2011


Biology, Biostatistics|Biology, Evolution and Development|Biology, Bioinformatics




Phylogenetics, the study of evolutionary relationships among groups of organisms, plays an important role in modern biological research, including genomic comparison, detecting orthology and paralogy, estimating divergence times, reconstructing ancient proteins, finding the residues that are important to natural selection, identifying mutations likely to be associated with disease, and determining the identity of new pathogens. Bayesian phylogenetics, a relatively new field, makes it possible to estimate a phylogeny and measure the uncertainty of every branch simultaneously, and to incorporate complex statistical models into phylogeny estimation. It is critical that the phylogeny estimation be accurate. Phylogenetic relationships are usually inferred from molecular sequences using statistical models that reflect the complex and dynamic evolutionary process. Recently proposed statistical models have become increasingly complicated, especially in analyses of multi-gene data sets. This dissertation introduces a new method for choosing an economical partition strategy for the data, one that allows the model to fit the data well but discourages unnecessary partitions that contribute little to goodness-of-fit. Besides over-fitting, phylogenetic inaccuracies can be caused by violations of model assumptions. Most contemporary statistical models assume that (1) sites in molecular sequences (either nucleotides or amino acids) evolve independently, and (2) the evolutionary rate for each site is constant across the entire tree. These two assumptions, however, are usually violated in empirical data, which may lead to unreliable phylogenetic trees. This dissertation introduces a conditional autoregressive (CAR) prior model to relax the first assumption and a Dirichlet process prior heterotachy (DP-heterotachy) model to relax the second.