Bayesian methods for high-throughput gene expression data in bioinformatics

Date of Completion

January 2007


Biology, Biostatistics|Statistics|Biology, Bioinformatics




Analysis of gene expression is one of the main research areas of bioinformatics. Advances in molecular and computational biology have produced many powerful, high-throughput methods for the analysis of differential gene expression. These methods have played an important role in fields ranging from cell and developmental biology to drug development and pharmacogenetics, where they are used to identify genes that are critical for a developmental process or the molecular events that are associated with drug treatment. Although these high-throughput tools offer rich biological information, they are highly error-prone because many genes are monitored at the same time with a relatively small sample size. To minimize the effect of experimental contamination and to analyze these data efficiently, this dissertation develops several Bayesian methods for high-throughput data from two frequently used biological techniques: oligonucleotide microarrays (Lockhart et al., 1996) and expressed sequence tag (EST) sampling (Adams et al., 1991).

First, we develop a Bayes factor approach to compare gene expression intensities across two biological conditions using oligonucleotide microarray data. We use the Bayes factor because it allows us to compare two population distributions. To adapt Bayes factors to microarray data analysis, we propose a new calibration approach that weighs two types of prior predictive error probabilities differently for each gene while controlling the overall error rates for all genes. Moreover, a new gene selection algorithm based on the calibration approach is developed and its properties are examined. Several simulations show that the proposed method has a smaller false discovery rate (FDR) and a smaller false non-discovery rate (FNDR) than several existing methods.
Finally, a real dataset from an Affymetrix microarray experiment designed to identify genes associated with mature osteoblast differentiation is used to further illustrate the proposed methodology.

Second, we propose a multinomial nonlinear mixture Dirichlet model to fit the expression levels of EST data observed from multiple libraries of multiple tissue types. The prior properties of the proposed model are examined in detail, and efficient computational algorithms are provided. A novel gene selection algorithm is developed to detect differentially expressed genes based on the evaluated abundance levels of the genes considered. Simulation studies and a real EST dataset are used to illustrate the proposed model and assess its efficiency.

Third, to provide a direct measure of gene expression that summarizes all libraries of the same tissue type, we propose two new hierarchical multinomial nonlinear mixture Dirichlet models by adding class-dependent or class-independent latent library gene expressions under each tissue type. Computational algorithms are explored to fit these two models to the EST data. Prior properties of these models and the relationships among them are examined. The dissertation closes with a brief discussion of future research in Chapter 5.