Contributions to microarray data analysis

Date of Completion

January 2009






High-throughput gene expression technologies such as cDNA microarrays and oligonucleotide arrays enable the parallel analysis of thousands of genes. In this thesis, we first discuss five commonly used tests in detail and compare them in terms of sensitivity, specificity, and predictive power through simulations. Different tests (classifiers) generally identify different lists of significant genes, since each test operates under a specific set of assumptions and emphasizes different features of the data. We then propose several models with likelihood-based inference that synthesize the results of the different classifiers. We develop EM algorithms to obtain maximum likelihood estimates (MLEs), and Gibbs sampling algorithms to obtain Bayes estimates, of the probability that each gene is differentially expressed (DE) and of the sensitivity and specificity of each test, simultaneously. The Bayes estimates of sensitivity and specificity also provide a guide to the performance of each classifier.

While microarray projects rapidly determine gene catalogs, functional annotation of genes is still largely incomplete. Identifying the most activated pathways relevant to a particular phenotype, e.g. cancer, or to a defined stage of cell differentiation provides more insight into the functional relationships among genes. Chapter 4 describes two novel Bayesian models that integrate the microarray data with the putative pathway networks obtained from the KEGG database and with the medical journal statistics on activation or inhibition relationships between genes collected in PrimeDB.

We define the symmetric Kullback-Leibler divergence of a pathway and use it to identify the pathway(s) most supported by the microarray data. Markov chain Monte Carlo sampling is used to carry out the posterior computation for the hierarchical model.
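
To illustrate how the results of several classifiers might be synthesized with an EM algorithm, the following is a minimal sketch and not the models developed in the thesis. It assumes a simple two-class latent model in which the tests are conditionally independent given a gene's true DE status; this assumption, the function name `em_latent_class`, and all parameter names are hypothetical choices made for the example. Each test contributes a binary call per gene, and the algorithm alternates between estimating the prevalence, sensitivities, and specificities (M-step) and updating each gene's posterior probability of being DE (E-step).

```python
import numpy as np

def em_latent_class(calls, n_iter=500, tol=1e-8):
    """EM for a two-class latent model over binary test calls.

    calls: (n_genes, n_tests) array of 0/1 calls (1 = test flags the gene as DE).
    Assumes tests are conditionally independent given the true DE status
    (an assumption of this sketch, not necessarily of the thesis models).
    Returns (posterior DE probability per gene, sensitivity per test,
             specificity per test, estimated DE prevalence).
    """
    calls = np.asarray(calls, dtype=float)
    eps = 1e-6
    # Initialize posteriors from the fraction of tests flagging each gene,
    # clipped away from 0 and 1 so both classes start with some weight.
    post = np.clip(calls.mean(axis=1), 0.05, 0.95)

    for _ in range(n_iter):
        # M-step: weighted proportions given the current posteriors.
        pi = np.clip(post.mean(), eps, 1 - eps)
        sens = np.clip(post @ calls / post.sum(), eps, 1 - eps)
        spec = np.clip((1 - post) @ (1 - calls) / (1 - post).sum(), eps, 1 - eps)

        # E-step: log joint probability of each gene's calls under each class.
        log_de = np.log(pi) + calls @ np.log(sens) + (1 - calls) @ np.log(1 - sens)
        log_null = (np.log(1 - pi) + calls @ np.log(1 - spec)
                    + (1 - calls) @ np.log(spec))
        # Numerically stable form of 1 / (1 + exp(log_null - log_de)).
        new_post = np.exp(-np.logaddexp(0.0, log_null - log_de))

        converged = np.max(np.abs(new_post - post)) < tol
        post = new_post
        if converged:
            break

    return post, sens, spec, pi
```

Calling `em_latent_class(calls)` on a genes-by-tests matrix of 0/1 calls returns per-gene posterior probabilities together with per-test sensitivity and specificity estimates. A Bayesian analogue would place priors on these parameters and sample them, for example with a Gibbs sampler, rather than maximizing the likelihood.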