Date of Completion

5-9-2014

Embargo Period

11-5-2014

Major Advisor

Sanguthevar Rajasekaran

Co-Major Advisor

Reda A. Ammar

Associate Advisor

Ion Mandoiu

Associate Advisor

Jinbo Bi

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

Humanity faces diverse health challenges, such as the rising incidence of metabolic disease, rapid aging, and increasing antibiotic resistance. Most diseases involve many genes in complex interactions, as well as environmental influences that are often not well understood. High-throughput methods for genome sequencing, transcript measurement, and protein measurement have been developed to address these challenges. A number of disease biomarkers have been identified as a result of an increased understanding of cellular functions. The observation of such systems-level cellular behavior has naturally extended to the metabolite level, leading to the study of metabolomics. Measurement of the metabolites in a biological sample provides a snapshot of the physiology of the cell. The study of metabolites can help assign biochemical functions to so-called orphan genes (genes that cannot be ascribed a function by sequence analogy) and validate them as molecular targets for therapeutic intervention. Integration of metabolomics data with other omics data will provide a more complete picture of the functioning of organisms. Due to the chemical diversity of metabolites, the identification process in metabolomics is currently less advanced than that in proteomics and transcriptomics. Development of a computational workflow to improve and accelerate metabolite identification and biochemical pathway reconstruction is required for metabolomics to increase its impact in systems biology. The goal of this thesis is to design, develop, and validate methods for identifying metabolite structures and for defining their biochemical functions by predicting their metabolic pathway associations. First, I propose BioSM, a cheminformatics tool that uses known endogenous mammalian biochemical compounds and graph matching methods to identify endogenous mammalian biochemical structures in chemical structure space.
The results of a comprehensive set of empirical experiments suggest that BioSM identifies endogenous mammalian biochemical structures with high accuracy (95%). In addition, results suggest that approximately 13% of PubChem compounds are mammalian biochemicals. Thus, BioSM may be useful for searching large chemical databases in metabolomics applications where the number of potential false positives is very large. BioSM is freely available at http://metabolomics.pharm.uconn.edu. A major downside of BioSM, despite its encouraging results, was its need to exhaustively search all known biochemical structures before making a decision about the molecular structure under investigation, which resulted in an undesirably high run time. To address this concern, I introduce BioSMXpress, designed and developed as an enhancement to BioSM. BioSMXpress is, on average, 8 times faster than BioSM without compromising the quality of the predictions made. BioSMXpress will be an extremely useful tool in the timely identification of unknown biochemical structures in metabolomics. Finally, I present TrackSM, a bioinformatics tool designed to predict the metabolic pathway classes, as well as the individual pathways, with which small molecules might be associated, based only on their molecular structures. Validation experiments show that TrackSM is capable of associating 93% of the structures with their correct pathway classes as defined by KEGG and 88% of them with the correct individual KEGG pathway. These impressive results suggest that TrackSM may be a valuable tool to aid in recognizing the biochemical functions of small molecules.