Date of Completion


Embargo Period



Reda Ammar, Sheida Nabavi, Sanguthevar Rajasekaran, Yufeng Wu

Field of Study

Computer Science and Engineering


Master of Science

Open Access

Open Access


The human body is made up of trillions of cells. Although all cells in the human body contain the same DNA sequence in their nuclei, each carries out its own function. Normally, human cells grow and divide to form daughter cells as the body needs them. When cells grow old or lose their ability to function properly, they die (in a highly organized process called apoptosis, or programmed cell death) and new cells take their place. Cancer is a disease caused by the uncontrolled division of abnormal cells in some part of the body, which disrupts this natural cycle of growth and death. Old or damaged cells survive when they should die, and new (abnormal) cells form when they are not needed. Some types of cancer form solid tumors, which are masses of tissue. Others, such as leukemias, do not form solid tumors. It is widely believed that cancer is caused by the accumulation of detrimental variations in the genome over the course of a lifetime. Variations can take several forms. A Single Nucleotide Polymorphism (SNP) is a change in a single base of the DNA. Indels are insertions or deletions of bases in the genome. Copy Number Variation (CNV) refers to duplicated or deleted segments of the genome. In most cases, a single type of mutation is not sufficient to induce cancer formation.

In this study, we investigated genomic datasets from a phase 1 clinical trial of triple-negative breast cancer (TNBC) and ovarian cancer patients. The goal is to identify genes that drive drug resistance. We developed data analysis pipelines to obtain genomic variations (somatic mutations and copy number variations) from the raw Whole Exome Sequencing (WES) data of 35 TNBC and ovarian cancer patients. In addition, we analyzed gene expression levels and gene fusions from the raw RNA-Seq reads for a subset of 16 patients. This study is an effort toward optimizing the integrative analysis of genomic datasets under certain limitations. The main limitation is the small number of samples in the clinical trial (as is the case in most clinical trials). Another challenge is finding an abstract way to analyze the raw sequencing data given its large size and heterogeneity. The novelty of our work lies in following a data science approach to answer such research questions. This unbiased, data-driven approach was successful in identifying genes that are most likely related to drug resistance. Our results will guide clinicians toward an in-depth study of the driver genes.

Major Advisor

Reda Ammar, Sheida Nabavi