"Efficient Techniques for Big Data Processing in Data Integration and M" by Tian Mi

Doctoral Dissertations

Title

Efficient Techniques for Big Data Processing in Data Integration and Motif Search

Authors

Tian Mi, University of Connecticut - Storrs

Date of Completion

7-19-2013

Embargo Period

7-19-2013

Major Advisor

Sanguthevar Rajasekaran

Associate Advisor

Ion Mandoiu

Associate Advisor

Yufeng Wu

Associate Advisor

Reda Ammar

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Campus Access

Abstract

The rapid growth of data in bioinformatics and biomedical informatics brings new challenges to these areas. In this thesis, we present efficient computational algorithms for big data processing in data integration and motif search.

Data integration, or record linkage, is the problem of identifying records that pertain to the same entity across different data sources in the absence of a global identifier. For instance, the same individual may have separate records with different healthcare providers. Several algorithms proposed in the literature are adept at integrating records from two datasets. However, their limitations show up with multiple (more than two) data sources, and more often than not we have to deal with many more than two datasets. We propose efficient algorithms based on hierarchical clustering to handle massive data from multiple sources.
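To make the record linkage setting concrete, the following is a minimal sketch of threshold-based single-linkage clustering over records pooled from several sources. It is not the algorithm developed in the dissertation: the function names, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.85 threshold are all assumptions chosen for illustration, and the all-pairs comparison would not scale to massive data without further techniques.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Similarity in [0, 1] between two records given as tuples of strings."""
    return SequenceMatcher(None, " ".join(a).lower(), " ".join(b).lower()).ratio()

def link_records(records, threshold=0.85):
    """Group record indices into clusters (hypothetical helper).

    Single linkage with a fixed threshold: two clusters are merged as soon
    as any cross pair of records is at least `threshold` similar, which is
    the same as taking connected components of the similarity graph.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair of records, regardless of which source it came from.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up records, imagined as coming from three different providers.
records = [
    ("John", "Smith", "1970-01-02"),
    ("Jon", "Smith", "1970-01-02"),
    ("John", "Smyth", "1970-01-02"),
    ("Mary", "Jones", "1985-06-30"),
    ("Mary", "Jones", "1985-06-30"),
]
clusters = link_records(records)
```

Here each cluster is a list of record indices believed to refer to the same individual. Because all sources are pooled before clustering, the sketch handles any number of sources in the same way it handles two.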

In motif prediction, minimotifs (also called Short Linear Motifs) are short contiguous peptide pieces of proteins that have a known biological function. Minimotif Miner (MnM) (http://mnm.engr.uconn.edu) is a computational minimotif prediction tool that analyzes protein queries for the presence of minimotifs. The basic algorithm employs sequence matching and checks whether any of the experimentally validated motifs can be located in the query. It then uses a series of methods (known as filters) to eliminate possible false-positive predictions. Since the initial version of MnM, the MnM database has grown rapidly and the number of minimotifs has increased from 462 to 294,933. This growth has also resulted in more false positives in our predictions. In our work, we have developed novel filters that address this problem using knowledge of cellular function and molecular function. Together with other filters based on protein-protein interaction, frequency score, and surface prediction score, we have developed a computational combination of individual filters that significantly increases the accuracy of minimotif prediction.
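The match-then-filter workflow described above can be sketched as follows. This is only an illustration under stated assumptions and not MnM's implementation: the two motif patterns, the toy filter functions, the weights, and the cutoff are hypothetical, and a weighted average is just one simple way to combine filter scores.

```python
import re

# Illustrative motif patterns only ('.' matches any residue);
# these are not entries from the MnM database.
MOTIFS = {"PxxP": r"P..P", "RxxS": r"R..S"}

def find_candidates(sequence, motifs=MOTIFS):
    """Return (motif name, start index, matched text) for every occurrence."""
    hits = []
    for name, pattern in motifs.items():
        # A lookahead is used so that overlapping occurrences are all reported.
        for m in re.finditer(f"(?=({pattern}))", sequence):
            hits.append((name, m.start(), m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])

def combine_filters(hits, filters, weights, cutoff=0.5):
    """Keep hits whose weighted-average filter score reaches the cutoff.

    Each filter maps a hit to a score in [0, 1]; higher means the hit is
    more likely to be a true positive.
    """
    total = sum(weights)
    kept = []
    for hit in hits:
        score = sum(w * f(hit) for f, w in zip(filters, weights)) / total
        if score >= cutoff:
            kept.append((hit, score))
    return kept

# Toy stand-ins for real filters such as a frequency or surface score.
def prefers_pxxp(hit):
    return 1.0 if hit[0] == "PxxP" else 0.0

def prefers_late_position(hit):
    return 1.0 if hit[1] >= 5 else 0.0

query = "MAPSTPRKLSPEEP"  # made-up protein fragment
candidates = find_candidates(query)
predictions = combine_filters(
    candidates, [prefers_pxxp, prefers_late_position], [1.0, 1.0], cutoff=0.75
)
```

Sequence matching alone reports three candidates in this query, and the combined filters keep only the one that both toy filters favor, which mirrors how filtering trades raw matches for fewer false positives.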

In addition, we studied a crucial fundamental operation in bioinformatics and biomedical informatics: the external, or out-of-core, selection problem. The selection problem is to find the i-th smallest element among a given set of input elements. ‘Out-of-core’ refers to the case where the number of input elements is much larger than what the core memory can hold. Applications include noise reduction (e.g., median filters) in signal or image processing, high-breakdown regression in robust statistics, clustering, neural networks, data mining, etc. These applications play an important role in computational biological science. We propose a novel algorithm that takes no more than (2+epsilon) passes over the data (epsilon being a very small fraction) and compare our algorithm with some of the best existing algorithms.
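As a rough illustration of multi-pass selection when the input does not fit in memory, the sketch below alternates a sampling pass with a counting pass. It is a generic sampled-pivot scheme and not the dissertation's algorithm, and it makes no attempt to meet the (2+epsilon) pass bound. The function name, the `read_pass` callable (standing in for one sequential scan of a file), the `memory` limit, and the margin heuristic are all assumptions; the data are assumed to be finite numbers.

```python
import random

def select_out_of_core(read_pass, i, memory=1000, seed=0):
    """Return the i-th smallest (1-based) element yielded by read_pass().

    At most about `memory` elements are held at a time. The answer is
    always kept inside the open interval (lo, hi); `below` counts the
    elements that are <= lo.
    """
    rng = random.Random(seed)
    lo, hi, below = float("-inf"), float("inf"), 0
    while True:
        # Sampling pass: reservoir-sample the elements inside (lo, hi).
        sample, inside = [], 0
        for x in read_pass():
            if lo < x < hi:
                inside += 1
                if len(sample) < memory:
                    sample.append(x)
                else:
                    j = rng.randrange(inside)
                    if j < memory:
                        sample[j] = x
        if not 1 <= i - below <= inside:
            raise ValueError("rank i is out of range")
        sample.sort()
        if inside <= memory:
            return sample[i - below - 1]  # the whole bracket fit in memory
        # Pick two pivots around the estimated position of the target.
        est = (i - below - 1) * memory // inside
        margin = int(memory ** 0.5) + 1
        p1 = sample[max(0, est - margin)]
        p2 = sample[min(memory - 1, est + margin)]
        # Counting pass: exact counts around the pivots, buffering (p1, p2).
        less1 = eq1 = mid = eq2 = 0
        buf = []
        for x in read_pass():
            if x < p1:
                less1 += 1
            elif x == p1:
                eq1 += 1
            elif x < p2:
                mid += 1
                if mid <= memory:
                    buf.append(x)
            elif x == p2:
                eq2 += 1
        if i <= less1:
            hi = p1                       # answer lies left of p1
        elif i <= less1 + eq1:
            return p1
        elif i <= less1 + eq1 + mid:
            lo, hi, below = p1, p2, less1 + eq1
            if mid <= memory:
                return sorted(buf)[i - below - 1]
        elif i <= less1 + eq1 + mid + eq2:
            return p2
        else:
            lo, below = p2, less1 + eq1 + mid + eq2

# Usage: an in-memory list stands in for a large file on disk.
data = list(range(10000))
random.Random(1).shuffle(data)
median = select_out_of_core(lambda: iter(data), 5000, memory=100)
```

Each round excludes the pivots from the bracket, so the bracket strictly shrinks and the loop terminates even with many duplicate values. When the pivots bracket the target tightly enough, the answer is returned after one sampling pass and one counting pass.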

Recommended Citation

Mi, Tian, "Efficient Techniques for Big Data Processing in Data Integration and Motif Search" (2013). Doctoral Dissertations. 168.
https://digitalcommons.lib.uconn.edu/dissertations/168
