Date of Completion

4-1-2016

Embargo Period

3-31-2016

Keywords

planted motif search, suffix arrays, pattern matching with k mismatches

Major Advisor

Prof. Sanguthevar Rajasekaran

Associate Advisor

Prof. Ion Mandoiu

Associate Advisor

Prof. Yufeng Wu

Field of Study

Computer Science and Engineering

Degree

Doctor of Philosophy

Open Access

Open Access

Abstract

This thesis studies the following problems:

1. Planted Motif Search. Discovering patterns in biological sequences is a crucial process that has resulted in the determination of open reading frames, gene promoter elements, intron/exon splicing sites, SH RNAs, etc. We study the (l, d) motif search problem or Planted Motif Search (PMS). PMS receives as input n strings and two integers l and d. It returns all sequences M of length that occur in each input string, where each occurrence differ from M in at most d positions. Another formulation is quorum PMS (qPMS), where M appears in at least q% of the strings. We developed qPMS9, an efficient parallel exact qPMS algorithm for DNA and protein datasets.

2. Suffix Array Construction. The suffix array is a data structure that finds numerous applications in string processing problems for both linguistic texts and biological data. The suffix array consists of the sorted suffixes of a string. There are several linear time suffix array construction algorithms known in the literature. However, one of the fastest algorithms in practice has a worst case run time of O(n ^ 2 ). We developed an efficient algorithm called RadixSA that has a worst case run time of O(n log n) and is one of the fastest algorithms to date. RadixSA introduces an idea that may find independent applications as a speedup technique for other algorithms.

3. Pattern Matching with Mismatches. We consider several variants of the pattern matching with mismatches problem. Given a text T = t 1 t 2 · · · t n and a pattern P = p 1 p 2 · · · p m , we investigate the following problems: 1) Pattern matching with mismatches: for every alignment i, 1 ≤ i ≤ n − m + 1 output the distance between P and t i t i+1 · · · t i+m−1 , and 2) Pattern matching with k mismatches: output those alignments i where the distance is at most k. The distance metric used is the Hamming distance. Variants of these problems allow for wild cards in the text or the pattern. For these problems we offer novel deterministic, randomized and approximation algorithms.

Source code relevant to these results is available at https://github.com/mariusmni/.

COinS