Discovery of Flexible Gap Patterns from Sequences
The human genome contains abundant motifs that are bound by particular biomolecules. These motifs are involved in the complex regulatory mechanisms of gene expression. The dominant mechanism behind the intricate patterns of gene expression is known as combinatorial regulation, in which multiple cooperating biomolecules bind within a nearby genomic region to produce a specific regulatory behavior. To decipher the combinatorial regulation at work in cellular processes, there is a pressing need to identify the co-binding motifs of these cooperating biomolecules in genomic sequences. Because the interaction distance between nearby cooperating biomolecules is highly flexible, flexible gaps appear between the component motifs of a co-binding motif. Many existing motif discovery methods cannot handle co-binding motifs with flexible gaps. Existing co-binding motif discovery methods are ineffective in dealing with the following problems: (1) co-binding motifs may not appear in a large fraction of the input sequences, (2) the lengths of the component motifs are unknown, and (3) the maximum range of the flexible gap can be large. As a result, the probabilistic approach is easily trapped in a local optimum. Although the deterministic approach may resolve these problems by allowing a relaxed motif template, it faces the challenges of exploring an enormous pattern space and handling a huge output. This thesis presents an effective and scalable method called DFGP, which stands for “Discovery of Flexible Gap Patterns”, for identifying co-binding motifs in massive datasets. DFGP follows the deterministic approach and uses flexible gap patterns to model co-binding motifs. A flexible gap pattern is composed of a number of boxes, with a flexible gap between each pair of consecutive boxes, where each box is a consensus pattern representing a component motif.
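To make the definition of a flexible gap pattern concrete, the following minimal Python sketch enumerates the occurrences of such a pattern in a single DNA sequence. It is an illustration only, not the DFGP algorithm. The function names (`box_to_regex`, `find_occurrences`), the use of IUPAC codes to write consensus boxes, and the representation of each gap as an inclusive `(min, max)` range are all assumptions made for this example.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes
# (an assumed way of writing a consensus box; not taken from the thesis).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def box_to_regex(box):
    """Translate one consensus box into a regular expression."""
    return "".join(IUPAC[ch] for ch in box.upper())


def find_occurrences(sequence, boxes, gaps):
    """Return every occurrence as a tuple of box start positions.

    boxes: consensus strings, one per component motif.
    gaps:  inclusive (min, max) gap lengths between consecutive boxes,
           so len(gaps) must equal len(boxes) - 1.
    """
    if len(gaps) != len(boxes) - 1:
        raise ValueError("need exactly one gap range between consecutive boxes")
    compiled = [re.compile(box_to_regex(b)) for b in boxes]
    results = []

    def extend(box_index, start, starts):
        # Pattern.match with a position only matches beginning at that position.
        m = compiled[box_index].match(sequence, start)
        if m is None:
            return
        starts = starts + [start]
        if box_index == len(boxes) - 1:
            results.append(tuple(starts))
            return
        lo, hi = gaps[box_index]
        # Try every allowed gap length before the next box.
        for gap in range(lo, hi + 1):
            nxt = m.end() + gap
            if nxt >= len(sequence):
                break
            extend(box_index + 1, nxt, starts)

    for start in range(len(sequence)):
        extend(0, start, [])
    return results


# Two boxes separated by a gap of 0 to 5 bases.
print(find_occurrences("TTACGTAAAGGCATT", ["ACGT", "GGCA"], [(0, 5)]))
```

In this sketch a wider gap range lets a single match of the first box pair with several matches of the second. For example, `find_occurrences("ACGTGGCAGGCA", ["ACGT", "GGCA"], [(0, 4)])` reports two occurrences that share the same first box, which hints at why the number of occurrences can grow quickly when the maximum gap is large.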
To address the computational challenge and the need to process the large output produced under a relaxed motif template, DFGP incorporates two redundancy reduction methods as well as an effective statistical significance measure for ranking patterns. The first reduction method is based on the proposed concept of representative patterns, which reduces the large set of consensus patterns used as boxes in existing deterministic methods to a much smaller yet informative set. The second method is based on the proposed concept of delegate occurrences, which reduces the redundancy among the occurrences of a flexible gap pattern. Extensive experimental results show that (1) DFGP significantly outperforms existing co-binding motif discovery methods in terms of both runtime and the ability to identify co-binding motifs, (2) the co-binding motifs found by DFGP in datasets reveal previously unknown biological insights, (3) the two redundancy reduction methods, based on representative patterns and delegate occurrences, significantly reduce the computational burden without sacrificing output quality, (4) the proposed statistical significance measures are robust and useful for ranking patterns, and (5) DFGP allows a large maximum distance for the flexible gap between component motifs and scales to massive datasets.