Finding Similar Protein Structures Efficiently and Effectively

Cui, Xuefeng

Files

Cui_Xuefeng.pdf (7.57 MB)

Date

2014-04-24

Authors

Cui, Xuefeng

Publisher

University of Waterloo

Abstract

To assess the similarities and the differences among protein structures, a variety of structure alignment algorithms and programs have been designed and implemented. We introduce a low-resolution approach and a high-resolution approach to evaluate the similarities among protein structures. Our results show that both approaches outperform state-of-the-art methods. For the low-resolution approach, we eliminate false positives by comparing both local similarity and remote similarity, with little compromise in speed. Two kinds of contact libraries (ContactLib) are introduced to fingerprint protein structures effectively and efficiently. Each contact group from the contact library consists of one local or two remote fragments and is represented by a concise vector. These vectors are then indexed and used to calculate a new combined hit-rate score that identifies similar protein structures. We tested our ContactLibs on the high-quality protein structure subset of SCOP30, which contains 3,297 protein structures. For each protein structure of the subset, we retrieved its neighbor protein structures from the rest of the subset. The best area under the ROC curve achieved by a ContactLib is as high as 0.960. This is a significant improvement over 0.747, the best result achieved by the state-of-the-art method, FragBag. For the high-resolution approach, our PROtein STructure Alignment method (PROSTA) relies on and verifies the fact that the optimal protein structure alignment always contains a small subset of aligned residue pairs, called a seed, such that the rotation and translation (ROTRAN) minimizing the RMSD of the seed yields both the optimal ROTRAN and the optimal alignment score. Thus, ROTRANs minimizing the RMSDs of small subsets of residues are sampled, and global alignments are calculated directly from the sampled ROTRANs.
Moreover, our method incorporates remote information and filters similar ROTRANs (or alignments) by clustering, rather than by an exhaustive method, to overcome the computational inefficiency. Our high-resolution protein structure alignment method, when applied to optimizing the TM-score and the GDT-TS score, produces significantly better results than state-of-the-art protein structure alignment methods. Specifically, if the highest TM-score found by TM-align is lower than 0.6 and the highest TM-score found by one of the tested methods is higher than 0.5, our alignment method tends to discover better protein structure alignments, with TM-scores up to 0.21 higher. In such cases, TM-align fails to find TM-scores higher than 0.5 with a probability of 42%; however, our alignment method fails the same task with a probability of only 2%. In addition, existing protein structure alignment scoring functions focus on atom coordinate similarity alone and ignore other important similarities, such as sequence similarity. Our scoring function can incorporate multiple similarities. Our results show that sequence similarity aids in finding high-quality protein structure alignments that are more consistent with HOMSTRAD alignments, which are protein structure alignments examined by human experts. When atom coordinate similarity alone fails to find alignments with any consistency to HOMSTRAD alignments, our scoring function remains capable of finding alignments highly similar to, or even identical to, HOMSTRAD alignments.
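
The seed idea in the abstract can be illustrated with a minimal sketch. This is not the PROSTA implementation. It assumes NumPy, uses the standard Kabsch (SVD) solution for the RMSD-minimizing ROTRAN of a seed, and scores each sampled ROTRAN with the standard TM-score formula. The function names (`kabsch_rotran`, `tm_score`, `best_seed_rotran`), the choice of consecutive residue windows as seeds, the fixed one-to-one residue pairing, and the 0.5 floor on `d0` are all simplifying assumptions made for illustration. Remote information and the clustering of similar ROTRANs are not modeled.

```python
import numpy as np


def kabsch_rotran(P, Q):
    """Return (R, t) minimizing the RMSD between R @ p_i + t and q_i.

    P and Q are (n, 3) arrays of paired coordinates (the seed).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)               # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = qc - R @ pc
    return R, t


def tm_score(P, Q, R, t, target_len=None):
    """TM-score of P superposed onto Q under a fixed residue pairing i <-> i."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    L = len(Q) if target_len is None else target_len
    # Standard length-dependent scale; the 0.5 floor is an assumption here.
    d0 = max(0.5, 1.24 * np.cbrt(L - 15) - 1.8)
    dist = np.linalg.norm(P @ R.T + t - Q, axis=1)
    return float(np.sum(1.0 / (1.0 + (dist / d0) ** 2)) / L)


def best_seed_rotran(P, Q, seed_len=4):
    """Sample one ROTRAN per consecutive seed window and keep the best score."""
    best_score, best_R, best_t = -1.0, None, None
    for start in range(len(P) - seed_len + 1):
        seed = slice(start, start + seed_len)
        R, t = kabsch_rotran(P[seed], Q[seed])
        score = tm_score(P, Q, R, t)        # score ALL pairs, not just the seed
        if score > best_score:
            best_score, best_R, best_t = score, R, t
    return best_score, best_R, best_t
```

In this sketch each seed is only four residue pairs, yet the ROTRAN it produces is evaluated against every pair, which mirrors the claim that a small seed can determine the global superposition. A fuller treatment would recompute the residue alignment under each sampled ROTRAN instead of holding the pairing fixed.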