Computational Methods for Compositional Epistasis Detection

Cheng, Lu

Files

Cheng_Lu.pdf (6.44 MB)

Date

2022-04-19

Authors

Cheng, Lu

Advisor

Zhu, Mu

Publisher

University of Waterloo

Abstract

In genetics, the term “epistasis” refers to the phenomenon in which the effect of one gene or single-nucleotide polymorphism (SNP) depends on the presence of others. Many forms of epistasis exist, and our understanding of them is limited. In recent years, the failure to replicate single-locus effects in genome-wide association studies (GWAS) has motivated the exploration of epistasis in complex human disease. This thesis is therefore dedicated to computational approaches for detecting two-way compositional epistasis (SNP-SNP interaction). Epistasis of this sort is best described by disease models, which can be understood simply as patterns of disease probability associated with the genotype combinations of a SNP-pair. Because the epistasis detection problem requires determining a proper disease model to capture the compositional epistasis effect, it is more complicated than a typical variable selection task. Three projects are pursued in this thesis. The first two target epistasis characterized by the set of “two-locus, two-allele, two-phenotype and complete-penetrance” (TTTC) disease models, and the third extends to more general epistasis. There are theoretically 2^9 = 512 TTTC disease models. For a given SNP-pair, the first step is to find a proper TTTC model to capture its epistasis effect. Existing methods that use the data to determine the best-fitting disease model prior to screening are found to be potentially too greedy. Motivated by this, the first project proposes a less greedy strategy that limits the search over disease models to a set of prototypes determined a priori. Specifically, a distance metric is defined and used to cluster all disease models, and a “representative” from each cluster is then selected to form the set of prototypes. Compared with existing approaches, the proposed method provides a more satisfying balance between precision and recall in epistasis detection.
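The count of 2^9 = 512 TTTC models can be checked with a short sketch. It assumes each model is represented as a 3x3 grid of 0/1 risk labels, indexed by the genotypes (coded 0, 1, 2) of the two SNPs. The function names are hypothetical, and the Hamming distance shown is only an illustrative stand-in: the abstract does not state which distance metric the thesis uses for clustering.

```python
from itertools import product

GENOTYPES = (0, 1, 2)  # assumed coding: minor-allele count at a biallelic SNP


def enumerate_tttc_models():
    """Yield every two-level model as a 3x3 tuple of 0/1 risk labels.

    Entry [a][b] is 1 when genotype combination (a, b) is labelled
    high-risk and 0 when it is labelled low-risk.
    """
    for bits in product((0, 1), repeat=9):
        yield tuple(tuple(bits[3 * a:3 * a + 3]) for a in GENOTYPES)


def hamming_distance(m1, m2):
    """Count the genotype combinations on which two models disagree.

    Illustrative assumption only; not the metric defined in the thesis.
    """
    return sum(m1[a][b] != m2[a][b] for a in GENOTYPES for b in GENOTYPES)


models = list(enumerate_tttc_models())
print(len(models))  # number of distinct 3x3 binary labelings
```

A clustering step would then operate on the pairwise distances between these 512 grids, with one representative per cluster kept as a prototype.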
If one uses data to determine a best-fitting disease model for a pair of SNPs, the nominal statistical evidence of association between the SNP-pair and the disease outcome is inflated. The second project therefore aims to correct inflation of this type directly. To make this feasible for genome-wide studies, a first-order correction method is proposed that can be applied in practice at no additional computational cost. Simulation studies on two popular existing methods show that the correction is quite effective in improving overall epistasis detection. The TTTC disease models can be viewed as coding two risk levels, i.e., high and low risk. By comparison, other disease models code multiple risk levels and thereby capture more general epistasis patterns. The third project proposes two methods centered on epistasis detection using multi-level risk disease models. One method is inspired by the fused lasso within a regression-based framework and adopts a post-model-selection test to account for the inflation incurred during the disease model search. The other sequentially splits the genotype combinations of a SNP-pair and uses a stopping criterion to determine the final disease model; it then also applies a first-order correction to the test statistic to account for inflation. It is shown that the two methods, despite starting from entirely different frameworks, are equivalent in terms of the disease model search process. Subsequent simulation studies show that multi-level disease models achieve better detection efficiency, in terms of the balance between precision and recall, than two-level ones. In summary, uncovering the underlying mechanism of locus interaction effects is a rather complicated task, and efforts to do so are only beginning.
The epistasis detection methods in this thesis are practical at the genome-wide level and complement single-SNP screening in genome-wide association studies. Moreover, the first-order correction for inflation is simple and effective, making it practically valuable for epistasis detection methods that involve inflated test statistics.
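The source of the inflation can be illustrated with a minimal sketch. It assumes a Pearson chi-square test on the 2x2 table obtained by collapsing the nine genotype combinations into high and low risk. The counts are hypothetical, the function names are invented for illustration, and the sketch does not implement the thesis's first-order correction, which the abstract does not specify. It only shows that the statistic of a data-selected model can never fall below that of a model fixed in advance.

```python
from itertools import product


def chi2_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    # A degenerate table (an empty row or column) carries no evidence.
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def model_statistic(model, cases, controls):
    """Collapse the 9 genotype cells into high/low risk and test against disease status."""
    hi_case = sum(c for m, c in zip(model, cases) if m)
    hi_ctrl = sum(c for m, c in zip(model, controls) if m)
    lo_case = sum(cases) - hi_case
    lo_ctrl = sum(controls) - hi_ctrl
    return chi2_2x2(hi_case, hi_ctrl, lo_case, lo_ctrl)


# Hypothetical counts for the 9 genotype combinations (row-major order), not real data.
cases = [30, 22, 9, 25, 18, 7, 8, 6, 2]
controls = [28, 24, 8, 26, 17, 6, 9, 5, 3]

all_models = list(product((0, 1), repeat=9))   # all 512 two-level labelings
prespecified = (0, 0, 0, 0, 1, 1, 0, 1, 1)     # a model chosen before seeing the data

fixed_stat = model_statistic(prespecified, cases, controls)
best_stat = max(model_statistic(m, cases, controls) for m in all_models)

print(fixed_stat, best_stat)  # best_stat is the maximum over a set that includes the fixed model
```

Because the best-fitting statistic is a maximum over many candidate models, comparing it with the reference distribution of a single prespecified test overstates the evidence, which is the effect a correction has to account for.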