In genetics, the term “epistasis” refers to the phenomenon in which the effect of one gene
or single-nucleotide polymorphism (SNP) depends on the presence of others. Epistasis can take
many forms, and these remain poorly understood. In recent years, the failure to replicate
single-locus effects found in genome-wide association studies (GWAS) has motivated the
exploration of epistasis in complex human disease.
This thesis is thus dedicated to the study of computational approaches for two-way
compositional epistasis (SNP-SNP interaction) detection. Epistasis of this sort is best
described by disease models, which can be simply understood as disease probability patterns
associated with the genotype combinations of SNP-pairs. Because the epistasis detection
problem requires determination of proper disease models to capture the compositional epistasis
effect, it is more complicated than a typical variable selection task.
Three projects are pursued in this thesis. The first two target epistasis that is characterized
by the set of “two-locus, two-allele, two-phenotype and complete-penetrance” (TTTC) disease
models, and the third extends to more general epistasis.
There are theoretically 2^9 = 512 TTTC disease models, since each of the nine two-locus
genotype combinations is assigned one of two risk levels. For a given SNP-pair, the first step
of the problem is to find a proper TTTC model to capture its epistasis effect. It is found that
existing methods that use data to determine best-fitting disease models prior to screening
may be too greedy. Motivated by this, the first project proposes a less greedy strategy by
limiting the search of disease models to a set of prototypes. The prototypes are determined a
priori. Specifically, a distance metric is defined and used to cluster all disease models, and
then a “representative” from each cluster is selected to form the prototypes. Compared to
existing approaches, the proposed method achieves a better balance between
precision and recall in epistasis detection.
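
The model space described above can be made concrete with a minimal Python sketch. It
enumerates all 512 two-level models, collapses a 3x3 genotype table into a 2x2 table under one
model, and computes a Pearson chi-square statistic. The Hamming distance used here is only a
stand-in for illustration; the distance metric actually defined in the thesis is not specified
in this summary, and all function names are hypothetical.

```python
from itertools import product

# Each SNP has three genotypes, coded as the number of minor alleles (0, 1, 2).
CELLS = [(a, b) for a in range(3) for b in range(3)]  # 9 genotype combinations


def enumerate_tttc_models():
    """Return every assignment of the 9 cells to low (0) or high (1) risk."""
    return list(product((0, 1), repeat=len(CELLS)))


def hamming(model_a, model_b):
    """Number of cells on which two models disagree (illustrative metric only)."""
    return sum(x != y for x, y in zip(model_a, model_b))


def collapse(cases, controls, model):
    """Collapse per-cell case/control counts into a 2x2 table under a model.

    Returns [[high_cases, high_controls], [low_cases, low_controls]].
    """
    high_cases = sum(c for c, m in zip(cases, model) if m == 1)
    high_controls = sum(c for c, m in zip(controls, model) if m == 1)
    return [
        [high_cases, high_controls],
        [sum(cases) - high_cases, sum(controls) - high_controls],
    ]


def chi2_2x2(table):
    """Pearson chi-square statistic for a 2x2 table (0.0 if a margin is empty)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


models = enumerate_tttc_models()

# Hypothetical example model: high risk when both SNPs carry a minor allele.
example_model = tuple(int(a >= 1 and b >= 1) for a, b in CELLS)

# Toy counts with identical case and control distributions (no association).
toy_cases = [10] * 9
toy_controls = [10] * 9
toy_table = collapse(toy_cases, toy_controls, example_model)
toy_stat = chi2_2x2(toy_table)
```

Two of the 512 assignments (all cells low risk, all cells high risk) define no contrast, and
a model and its complement yield the same statistic, so the number of distinct tests is
smaller than 512.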
If one uses data to determine a best-fitting disease model for a pair of SNPs, the nominal
statistical evidence of association between the SNP-pair and the disease outcome is inflated.
Therefore, the second project aims to correct directly for inflation of this type. To make this
feasible for genome-wide studies, a first-order correction method is proposed that can be
applied in practice at no additional computational cost. Simulation studies on two popular
existing methods show that the correction is effective in improving overall epistasis
detection.
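
The inflation that motivates this correction can be illustrated with a small null simulation.
The sketch below does not implement the first-order correction itself, whose form is not given
in this summary. It only compares how often a nominal 5% test rejects when the disease model is
fixed in advance versus when the best-fitting two-level model is chosen from the data. The
uniform genotype-cell frequencies, the sample sizes, and all names are assumptions made for
illustration.

```python
import random
from itertools import product

CELLS = [(a, b) for a in range(3) for b in range(3)]
# All two-level models except the two trivial ones (all low or all high risk).
MODELS = [m for m in product((0, 1), repeat=9) if 0 < sum(m) < 9]
# A model fixed before seeing data: high risk when both SNPs carry a minor allele.
FIXED_MODEL = tuple(int(a >= 1 and b >= 1) for a, b in CELLS)
CHI2_1DF_95 = 3.841  # 0.95 quantile of the chi-square distribution with 1 d.f.


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def model_stat(cases, controls, model):
    """Chi-square statistic after collapsing the 9 cells under a model."""
    high_cases = sum(cases[i] for i in range(9) if model[i])
    high_controls = sum(controls[i] for i in range(9) if model[i])
    return chi2_2x2(high_cases, high_controls,
                    sum(cases) - high_cases, sum(controls) - high_controls)


def simulate_null(n_cases, n_controls, rng):
    """Draw genotype cells independently of disease status (no association)."""
    cases, controls = [0] * 9, [0] * 9
    for _ in range(n_cases):
        cases[rng.randrange(9)] += 1
    for _ in range(n_controls):
        controls[rng.randrange(9)] += 1
    return cases, controls


def rejection_rates(n_reps=200, n_cases=200, n_controls=200, seed=0):
    """Return (rate with fixed model, rate with data-selected best model)."""
    rng = random.Random(seed)
    fixed_hits = best_hits = 0
    for _ in range(n_reps):
        cases, controls = simulate_null(n_cases, n_controls, rng)
        fixed = model_stat(cases, controls, FIXED_MODEL)
        best = max(model_stat(cases, controls, m) for m in MODELS)
        fixed_hits += fixed > CHI2_1DF_95
        best_hits += best > CHI2_1DF_95
    return fixed_hits / n_reps, best_hits / n_reps
```

Calling `rejection_rates()` returns the two empirical rejection rates. Because the fixed model
is one of the candidates, the data-selected statistic can never be smaller than the fixed-model
statistic, so the second rate is at least as large as the first; the gap between them is the
inflation that a nominal 1 d.f. reference distribution ignores.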
The TTTC disease models can be viewed as coding two risk levels, i.e., high and low risk.
By contrast, other disease models code multiple risk levels and can therefore capture more
general epistasis patterns. The third project proposes two methods for epistasis detection
using multi-level risk disease models. One method is inspired by the
fused lasso under a regression-based framework, and adopts a post-model-selection test to
account for inflation incurred during the disease model search. The other sequentially
splits the genotype combinations of a SNP-pair and uses a stopping criterion to determine
the final disease model; it then applies a first-order correction to the test
statistic to account for inflation. It is shown that the two methods, although they start
from entirely different frameworks, are equivalent in terms of the disease model search process.
Subsequent simulation studies show that multi-level disease models achieve a better
balance between precision and recall in detection than two-level ones.
In summary, uncovering the underlying mechanism of locus interaction effects is a complicated
task, and efforts to do so are only beginning. The epistasis detection
methods in this thesis are practical at the genome-wide level and complement
single-SNP screening in genome-wide association studies. Moreover, the
first-order correction for inflation is simple and effective, which makes it valuable for
epistasis detection methods that involve inflated test statistics.