High-Dimensional Statistical Inference and False Discovery Rate Control with Covariates

Date

2025-01-17

Advisor

Qin, Yingli
Liang, Kun

Publisher

University of Waterloo

Abstract

In this thesis, we focus on three statistical problems. First, we consider graph-based tests for differences between two high-dimensional distributions. Second, we investigate the estimation of multiple large covariance matrices and its application to high-dimensional quadratic discriminant analysis. Lastly, we focus on controlling the false discovery rate while incorporating complex auxiliary information.

Testing whether two samples are from a common distribution is an important problem in statistics. Friedman & Rafsky (1979) proposed a non-parametric multivariate distribution test based on the minimal spanning tree (MST). Recently, this test has been extended to various scenarios. However, as demonstrated in Chapter 2, these extensions are not sensitive to sparse alternatives. To address this, we propose a two-step testing procedure, IM-MST. Specifically, IM-MST first performs marginal screening that accounts for the dependence structure via energy distance, and then applies MST-based tests. IM-MST thus combines the strengths of non-parametric screening and MST-based tests. Simulation studies and real data applications are conducted to evaluate the numerical performance of the two-step procedure, demonstrating that IM-MST exhibits substantial power gains.

When estimating covariance matrices for data from two related categories, it is reasonable to assume that these covariance matrices share certain structural components. As a result, the precision matrix (the inverse of the covariance matrix) for each category can be decomposed into three parts: a common diagonal component, a common low-rank component, and a category-specific low-rank component. This decomposition can be motivated by a factor model, where some latent factors are common across the two categories while others are specific to individual categories. In Chapter 3, we propose a consistent joint estimation method for the two precision matrices, building on the work of Wu (2017). Furthermore, these estimators are applied to formulate a high-dimensional quadratic discriminant analysis (QDA) rule, for which we derive the convergence rate of the classification error.

In many genetic multiple testing applications, the signs of the test statistics provide important directional information. For example, in RNA-seq data analysis, a negative sign could suggest that the expression of the corresponding gene is potentially suppressed, while a positive sign could indicate a potentially elevated expression level. However, most existing procedures that control the false discovery rate (FDR) ignore such valuable information. In Chapter 4, we extend the covariate and direction adaptive knockoff procedure (Tian, 2020) by implementing powerful predictive functions. Through simulation studies and real data analysis, we show that our procedures are competitive with existing covariate-adaptive methods. The companion R package Codak is available.
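
As a rough illustration of the classical MST-based two-sample test of Friedman & Rafsky (1979) mentioned in the abstract, the sketch below pools the two samples, builds the MST on Euclidean distances, and counts the edges that join points from different samples; few such edges suggest the distributions differ. This is not the IM-MST procedure from the thesis. The function names, the choice of Euclidean distance, and the permutation-based p-value are assumptions made for this example only.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def cross_edge_count(edges, labels):
    """Number of MST edges whose endpoints carry different sample labels."""
    return int(np.sum(labels[edges[:, 0]] != labels[edges[:, 1]]))


def friedman_rafsky_test(x, y, n_perm=999, seed=0):
    """Illustrative MST edge-count test with a permutation p-value.

    x, y: arrays of shape (n_x, p) and (n_y, p).
    Assumes no duplicated points: SciPy treats zero entries of a dense
    distance matrix as missing edges.
    """
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    labels = np.concatenate([np.zeros(len(x), dtype=int),
                             np.ones(len(y), dtype=int)])

    # MST of the complete graph on the pooled sample (Euclidean distances).
    dist = squareform(pdist(pooled))
    mst = minimum_spanning_tree(dist).tocoo()
    edges = np.column_stack([mst.row, mst.col])

    observed = cross_edge_count(edges, labels)

    # The tree depends only on the pooled points, so under the null
    # hypothesis we can keep it fixed and permute the labels.
    perm_counts = np.array([
        cross_edge_count(edges, rng.permutation(labels))
        for _ in range(n_perm)
    ])

    # Small cross-sample edge counts are evidence against the null.
    p_value = (1 + np.sum(perm_counts <= observed)) / (n_perm + 1)
    return observed, float(p_value)
```

For example, calling `friedman_rafsky_test(x, y)` on two samples with clearly separated means should return a small edge count and a small p-value, whereas two samples drawn from the same distribution should typically give a p-value that is not small.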
