Two-sample Inference, Order Determination, and Data Integration for Functional Data

Zhang, Chi

Date

2026-01-07

Authors

Zhang, Chi

Advisor

Sang, Peijun
Qin, Yingli

Publisher

University of Waterloo

Abstract

Functional data analysis has gained increasing prominence in modern statistics, largely due to advances in data collection technologies. It provides a nonparametric framework for analyzing discrete observations obtained from realizations of a continuous random function, often defined over time or space. In this thesis, we focus on three distinct problems, each reflecting a different aspect of functional data analysis.
In Chapter 2, we address the problem of comparing mean functions between two groups of sparse functional data within the framework of a reproducing kernel Hilbert space. The proposed method is well suited to sparsely and irregularly sampled functional data. Traditional approaches often assume homogeneous covariance structures across groups, an assumption that is difficult to justify in practice. To circumvent this limitation, we first develop a novel linear approximation for the mean estimator, which naturally leads to its desirable pointwise limiting distributions. Furthermore, we establish the weak convergence of the mean estimator, enabling the construction of a test statistic for the mean differences. The finite-sample performance of our method is demonstrated through extensive simulations and two real-world applications.
In Chapter 3, we study the problem of determining the number of eigenpairs to retain in functional principal component analysis, a problem commonly referred to as order determination. When a covariance function admits a finite representation, the challenge becomes estimating the rank of the corresponding covariance operator. While this problem is straightforward when the full trajectories of functional data are available, in practice functional data are typically collected discretely and are contaminated by measurement error. Such contamination introduces a ridge along the diagonal of the empirical covariance function, obscuring the true rank. We develop a novel procedure to identify the true rank of the covariance operator by leveraging information from both the eigenvalues and the eigenfunctions. By incorporating smoothing techniques to accommodate the nonparametric nature of functional data, the method is applicable to functional data collected at random, subject-specific points. Extensive simulation studies demonstrate the excellent performance of our approach across a wide range of settings: it outperforms commonly used information-criterion-based methods and remains effective even in high-noise scenarios. We further illustrate our method with two real-world data examples.
In Chapter 4, we investigate the integration of multi-source functional data to extract a subspace that captures the variation shared across sources. In practice, data collection procedures often follow source-specific protocols. Directly averaging sample covariance operators across sources implicitly assumes homogeneity, which may lead to biased recovery of both shared and source-specific variation structures. To address this issue, we propose a projection-based data integration method that explicitly separates the shared and source-specific subspaces. The method first estimates source-specific projection operators via smoothing to accommodate the nonparametric nature of functional data. The shared subspace is then isolated by examining the eigenvalues of the averaged projection operator across all sources. If a source-specific subspace is of interest, we re-project the associated source-specific covariance estimator onto the subspace orthogonal to the estimated shared subspace, and estimate the source-specific subspace from the resulting projection. We further establish the asymptotic properties of both the shared and source-specific subspace estimators. Extensive simulation studies demonstrate the effectiveness of the proposed method across a wide range of settings. Finally, we illustrate its practical utility with an application to air pollutant data.
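
The idea behind the Chapter 4 step of examining the eigenvalues of the averaged projection operator can be illustrated with a small finite-dimensional sketch. This is not the thesis's estimator: it assumes the source-specific projections are known exactly as matrices on a grid of 8 points (no smoothing, no noise), and the function names `projector` and `shared_subspace` and the cut-off `threshold=0.9` are hypothetical choices made for illustration. It relies only on the fact that a direction lying in every source's subspace is an eigenvector of the averaged projection with eigenvalue 1, while a direction present in only some sources has a strictly smaller eigenvalue.

```python
import numpy as np

def projector(basis):
    """Orthogonal projection matrix onto the column span of `basis`."""
    q, _ = np.linalg.qr(basis)  # orthonormalise the columns
    return q @ q.T

def shared_subspace(projectors, threshold=0.9):
    """Eigen-decompose the averaged projector.

    Returns all eigenvalues (ascending) and the eigenvectors whose
    eigenvalues exceed `threshold` (a hypothetical cut-off, not the
    thesis's selection rule).
    """
    avg = sum(projectors) / len(projectors)
    eigvals, eigvecs = np.linalg.eigh(avg)  # symmetric input, ascending order
    return eigvals, eigvecs[:, eigvals > threshold]

# Toy setup (assumed): an orthonormal basis of R^8 from a random rotation.
rng = np.random.default_rng(0)
p = 8
rotation, _ = np.linalg.qr(rng.standard_normal((p, p)))

# Two directions are shared by all three sources.
shared = rotation[:, :2]
# Source-specific directions; column 3 appears in two of the three sources.
specific = [rotation[:, [2, 3]], rotation[:, [3, 4]], rotation[:, [5, 6]]]

# Each source's subspace is spanned by the shared and its specific directions.
projectors = [projector(np.hstack([shared, s])) for s in specific]

eigvals, shared_hat = shared_subspace(projectors)
print(np.round(eigvals, 3))
print(shared_hat.shape)
```

In this toy setup the averaged projector has eigenvalue 1 twice (the shared directions), 2/3 once (the direction present in two sources), 1/3 four times, and 0 once, so the cut-off keeps exactly the two shared directions. With estimated projections the eigenvalues would only be near these values, which is why a data-driven selection rule is needed in practice.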