Causal Inference and Matrix Completion with Correlated Incomplete Data

Date

2023-01-19

Authors

Sun, Zhaohan

Advisors

Zhu, Yeying
Dubin, Joel

Publisher

University of Waterloo

Abstract

Missing data problems are frequently encountered in biomedical research, the social sciences, and environmental studies. When data are missing completely at random, a complete-case analysis may be the easiest approach. However, when data are not missing completely at random, ignoring the missing values can yield biased estimators. Over the last two decades, many methods have been developed for handling missing data, including likelihood-based methods, imputation methods, and Bayesian approaches. Matrix completion is one imputation approach that has been widely discussed in the missing data literature. In longitudinal settings, however, little work has used covariate information to recover the outcome matrix via matrix completion when the response is subject to missingness.

Chapter 1 introduces the basic definitions and concepts for different types of correlated data, and reviews the matrix completion algorithms and semiparametric approaches used to handle missingness in correlated data analysis. It also presents the definitions of robust estimation and of interference in causal inference.

In Chapter 2, we consider the prediction of missing responses in a longitudinal dataset via matrix completion. We propose a fixed effects longitudinal low-rank model that incorporates both subject-specific and time-specific covariates. The missingness mechanism is allowed to be missing at random, and inverse probability weighting is used to debias the traditional quadratic loss from the matrix completion literature. To solve the optimization problem, we propose a two-step optimization algorithm that provides good statistical properties for the estimation of the fixed effects and the low-rank term. In the theoretical investigation, we present non-asymptotic error bounds for the fixed effects and the low-rank term. We illustrate the finite-sample performance of the proposed algorithm via simulation studies and apply our method to a COVID-19 dataset and a PM2.5 emissions dataset.

In Chapter 3, we consider the partial interference setting, in which the population can be partitioned into clusters such that the outcome of each unit may depend on the interventions received by other units in the same cluster, but not on those received by units in other clusters. We also assume that the confounders are subject to nonignorable missingness. We propose three distinct consistent estimators for the direct, indirect, total, and overall effects of the intervention on the outcome, and derive the corresponding asymptotic results. A comprehensive simulation study investigates the finite-sample properties of the proposed estimators. We illustrate the proposed methods by analyzing data from the Acid Rain Program, which was launched to reduce air pollution in the USA by encouraging the installation of scrubbers on power plants, and in which the records of some operating characteristics of the power-generating facilities are subject to missingness.

In Chapter 4, we focus on the estimation of network causal effects. Under the setting of nonignorable missing confounders, we develop a multiply robust estimation procedure that gains extra protection against model misspecification. Compared with the doubly robust estimators proposed in Chapter 3, the proposed multiply robust estimators are consistent if either one pair of models for the propensity score of treatment and the missingness mechanism, or the joint model of the confounders and the outcome, is correctly specified. We investigate the finite-sample performance of the proposed methods under different missingness rates and cluster sizes, and further illustrate the methods with the same real data used in Chapter 3.

Chapter 5 concludes the thesis and discusses future work. Section 5.1 summarizes the contributions of each chapter. Section 5.2 discusses an extension of Chapter 2 that investigates the construction of confidence intervals for the low-rank term and the estimated fixed effects. Finally, Section 5.3 briefly discusses potential extensions of Chapters 3 and 4 to a more general setting.
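
As a rough illustration of the inverse-probability-weighted quadratic loss mentioned for Chapter 2, the sketch below runs a generic proximal-gradient iteration with singular value thresholding on simulated data. It is not the thesis's two-step algorithm and it omits the fixed effects entirely. The function names (`svt`, `objective`, `ipw_complete`), the nuclear-norm penalty, the tuning value `lam`, and the use of known observation probabilities are all assumptions made for illustration; in practice the probabilities would have to be estimated.

```python
import numpy as np

def svt(A, tau):
    """Soft-threshold the singular values of A (prox of tau * nuclear norm)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def objective(M, Y, observed, prob, lam):
    """IPW-weighted quadratic loss over observed entries plus nuclear-norm penalty."""
    W = np.where(observed, 1.0 / prob, 0.0)   # weight 1/prob on observed cells
    Y0 = np.where(observed, Y, 0.0)           # unobserved cells never contribute
    fit = 0.5 * np.sum(W * (Y0 - M) ** 2)
    return fit + lam * np.linalg.svd(M, compute_uv=False).sum()

def ipw_complete(Y, observed, prob, lam=1.0, n_iter=200):
    """Proximal gradient descent on the weighted objective (illustrative only)."""
    W = np.where(observed, 1.0 / prob, 0.0)
    Y0 = np.where(observed, Y, 0.0)
    step = 1.0 / W.max()                      # 1 / Lipschitz constant of the gradient
    M = np.zeros_like(Y0)
    for _ in range(n_iter):
        grad = W * (M - Y0)                   # gradient of the weighted quadratic loss
        M = svt(M - step * grad, step * lam)
    return M

# Simulated example: rank-2 signal, missingness driven by a subject-level covariate.
rng = np.random.default_rng(0)
n, T, r = 60, 30, 2
truth = rng.normal(size=(n, r)) @ rng.normal(size=(r, T))
Y = truth + 0.1 * rng.normal(size=(n, T))
x = rng.normal(size=(n, 1))                   # hypothetical covariate
prob = np.repeat(1.0 / (1.0 + np.exp(-(1.0 + 0.5 * x))), T, axis=1)
observed = rng.random((n, T)) < prob          # True where the response is observed

M_hat = ipw_complete(Y, observed, prob, lam=1.0)
print(objective(np.zeros_like(Y), Y, observed, prob, 1.0),
      objective(M_hat, Y, observed, prob, 1.0))
```

The weights upweight observed cells that had a low chance of being observed, so that the weighted loss over observed cells approximates, in expectation, the loss over the full matrix. The step size is chosen so that the objective cannot increase from one iteration to the next.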

Keywords

Missing data, Causal inference
