Topics in Study Design and Analysis Involving Incomplete Data

Yang, Ce

Date

2021-07-27

Authors

Yang, Ce

Advisor

Cook, Richard
Diao, Liqun

Publisher

University of Waterloo

Abstract

Incomplete data arise frequently in statistics, in various types and through various mechanisms, each of which can have a significant effect on statistical analysis and inference. This thesis tackles several statistical issues in study design and analysis involving incomplete data. The first half of the thesis deals with incomplete observation of the responses. In medical studies, events of interest are often subject to intermittent observation schemes; for example, they may be detected via periodic clinical examinations. As a result, the event is only known to have occurred within an interval, and the resulting interval-censored data hinder the application of numerous analysis tools. Although it is possible to assume that the event occurred at an endpoint or the midpoint of the interval, such ad hoc imputations are known to lead to invalid inferences. In Chapter 2, we propose appropriate imputations of such incomplete responses via censoring unbiased transformations and pseudo-observations, which facilitate the straightforward use of prevalent machine learning algorithms. The former technique helps preserve the conditional mean structure in the presence of censoring, and the latter originates from bias-corrected jackknife estimates. For a continuous response, both proposed imputations lead to regression tree models with the same expected L2 loss as those fitted from complete observations. Therefore, prediction and variable selection naturally follow. Unlike most survival trees in the literature, our proposed models do not rely on the widely made proportional hazards assumption. Furthermore, such models reduce to ordinary regression trees in the absence of censoring. Survivor function estimates for interval-censored data are required to employ the imputations; various semiparametric and nonparametric approaches are considered and compared. In particular, we examine the case of current status data in a separate section.
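The jackknife origin of pseudo-observations can be illustrated with a minimal sketch. This is not the thesis's procedure: for simplicity it assumes right-censored rather than interval-censored data, and it uses a hand-written Kaplan-Meier estimator in place of the survivor function estimators compared in Chapter 2. The function names `km_survival` and `pseudo_observations` are hypothetical. Each pseudo-observation is the jackknife quantity n·Ŝ(t) − (n − 1)·Ŝ₍₋ᵢ₎(t), where Ŝ₍₋ᵢ₎ is the estimate with individual i left out.

```python
import numpy as np


def km_survival(time, event, t):
    """Kaplan-Meier estimate of S(t) = P(T > t) from right-censored data.

    Assumption for illustration only: right censoring, not the
    interval censoring treated in the thesis.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = 1.0
    # Multiply the conditional survival factors over distinct event times <= t.
    for u in np.unique(time[event & (time <= t)]):
        at_risk = np.sum(time >= u)
        deaths = np.sum((time == u) & event)
        surv *= 1.0 - deaths / at_risk
    return surv


def pseudo_observations(time, event, t):
    """Jackknife pseudo-observations for S(t): n*S_hat - (n-1)*S_hat_without_i."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    n = len(time)
    full = km_survival(time, event, t)
    keep = ~np.eye(n, dtype=bool)  # row i is a mask that drops individual i
    leave_one_out = np.array(
        [km_survival(time[keep[i]], event[keep[i]], t) for i in range(n)]
    )
    return n * full - (n - 1) * leave_one_out
```

When no observation is censored, the Kaplan-Meier estimate is the empirical survivor function and each pseudo-observation equals the indicator that the individual's event time exceeds t, so a regression tree fitted to the pseudo-observations coincides with one fitted to the complete responses.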
The second half of the thesis addresses incomplete covariate data that are missing by design. The missingness is controlled by the investigators and arises from budgetary constraints on measuring an "expensive exposure variable" in real-life scenarios. We focus on the well-known two-phase studies, which exploit the response and inexpensive auxiliary information on the population to select a phase II sub-sample for the collection of the expensive covariate. In Chapter 3, we look into an adaptive two-phase design that avoids the need for external pilot data. Dividing the phase II sub-sampling into multiple interim stages, we employ conventional sampling to select a fraction of the phase II sub-sample; these individuals provide the information required to construct an optimal sub-sample from those remaining, which achieves maximum statistical efficiency subject to sampling constraints. Such adaptive two-phase designs naturally extend to multiple stages in phase II and are applicable when a surrogate of the exposure variable is available. Efficiency and robustness issues are investigated under various frameworks of analysis. As expected, the maximum likelihood approach that models the nuisance distribution tends to be more efficient, whereas inverse probability weighted estimating equations that avoid this tend to be more robust to misspecification of the nuisance covariate models. The conditional maximum likelihood approach, to our delight, is well-balanced between the two. Moreover, the desire to gain efficiency while maintaining a certain level of robustness further drives us to explore semiparametric methods in all the analyses and designs. From Chapter 4 onward, we turn to more complicated settings in which covariates are missing in a sequence of two-phase studies with multiple responses and sampling constraints conducted on a common platform.
For a given two-phase study, we expect to exploit not only information on the responses and auxiliary covariates at hand but also information passed on from earlier studies. We consider joint response models and perform secondary analyses of a new response using previously studied exposure variables. Moreover, the exposure variables acquired from earlier studies serve as pilot data to help construct an optimal selection model in an upcoming two-phase study. As we assess the balance between efficiency and robustness of the analysis methods, the potential misspecification of the joint response model warrants our attention. Finally, in Chapter 5, we note that the work can be extended to deal with two-phase response-dependent sampling with longitudinal data.
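The inverse probability weighted estimating equations mentioned for two-phase designs can be sketched in a few lines. This is an illustration under simplifying assumptions, not the thesis's implementation: a linear response model, a single expensive covariate, and phase II selection probabilities that are known by design and depend only on the response. The names `ipw_linear_fit` and `simulate_two_phase` and all numeric settings are hypothetical. The estimator solves Σᵢ (Rᵢ/πᵢ) zᵢ (yᵢ − zᵢᵀβ) = 0 with zᵢ = (1, xᵢ), where Rᵢ indicates selection into phase II and πᵢ is the selection probability.

```python
import numpy as np


def ipw_linear_fit(y, x, selected, prob):
    """Solve the IPW estimating equation for a linear model y = b0 + b1 * x.

    Only phase II individuals (selected == True) contribute, each weighted
    by the inverse of its known selection probability.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    selected = np.asarray(selected, dtype=bool)
    prob = np.asarray(prob, dtype=float)
    z = np.column_stack([np.ones(selected.sum()), x[selected]])
    w = 1.0 / prob[selected]
    lhs = z.T @ (w[:, None] * z)       # sum of w_i * z_i z_i'
    rhs = z.T @ (w * y[selected])      # sum of w_i * z_i y_i
    return np.linalg.solve(lhs, rhs)


def simulate_two_phase(n=2000, seed=0):
    """Hypothetical two-phase sample with response-dependent selection."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                      # expensive covariate
    y = 1.0 + 0.5 * x + rng.normal(size=n)      # response, known for everyone
    # Oversample individuals with extreme responses (illustrative design).
    prob = np.where(np.abs(y - np.median(y)) > 1.0, 0.8, 0.2)
    selected = rng.random(n) < prob
    x_obs = np.where(selected, x, np.nan)       # x is missing outside phase II
    return y, x_obs, selected, prob
```

A call such as `ipw_linear_fit(*simulate_two_phase())` returns the weighted intercept and slope estimates. No model for the distribution of the covariate is needed, which is the source of the robustness noted in the abstract; the price is that phase I individuals without the covariate contribute only through the weights.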