Statistical Methods for Incomplete Covariates and Two-Phase Designs

McIsaac, Michael

Date

2013-01-24T16:53:42Z

Authors

McIsaac, Michael

Publisher

University of Waterloo

Abstract

Incomplete data are a pervasive problem in health research, and as a result statistical methods enabling inference based on partial information play a critical role. This thesis explores estimation of regression coefficients and associated inferences when variables are incompletely observed. In the later chapters, we focus primarily on settings with incomplete covariate data which arise by design, as in studies with two-phase sampling schemes, as opposed to incomplete data which arise due to events beyond the control of the scientist. We consider the problem in which "inexpensive" auxiliary information can be used to inform the selection of individuals for collection of data on the "expensive" covariate. In particular, we explore how parameter estimation relates to the choice of sampling scheme. Efficient sampling designs are defined by choosing the optimal sampling criteria within a particular class of selection models under a two-phase framework. We compare the efficiency of these optimal designs to simple random sampling and balanced sampling designs under a variety of frameworks for inference.

As a prelude to the work on two-phase designs, we first review and study issues related to incomplete data arising due to chance. In Chapter 2, we discuss several models by which missing data can arise, with an emphasis on issues in clinical trials. The likelihood function is used as a basis for discussing different missing data mechanisms for incomplete responses in short-term and longitudinal studies, as well as for missing covariates. We briefly discuss common ad hoc strategies for dealing with incomplete data, such as complete-case analyses and naive methods of imputation, and we review more broadly appropriate approaches for dealing with incomplete data in terms of asymptotic and empirical frequency properties. These methods include the EM algorithm, multiple imputation, and inverse probability weighted estimating equations.
Simulation studies are reported which demonstrate how to implement these procedures and examine performance empirically. We further explore the asymptotic bias of these estimators when the nature of the missing data mechanism is misspecified. We consider specific types of model misspecification in methods designed to account for the missingness and compare the limiting values of the resulting estimators.

In Chapter 3, we focus on methods for two-phase studies in which covariates are incomplete by design. In the second phase of the two-phase study, subject to correct specification of key models, optimal sub-sampling probabilities can be chosen to minimise the asymptotic variance of the resulting estimator. These optimal phase-II sampling designs are derived and the empirical and asymptotic relative efficiencies resulting from these designs are compared to those from simple random sampling and balanced sampling designs. We further examine the effect on efficiency of utilising external pilot data to estimate parameters needed for derivation of optimal designs, and we explore the sensitivity of these optimal sampling designs to misspecification of preliminary parameter estimates and to the misspecification of the covariate model at the design stage. Designs which are optimal for analyses based on inverse probability weighted estimating equations are shown to result in efficiency gains for several different methods of analysis and are shown to be relatively robust to misspecification of the parameters or models used to derive the optimal designs. Furthermore, these optimal designs for inverse probability weighted estimating equations are shown to be well behaved when necessary design parameters are estimated using relatively small external pilot studies.

We also consider efficient two-phase designs explicitly in the context of studies involving clustered and longitudinal responses. Model-based methods are discussed for estimation and inference.
Asymptotic results are used to derive optimal sampling designs and the relative efficiencies of these optimal designs are again compared with simple random sampling and balanced sampling designs. In this more complex setting, balanced sampling designs are demonstrated to be inefficient, and it is not obvious when balanced sampling will offer greater efficiency than a simple random sampling design. We explore the relative efficiency of phase-II sampling designs based on increasing amounts of information in the longitudinal responses and show that the balanced design may become less efficient when more data are available at the design stage. In contrast, the optimal design is able to exploit additional information to increase efficiency whenever more data are available at phase-I.

In Chapter 4, we consider an innovative adaptive two-phase design which breaks the phase-II sampling into a phase-IIa sample obtained by a balanced or proportional sampling strategy, and a phase-IIb sample collected according to an optimal sampling design based on the data in phases I and IIa. This approach exploits the previously established robustness of optimal inverse probability weighted designs to overcome the difficulties associated with the fact that derivations of optimal designs require a priori knowledge of parameters. The efficiency of this hybrid design is compared to those of the proportional and balanced sampling designs, and to the efficiency of the true optimal design, in a variety of settings. The efficiency gains of this adaptive two-phase design are particularly apparent in the setting involving clustered response data, and it is natural to consider this approach in settings with complex models for which it is difficult to even speculate on suitable parameter values at the design stage.
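
The two-phase setting described in the abstract can be illustrated with a minimal simulation sketch. It is not taken from the thesis: the linear outcome model, the cut-point used to form strata, the sample sizes, and all function names are assumptions made for illustration. The sketch uses a balanced phase-II design (not an optimal one) and solves an inverse probability weighted estimating equation for a simple linear regression of the response on the "expensive" covariate.

```python
import random


def stratum(y, v):
    """Phase-I stratum from the response and the inexpensive auxiliary variable.

    The cut-point 0.5 is an arbitrary choice for this illustration.
    """
    return (y > 0.5, v)


def simulate_phase_one(n, rng):
    """Simulate (y, v, x) for n individuals.

    y: response, v: binary auxiliary variable, x: "expensive" covariate.
    The assumed true model is y = 0.5 + 1.0 * x + noise.
    """
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        v = 1 if x + rng.gauss(0.0, 1.0) > 0 else 0
        y = 0.5 + 1.0 * x + rng.gauss(0.0, 1.0)
        data.append((y, v, x))
    return data


def balanced_probabilities(data, per_stratum):
    """Selection probabilities that target the same expected number per stratum."""
    counts = {}
    for y, v, _ in data:
        s = stratum(y, v)
        counts[s] = counts.get(s, 0) + 1
    return {s: min(1.0, per_stratum / c) for s, c in counts.items()}


def ipw_linear_fit(data, probs, rng):
    """Select the phase-II sample, then solve the weighted normal equations.

    Each selected individual is weighted by 1 / (selection probability), so
    the estimating equation is unbiased even though selection depends on y.
    """
    sw = swx = swy = swxx = swxy = 0.0
    for y, v, x in data:
        pi = probs[stratum(y, v)]
        if rng.random() < pi:  # x is only "measured" for selected individuals
            w = 1.0 / pi
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * x * x
            swxy += w * x * y
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    intercept = (swy - slope * swx) / sw
    return intercept, slope


rng = random.Random(2013)
phase_one = simulate_phase_one(20000, rng)
probs = balanced_probabilities(phase_one, per_stratum=500)
intercept, slope = ipw_linear_fit(phase_one, probs, rng)
print(f"IPW estimates: intercept={intercept:.3f}, slope={slope:.3f}")
```

With a large phase-I sample, the weighted estimates should land near the assumed true values of 0.5 and 1.0. Replacing `balanced_probabilities` with a different rule for the selection probabilities is one way to compare designs empirically under these assumptions.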