Design and Analysis of Life History Studies Involving Incomplete Data

Mao, Fangya

Files

Mao_Fangya.pdf (874.03 KB)

Date

2022-04-26

Authors

Mao, Fangya

Advisor

Cook, Richard

Publisher

University of Waterloo

Abstract

Incomplete life history data can arise from study designs, coarsened observations, missing covariates, and unobserved latent processes. This thesis consists of three projects developing statistical models and methods to address problems involving such features.

Statistical models which facilitate the exploration of spatial dependence can advance scientific understanding of chronic disease processes affecting several organ systems or body sites. Motivated by the need to investigate the spatial nature of joint damage in patients with psoriatic arthritis, in Chapter 2 we develop a multivariate mixture model to characterize latent susceptibility and the progression of joint damage in different locations. In addition to the large number of joints under consideration and the heterogeneity in risk, the times to joint damage are subject to interval censoring, as damage status is only observed at intermittent radiological examination times. We address the computational and inferential challenges through the use of composite likelihood and two-stage estimation procedures. The key contribution of this chapter is the development of a convenient and general framework for regression modeling to study risk factors for susceptibility to joint damage and the time to damage, as well as the spatial dependence of these features.

The design and analysis of two-phase studies have been investigated for biomarker studies involving lifetime data. Two-phase designs aim to guide the efficient selection of a sub-sample of individuals from a phase I cohort in whom to measure some "expensive" markers under budgetary constraints. In phase I, information on the response and inexpensive covariates is available for a large cohort; in phase II, a subsample is selected in which to assay the marker of interest through examination of a biospecimen. Design efficiency is measured in terms of the precision in estimating the effect of the biomarker on some event process of interest (e.g. disease progression). Chapter 3 considers two-phase designs involving current status observation of the failure process; here individuals are monitored at a single assessment time to determine whether or not they have experienced a failure event of interest. This kind of observation scheme is sometimes desirable in practice as it is more efficient and cost-effective than carrying out multiple assessments. We examine efficient two-phase designs under two analysis methods, namely maximum likelihood and inverse probability weighting. The former tends to be more efficient but requires additional model assumptions involving the nuisance covariate model, while the latter is more robust but yields less efficient estimators since it only analyses data from the phase II subsample. The optimal designs are derived by minimizing the asymptotic variance of the coefficient estimators for the expensive marker. To circumvent the computational challenge in evaluating asymptotic variances at the design stage, we consider designs involving sub-sampling based on extreme score statistics, extreme observations, or stratified sub-sampling schemes. The role of the assessment time is highlighted.

Research involving progressive chronic disease processes can be conducted by synthesizing data from different disease registries using different enrolment conditions. In inception cohorts, for example, individuals may be required to not have entered an advanced stage of the disease, while disease registries may focus on individuals who have progressed to a more advanced stage. The former yields left-truncated progression times while the latter yields right-truncated progression times. Chapter 4 considers the development of two-phase designs when the phase I sample contains data pooled from different registries launched to recruit individuals from a common population with different disease-dependent selection criteria. We frame the complex data structure using multistate models and construct partial likelihoods restricted to the parameters of interest using intensity-based models under some model assumptions. Both recruitment (phase I) and sub-selection (phase II) biases are accounted for to ensure valid inference. An inverse probability weighting method is also developed to relax or weaken the assumptions needed for the likelihood approach. We investigate and compare the performance of various two-phase sampling schemes under each analysis method and provide practical guidance for phase II selection given budgetary constraints.

The contributions of this thesis are reviewed in Chapter 5, where we also mention topics of future research.
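As a rough illustration of the inverse probability weighting idea mentioned in the abstract, the following minimal Python sketch simulates a phase I cohort with a current status indicator, draws a phase II subsample with outcome-dependent selection probabilities, and compares a weighted with an unweighted estimate. It is not the thesis's method: the binary marker, the failure probabilities, the selection probabilities, and all function names are assumptions chosen for illustration, and the target is a simple risk among marker-positive individuals rather than a regression coefficient.

```python
import random


def simulate_cohort(n, seed=0):
    """Simulate a hypothetical phase I cohort of (marker, current status) pairs."""
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        x = 1 if rng.random() < 0.3 else 0       # expensive binary marker (assumed prevalence)
        p_fail = 0.6 if x == 1 else 0.2           # assumed P(failed by assessment time | marker)
        delta = 1 if rng.random() < p_fail else 0  # current status indicator
        cohort.append((x, delta))
    return cohort, rng


def select_phase2(cohort, rng, pi_by_delta):
    """Select each individual independently with a probability depending on delta."""
    sample = []
    for x, delta in cohort:
        pi = pi_by_delta[delta]
        if rng.random() < pi:
            # The marker x is treated as observed only for selected individuals.
            sample.append((x, delta, pi))
    return sample


def risk_among_marker_positive(sample, weighted):
    """Estimate P(delta = 1 | x = 1), optionally weighting each record by 1 / pi."""
    num = 0.0
    den = 0.0
    for x, delta, pi in sample:
        if x == 1:
            w = 1.0 / pi if weighted else 1.0
            num += w * delta
            den += w
    return num / den


cohort, rng = simulate_cohort(50000, seed=1)

# Oversample individuals who had failed by the assessment time (assumed design).
phase2 = select_phase2(cohort, rng, pi_by_delta={0: 0.1, 1: 0.5})

# Benchmark that is only available in a simulation, where the marker is known for everyone.
full_risk = risk_among_marker_positive([(x, d, 1.0) for x, d in cohort], weighted=False)
ipw_risk = risk_among_marker_positive(phase2, weighted=True)
naive_risk = risk_among_marker_positive(phase2, weighted=False)

print(f"phase II size: {len(phase2)} of {len(cohort)}")
print(f"full cohort: {full_risk:.3f}  IPW: {ipw_risk:.3f}  unweighted: {naive_risk:.3f}")
```

Under these assumed settings the weighted estimate should sit close to the full-cohort value, while the unweighted phase II estimate overstates the risk because individuals with an observed failure are oversampled.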