Statistical Methods for Event History Data under Response Dependent Sampling and Incomplete Observation

Shi, Yidan

Files

Shi_Yidan.pdf (958.71 KB)

Date

2020-07-17

Authors

Shi, Yidan

Advisor

Thomson, Mary
Zeng, Leilei

Publisher

University of Waterloo

Abstract

This thesis addresses statistical problems in event history analysis, including survival analysis and multistate models. The research questions are motivated by the Nun Study, which contains longevity data and longitudinal follow-up of cognitive function in 678 religious sisters. Our research interests lie in modeling the survival pattern and the disease process for dementia. These data are subject to a process-dependent sampling scheme, and the homogeneous Markov assumption is violated when a multistate model is fitted to the panel data on cognition. We formulate three statistical questions arising from these issues and propose approaches to address them.

Survival data are often subject to left truncation when they are collected within certain study windows. Naive methods that ignore the sampling conditions yield invalid estimates. Much work has been done to deal with the bias caused by left truncation; however, discussion of the accompanying loss of efficiency is limited. In Chapter 2, we propose a method that borrows auxiliary information to improve the efficiency of estimation. The auxiliary information includes summary-level statistics from a previous study on the same cohort and census data for a comparable population. The likelihood and score functions are developed. A Monte Carlo approximation is proposed to deal with the difficulty of obtaining tractable forms of the score and information functions. The method is illustrated by both simulation and an application to the Nun Study data.

Continuous-time Markov models are widely used for analyzing longitudinal data on disease progression because the transition probability matrices and likelihood functions are convenient to compute. However, in practice, the Markov assumption does not always hold. 
Most existing methods relax the Markov assumption but lose the computational advantage it provides in calculating transition probabilities. In Chapter 3, we consider the case where the violation of the Markov property is due to multiple underlying types of disease. We propose a mixture hidden Markov model in which the underlying process is characterized by a mixture of multiple time-homogeneous Markov chains, one for each disease type, while the observation process contains states corresponding to the common symptomatic stages of these diseases. The method can be applied to modeling the disease process of Alzheimer's disease and other types of dementia. In the Nun Study, autopsies were conducted on some of the deceased participants, so it is known whether these individuals had Alzheimer's pathology in their brains. Our method can incorporate these partially observed pathology data as disease type indicators to improve the efficiency of estimation. Predictions of the overall and type-specific prevalence of dementia are calculated based on the proposed method. The performance of the proposed method is also evaluated via simulation studies.

Many prospective cohort studies of chronic diseases select individuals whose observed process history satisfies particular conditions. For instance, studies aiming to estimate the incidence rate of dementia or the effect of genetic factors on the disease would recruit individuals on the condition that they are alive and disease-free. In contrast, other studies may aim to collect information on disease progression or mortality from the time of disease onset. Under such settings, individuals are recruited if they occupy one of a subset of the states at study entry, and the methods of estimation need to account for such state-dependent selection conditions. 
For multistate analysis, one option is to construct the likelihood based on the prospective data given the history up to and including the time of accrual. This approach yields consistent estimates under state-dependent sampling conditions, at the price of a loss of efficiency. Alternatively, the likelihood contribution from the retrospective and current status data at the time of accrual can be incorporated, although such information is difficult to obtain. For example, subjects' initial states are often unknown, which makes it challenging to compute the contribution from the current status data at the time of recruitment. However, auxiliary information on the initial states may be available, such as age-specific population prevalence data related to the disease. In Chapter 4, we propose a weighted-likelihood method that incorporates auxiliary prevalence data and accounts for the state-dependent selection condition. The method is demonstrated by simulation and applied to the Nun Study of aging and Alzheimer's disease. A Bayesian sensitivity test is conducted to evaluate the impact of misspecification of the auxiliary prevalence.
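
The abstract's remark that naive methods ignoring left truncation yield invalid estimates can be illustrated with a small simulation. The sketch below is not the method developed in the thesis. It assumes exponential lifetimes, uniformly distributed study-entry times, and no censoring, and all function names and parameter values are hypothetical choices made for illustration. It compares the naive maximum likelihood estimate of the event rate with the estimate from the conditional likelihood, in which each subject contributes f(t)/S(a) for event time t and entry time a.

```python
import random


def simulate_left_truncated(rate, n, max_entry, seed=0):
    """Return n (entry, event) pairs; a subject is kept only if event > entry."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n:
        entry = rng.uniform(0.0, max_entry)  # hypothetical study-entry time
        event = rng.expovariate(rate)        # exponential lifetime (assumption)
        if event > entry:                    # left truncation: must be event-free at entry
            sample.append((entry, event))
    return sample


def naive_rate(sample):
    """Exponential MLE that ignores truncation: n / sum of event times."""
    return len(sample) / sum(t for _, t in sample)


def truncation_adjusted_rate(sample):
    """MLE from the conditional likelihood prod f(t_i) / S(a_i).

    For the exponential model this reduces to n / sum(t_i - a_i).
    """
    return len(sample) / sum(t - a for a, t in sample)


true_rate = 0.5
data = simulate_left_truncated(true_rate, n=20000, max_entry=4.0, seed=1)
naive = naive_rate(data)
adjusted = truncation_adjusted_rate(data)
print(f"true rate: {true_rate}, naive: {naive:.3f}, adjusted: {adjusted:.3f}")
```

Under these assumptions the naive estimate understates the rate, because short lifetimes are under-represented among the recruited subjects, whereas the conditional-likelihood estimate should lie close to the true value.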