Variable Selection and Prediction for Multistate Processes under Complex Observation Schemes

Li, Xianwei
2025-09-19
2025-09-18
https://hdl.handle.net/10012/22493
This thesis addresses variable selection and prediction in time-to-event analysis under complex observation schemes that commonly arise in biomedical studies. Such schemes may lead to right-censored data, interval-censored event times, or dual-censoring scenarios. Across three main chapters, we develop variable selection methods for multistate processes, address challenges arising from incomplete data under complex observation schemes, and investigate the implications of model misspecification, such as using simpler models in place of multistate models, and the consequences of violated assumptions for covariate effect estimation and predictive performance.
In Chapter 2, we begin by considering the problem of variable selection for progressive multistate processes under intermittent observation. This study is motivated by the need to identify which markers, among a large list of candidates, play a role in the progression of joint damage in patients with psoriatic arthritis (PsA). We adopt a penalized log-likelihood approach and develop an innovative expectation-maximization (EM) algorithm in which the maximization step can exploit existing software for penalized Poisson regression, thereby enabling flexible use of common penalty functions. Simulation studies show good performance in identifying important markers with different penalty functions. We apply the algorithm to the motivating cohort of PsA patients with repeated assessments of joint damage and identify, from a large group of candidate markers, human leukocyte antigen (HLA) markers that are associated with disease progression.
Chapter 3 extends this algorithm to more general multistate processes and to more complex observation schemes.
We consider the classical illness-death model, which offers a useful framework for studying the progression of chronic disease while jointly modeling death. The exact time of disease progression is not observed directly; instead, progression status is recorded at intermittent assessment times, while the time to death is subject to right-censoring. This creates a dual observation scheme in which progression times are interval-censored and survival times are right-censored. We propose a penalized observed-data likelihood approach that allows for separate penalties across the different intensity functions. An EM algorithm is again developed so that different penalties can be used for variable selection on disease progression and on death through penalized Poisson regression. This adaptation retains the flexibility to exploit existing software with commonly used penalty functions. Simulation studies show good finite-sample performance in variable selection with different combinations of penalty functions. We also explore how various aspects of the variable selection algorithm affect performance, such as the use of nonparametric baseline intensities and different ways of selecting the optimal tuning parameter(s). An application to data from the National Alzheimer’s Coordinating Center (NACC) demonstrates the use of our method in jointly modeling dementia progression and mortality.
Chapter 4 builds on insights from Chapters 2 and 3 by investigating how simpler marginal methods targeting the entry time to the absorbing state (e.g., a Cox proportional hazards model) compare with full multistate models. Here we retain the illness-death process as the basis of the investigation, but consider settings where transition times are only right-censored. We first study the limiting values of regression estimators from a Cox proportional hazards model when the data-generating process is a Markov illness-death model.
The potential impact of modeling the multistate process with a misspecified model is also investigated by considering cases where a) important covariates are omitted, or b) the Markov assumption is violated. We then examine the implications of model misspecification when the goal is prediction. This is done by evaluating the predictive performance of a misspecified Cox regression model for overall survival and a misspecified Fine-Gray model for disease progression, and comparing each against the predictive performance of the true illness-death model. We find that the limiting values of the regression coefficient estimators obtained from Cox models and Fine-Gray models depend on several factors, including the baseline hazard ratio of death between the intermediate and initial states, the probability of moving through the intermediate state, and covariate effects on all transitions. However, in most scenarios we investigated, the corresponding predictive accuracy is not substantially compromised despite biases in the regression coefficient estimators. The limiting values of the regression coefficient estimators obtained from a Markov illness-death model, and the corresponding predictive accuracy, are sensitive to model misspecification such as omitting important covariates and violating the Markov assumption. The practical implications are illustrated using a dataset of control-arm patients with metastatic breast cancer to predict overall survival and fracture risk.
Chapter 5 reviews the contributions of this thesis and discusses problems warranting future research.
en
Doctoral Thesis