Browsing by Author "Dubin, Joel"

Causal Inference and Matrix Completion with Correlated Incomplete Data
(University of Waterloo, 2023-01-19) Sun, Zhaohan; Zhu, Yeying; Dubin, Joel
Missing data problems are frequently encountered in biomedical research, social sciences, and environmental studies. When data are missing completely at random, a complete-case analysis may be the easiest approach. However, when data are not missing completely at random, ignoring the missing values will result in biased estimators. Much work on handling missing data has appeared in the last two decades, including likelihood-based methods, imputation methods, and Bayesian approaches. The so-called matrix completion algorithm is one of the imputation approaches that has been widely discussed in the missing data literature. However, in a longitudinal setting, limited efforts have been devoted to using covariate information to recover the outcome matrix via matrix completion when the response is subject to missingness. In Chapter 1, the basic definitions and concepts of different types of correlated data are introduced, along with the matrix completion algorithms and semiparametric approaches used to handle missingness in the correlated data literature. The definitions of robust estimation and interference in causal inference are also presented in this chapter. In Chapter 2, we consider the prediction of missing responses in a longitudinal dataset via matrix completion. We propose a fixed effects longitudinal low-rank model which incorporates both subject-specific and time-specific covariates. The missingness mechanism is allowed to be missing at random, and the inverse probability weighting approach is utilized to debias the traditional quadratic loss in the matrix completion literature. To solve the optimization problem, a two-step optimization algorithm is proposed which provides good statistical properties for the estimation of the fixed effects and the low-rank term. In the theoretical investigation, the non-asymptotic error bounds on the fixed effects and the low-rank term are presented.
We illustrate the finite-sample performance of the proposed algorithm via simulation studies and apply our method to both a COVID-19 dataset and a PM2.5 emissions dataset. In Chapter 3, we consider the partial interference setting, that is, the whole population can be partitioned into clusters where the outcome of each unit depends on the intervention on other units within the same cluster, but not on the units in different clusters. We also assume that the confounders are subject to nonignorable missingness. We propose three distinct consistent estimators for the direct, indirect, total, and overall effects of the intervention on the outcome, and derive the asymptotic results accordingly. A comprehensive simulation study is also carried out to investigate the finite-sample properties of the proposed estimators. We illustrate the proposed methods by analyzing data collected from the Acid Rain Program, which was launched to reduce air pollution in the USA by encouraging the installation of scrubbers on power plants, where the records of some operating characteristics of the power generating facilities are subject to missingness. In Chapter 4, we focus on the estimation of network causal effects. Under the setting of nonignorable missing confounders, we develop a multiply robust estimation procedure that gains extra protection against model misspecification. Compared with the doubly robust estimators proposed in Chapter 3, the proposed multiply robust estimators are consistent if either one pair of the propensity score of treatment and missingness mechanism, or the joint model of confounders and the outcome, is correctly specified. The finite-sample performance of the proposed methods under different missingness rates and cluster sizes is investigated, and we further illustrate the proposed methods with the same real data used in Chapter 3. We conclude this thesis and discuss future work in Chapter 5.
Specifically, in Section 5.1, we summarize the contributions of the chapters in this thesis. In Section 5.2, we discuss an extension of Chapter 2 in which the construction of confidence intervals for the low-rank term and the estimated fixed effects is investigated. Finally, in Section 5.3, we briefly discuss potential extensions of Chapters 3 and 4 to a more general setting.
Estimation and prediction methods for univariate and bivariate cyclic longitudinal data using a semiparametric stochastic mixed effects model
(University of Waterloo, 2018-06-19) Ji, Kexin; Dubin, Joel
In this thesis, I propose and consider inference for a semiparametric stochastic mixed model for bivariate longitudinal data, and provide a procedure for predicting a future cycle using past cycle information. This thesis builds on the work of Zhang et al. (1998) and Zhang, Lin & Sowers (2000). However, those papers leave substantial gaps in the theoretical results, apply only to univariate longitudinal data, and do not cover prediction of future cycles. We fill these gaps in this thesis and consider a real application to a dataset that contains bivariate longitudinal data. The proposed approach models the mean of the outcome variables by parametric fixed effects and a smooth nonparametric function for the underlying time effects, and the relationship across the bivariate responses by a bivariate Gaussian random field and a joint distribution of random effects. The prediction approach is proposed from the frequentist perspective, and a predictive density function with predictive intervals is provided. Simulation studies are performed and a real application to a hormone dataset is considered.
Exploring the differential impacts of social isolation, loneliness, and their combination on the memory of an aging population: A 6-year longitudinal study of the CLSA
(Elsevier, 2024) Kang, Ji Won; Oremus, Mark; Dubin, Joel; Tyas, Suzanne L; Oga-Omenka, Charity; Golberg, Meira
Memory plays a crucial role in cognitive health. Social isolation (SI) and loneliness (LON) are recognized risk factors for global cognition, although their combined effects on memory have been understudied in the literature. This study used three waves of data over six years from the Canadian Longitudinal Study on Aging to examine whether SI and LON are individually and jointly associated with memory in community-dwelling middle-aged and older adults (n = 14,208). LON was assessed with the question: "In the last week, how often did you feel lonely?" SI was measured using an index based on marital/cohabiting status, retirement status, social activity participation, and social network contacts. Memory was evaluated with combined z-scores from two administrations of the Rey Auditory Verbal Learning Test (immediate-recall, delayed-recall). We conducted our analyses using all available data across the three timepoints and retained participants with missing covariate data. Linear mixed models were used to regress combined memory scores onto SI and LON, adjusting for sociodemographic, health, functional ability, and lifestyle variables. Experiencing both SI and LON had the greatest inverse effect on memory (least-squares mean: -0.80 [95% confidence interval: -1.22, -0.39]), followed by LON alone (-0.73 [-1.13, -0.34]), then SI alone (-0.69 [-1.09, -0.29]), and lastly by being neither lonely nor isolated (-0.65 [-1.05, -0.25]). Sensitivity analyses confirmed this hierarchy of effects. Policies developed to enhance memory in middle-aged and older adults might achieve greater benefits when targeting the alleviation of both SI and LON rather than one or the other individually.
Longitudinal Patterns of Cognitive State Changes and their Predictors in Older Adults
(University of Waterloo, 2020-02-03) Iraniparast, Maryam; Tyas, Suzanne; Dubin, Joel
Older adults experience diverse patterns of cognitive state changes, including progression to dementia, that depend on genetic and non-genetic factors. With population aging, the global prevalence of dementia is rising. Given limited treatment success, research focusing on patterns of cognitive state changes and their predictors provides information for older adults and opens windows to develop interventions for preventing or delaying the onset of dementia. This dissertation is based on analyzing secondary data from the Nun Study, a longitudinal study of aging and cognition. The first aim of this dissertation was to identify patterns of changes over time in cognitive states among older adults using a clinically-driven approach and a statistical modeling method, and to compare the patterns identified using these two methods. The second aim was to test and quantify how academic achievement—educational attainment and academic performance in high school—is associated with cognitive state changes and contributes to cognitive reserve. The third aim was to test the potential antagonistic pleiotropy effect of the gene apolipoprotein E (APOE) on cognition. To identify the patterns of cognitive state changes (Aim 1), homogeneous trajectories were grouped together using two different approaches: 1) a clinically-driven approach, and 2) a statistical modeling approach, latent class mixed-effects modeling (lcmm). Using the clinically-driven approach, seven patterns were identified based on whether individuals experienced stable or non-stable trajectories and among non-stable trajectories, whether they experienced a reverse transition to an improved cognitive state, whether they developed dementia or both. These seven trajectories ranged from stable normal cognition to stable dementia. These patterns were preferred to the four classes identified using latent class mixed-effects modeling. 
This preference was based on the higher level of detail in trajectories captured by the clinically-driven approach compared to the latent classes identified using the lcmm approach. These details include distinguishing between trajectories with and without cognitive improvement, and with and without progression to dementia. The patterns of cognitive state changes based on the clinically-driven approach were then used as the cognitive outcomes to address the two additional aims, with stable dementia used as the reference category. Using multinomial logistic bias reduction regression, the potential presence of cognitive reserve among individuals with higher academic achievement was tested (Aim 2). Adjusting for age and APOE, higher educational attainment (i.e., a graduate degree) was associated with higher odds of experiencing three healthier patterns of cognitive state changes. Higher overall academic performance was significantly associated with experiencing stable cognitive impairment or cognitive impairment without dementia; this effect was mostly due to higher performance in algebra rather than performance in English, Latin, or geometry courses. Higher academic achievement, as evidenced by educational attainment or performance in high school courses, was thus associated with cognitive reserve through experiencing healthier patterns of cognitive trajectories versus experiencing stable dementia. To test the potential antagonistic pleiotropy effect of APOE on cognition, the effect of APOE-ε4 on both early- and late-life cognition was investigated (Aim 3). In addition, the potential modifying effect of higher education among APOE-ε4 carriers was tested. APOE-ε4 was not significantly associated with an early-life measure of cognition (educational attainment); however, among individuals with lower education, APOE-ε4 was associated with experiencing the most impaired cognitive pattern (stable dementia) in late life. 
This research did not support the antagonistic pleiotropy hypothesis for APOE; however, it did support the scaffolding theory of aging and cognition. Higher educational attainment among APOE-ε4 carriers compensated for the detrimental effects of APOE-ε4 on late-life cognition to the extent that APOE-ε4 carriers with high educational attainment (a graduate degree) showed cognitive aging patterns similar to APOE-ε4 non-carriers. This modifying effect of higher education on the association between APOE-ε4 and late-life cognition suggests that higher education is associated with cognitive reserve even among APOE-ε4 carriers. This dissertation provides information on patterns of cognitive state changes and their predictors in older adults that will benefit older adults, their families, and the healthcare system. Patterns of cognitive trajectories among older adults are diverse, complex, and difficult to identify. Advanced statistical approaches and their software applications are developed for modeling complex longitudinal cognitive trajectories; however, integrating clinically-driven approaches in identifying distinct patterns of cognitive state changes is beneficial. The results of this dissertation show that higher academic achievement may increase the odds of cognitive reserve by leading to healthier cognitive trajectories. While APOE-ε4 and older age are non-modifiable risk factors for dementia, it may be possible to compensate for their detrimental effects through a modifiable factor, such as graduate-level education. Therefore, investing in higher education is an important potential intervention that may prevent or delay dementia even among individuals carrying a genetic risk factor. Furthermore, it may be worthwhile for researchers targeting APOE to develop interventions that consider non-genetic factors that may modify the effect of APOE on cognition.
New Methods for Improving Accuracy in Three Distinct Predictive Modeling Problems
(University of Waterloo, 2018-08-22) Xu, Yingying; Dubin, Joel; Lee, Joon
People are often interested in predicting a new or future observation. In clinical prediction, the uptake of Electronic Health Records (EHRs) has generated massive health datasets that are big in volume and diverse in variety. The outcomes can be of different types, e.g., continuous, binary, or time-to-event, and covariates can be either time-fixed or longitudinal. These datasets can provide rich and diverse information for modeling and prediction but also pose challenges to fast and accurate prediction of outcomes of interest. One challenge in prediction arises when the data are heterogeneous in the relationship between the covariates and the outcome. In this case, it is quite possible that localizing a subset of data in an informative manner to aid in making predictions will lead to better performance than including all information. Chapter 3 deals with a continuous outcome, and I develop methodology that gives an interpretable and meaningful definition of similarity, and an algorithm to uncover the similarity structure to improve the prediction accuracy by making similarity-based predictions. In Chapter 4, the similarity-based prediction is extended to a survival outcome, with possible independent or dependent censoring. The algorithm is developed under the random forest framework, and I show through both simulations and a real data example that incorporating the similarity structure indeed improves prediction accuracy in these cases. Another challenge in prediction arises when longitudinal covariates are present and one needs to make an early prediction as soon as practical, and thus cannot monitor the full trajectory of the longitudinal covariates before the prediction is required. In Chapter 5, I address this concern by quantifying the relationship between the earliness of prediction and the prediction accuracy.
A penalization approach with a graphical method is introduced to select a monitoring window length given a specified prediction accuracy. Comprehensive simulations are conducted to investigate the performance of the algorithm in selecting the length of the monitoring window in different scenarios.
Toward Precision Medicine in Intensive Care: Leveraging Electronic Health Records and Patient Similarity
(University of Waterloo, 2019-05-28) Sharafoddini, Anis; Dubin, Joel; Lee, Joon
The growing adoption of Electronic Health Record (EHR) systems has resulted in an unprecedented amount of data. This availability of data has also opened up the opportunity to utilize EHRs for providing more customized care for each patient by considering individual variability, which is the goal of precision medicine. In this context, patient similarity (PS) analytics have been introduced to facilitate data analysis through investigating the similarities in patients’ data, and, ultimately, to help improve the healthcare system. This dissertation is presented in six chapters and focuses on employing PS analytics in data-rich intensive care units. Chapter 1 provides a review of the literature and summarizes studies describing approaches for predicting patients’ future health status based on EHR and PS. Chapter 2 demonstrates the informativeness of missing data in patient profiles and introduces missing data indicators to use this information in mortality prediction. The results demonstrate that including indicators with observed measurements in a set of well-known prediction models (logistic regression, decision tree, and random forest) can improve the predictive accuracy. Chapter 3 builds upon the previous results and utilizes these missing indicators to reveal patient subpopulations based on similarity in the laboratory tests ordered for them. In this chapter, the Density-based Spatial Clustering of Applications with Noise (DBSCAN) method was employed to group the patients into clusters using the indicators generated in the previous study. Results confirmed that missing indicators capture laboratory-test-ordering patterns that are informative and can be used to identify similar patient subpopulations. Chapter 4 investigates the performance of a multifaceted PS metric constructed by utilizing appropriate similarity metrics for specific clinical variables (e.g., vital signs, ICD-9 codes).
The proposed PS metric was evaluated in a 30-day post-discharge mortality prediction problem. Results demonstrate that PS-based prediction models with the new PS metric outperformed population-based prediction models. Moreover, the multifaceted PS metric significantly outperformed the cosine and Euclidean PS metrics in a k-nearest neighbors setting. Chapter 5 takes the previous results into consideration and looks for potential subpopulations among septic patients. Sepsis is one of the most common causes of death in Canada. The focus of this chapter is on longitudinal EHR data, which are collections of measurements recorded chronologically for each patient. This chapter employs Functional Principal Component Analysis to derive the dominant modes of variation in septic patients’ EHRs. Results confirm that including temporal data in the analysis can help in identifying subgroups of septic patients. Finally, Chapter 6 provides a discussion of results from previous chapters. The results indicate the informativeness of missing data and how PS can help in improving the performance of predictive modeling. Moreover, results show that utilizing the temporal information in PS calculation improves patient stratification. Finally, the discussion identifies limitations and directions for future research.
Using Decision Trees to Examine the Influence of the School Environment on Youth Mental Health
(University of Waterloo, 2022-12-22) Battista, Katelyn; Leatherdale, Scott; Dubin, Joel
Youth mental health is a current public health priority in Canada, with nearly one in four young people living with a mental illness. The contextual school environment can be particularly influential given the considerable amount of time that youth spend in school. Schools are seen as ideal settings for prevention and early intervention initiatives. While a myriad of practices and programs are being implemented across schools to address student mental health, there is limited and contradictory evidence on their effectiveness. Most available research has been conducted using statistical techniques that have limited ability to account for the complex interactions between co-occurring environmental influences. While machine learning techniques such as decision trees are well suited for this type of analysis, they are relatively underused in public health research. The overall objective of this dissertation was to use decision tree analysis to further our understanding of the influence of the school contextual environment on youth depression, anxiety, and psychosocial wellbeing. Specific objectives were to (1) compare the performance of decision trees to traditional regression models in the context of health survey data, (2) determine which environmental and behavioural factors are most influential on mental health outcomes, and (3) determine which, if any, combinations of school mental health practices are associated with better student mental health. These objectives were addressed through three manuscripts using student- and school-level data from the 2017-18 and 2018-19 waves of the COMPASS study. The first manuscript provided a methodological overview and application of two decision tree techniques: classification and regression trees and conditional inference trees. Decision tree model performance was compared to traditional linear and logistic regression. 
All techniques showed general agreement in the identification of key differentiating factors across five outcomes. Tree models had slightly lower prediction accuracy than regression models but were more parsimonious. Unlike traditional regression methods, decision trees allowed for the identification of non-linear associations and differential impacts among high-risk subgroups. The second manuscript used cross-sectional student-level data to examine associations of various environmental and behavioural risk factors with youth anxiety, depression, and flourishing levels. Having a happy home life and sense of school connection were identified as key protective factors, while behavioural factors such as diet, movement, and substance use did not emerge as important differentiators. Females lacking both happy home life and sense of connection to school were at greatest risk for higher anxiety and depression levels. These results highlighted the importance of the home and school environments and suggested that a sense of connection to school may help to mitigate the negative influence of a poor home environment. The third manuscript used longitudinal student- and school-level data to examine variation in school mental health practices as well as associations between changes in these practices and youth anxiety, depression, and flourishing levels. Decision trees were used to comprehensively examine whether any combination of practice and service changes were associated with mental health outcomes. While substantial variability was seen in the mental health practices and services offered between schools and across years, decision tree analysis found no combinations of changes that meaningfully contributed to better student mental health outcomes. These results suggested that incremental practice changes were not effective and highlighted the need for more comprehensive school mental health approaches. 
This dissertation used a novel decision tree approach to expand our knowledge of the influence of the school contextual environment on youth depression, anxiety, and psychosocial wellbeing. These findings have important implications for practice, as they suggest that schools can enhance student mental health through initiatives that foster a supportive school environment and sense of connection. These findings further support calls for comprehensive school health programming by showing that current tactics of incremental and sporadic practice changes at the individual school level are ineffective. This dissertation also provides a framework for future research, as the decision tree approach used here can be applied to other public health domains to examine complex interactions and identify high-risk subgroups. Further, the ability to comprehensively examine permutations of simultaneously changing factors makes decision trees a compelling tool for natural experiment evaluation. In addition to answering important research questions regarding the influence of school context on youth mental health, this dissertation work highlights the potential power in combining machine learning methods with large population health surveillance data.