# Statistics and Actuarial Science

## Permanent URI for this collection

This is the collection for the University of Waterloo's **Department of Statistics and Actuarial Science**.

Research outputs are organized by type (e.g., Master Thesis, Article, Conference Paper).

Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.

## Browse

### Recent Submissions

Item: The roughness exponent and its application in finance (University of Waterloo, 2024-08-30) Han, Xiyue

Rough phenomena and trajectories are prevalent across mathematics, engineering, physics, and the natural sciences. In quantitative finance, sparked by the observation that the historical realized volatility is rougher than a Brownian martingale, rough stochastic volatility models were recently proposed and studied extensively. Unlike classical diffusion volatility models, the volatility process in a rough stochastic volatility model is driven by an irregular process. The roughness of the volatility process plays a pivotal role in governing the behavior of such models. This thesis aims to explore the concept of roughness and estimate the roughness of financial time series in a strictly pathwise manner. To this end, we introduce the notion of the roughness exponent, which is a pathwise measure of the degree of roughness. A substantial portion of this thesis focuses on the model-free estimation of this roughness exponent for both price and volatility processes. Towards the end of this thesis, we study the Wiener–Young Φ-variation of classical fractal functions, which can be regarded as a finer characterization than the roughness exponent. Chapter 2 introduces the roughness exponent and establishes a model-free estimator for the roughness exponent based on direct observations. We say that a continuous real-valued function x admits the roughness exponent R if the pth variation of x converges to zero for p>1/R and to infinity for p<1/R. The main result of this chapter provides a mild condition on the Faber–Schauder coefficients of x under which the roughness exponent exists and is given as the limit of the classical Gladyshev estimates. This result can be viewed as a strong consistency result for the Gladyshev estimator in an entirely model-free setting because no assumption whatsoever is made on the possible dynamics of the function x.
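
The definition above can be illustrated with a small sketch. The abstract does not state the estimator's formula, so the normalization below (a Gladyshev-type estimate built from the sum of squared increments along the n-th dyadic partition of [0, 1]) is an assumption, and the names `gladyshev_estimate` and `simulate_brownian_path` are hypothetical helpers, not code from the thesis.

```python
import math
import random


def gladyshev_estimate(samples):
    """Gladyshev-type roughness estimate from values on a dyadic grid.

    `samples` holds x(k / 2**n) for k = 0, ..., 2**n, so it must contain
    2**n + 1 values. The normalization is an assumption for illustration:
    R_n = (1 - log2(sum of squared increments) / n) / 2.
    """
    num_steps = len(samples) - 1
    n = num_steps.bit_length() - 1
    if n < 1 or num_steps != 2 ** n:
        raise ValueError("expected 2**n + 1 samples with n >= 1")
    # Sum of squared increments along the n-th dyadic partition.
    sq_var = sum((b - a) ** 2 for a, b in zip(samples, samples[1:]))
    if sq_var == 0.0:
        raise ValueError("constant path: the estimate is undefined")
    return 0.5 * (1.0 - math.log2(sq_var) / n)


def simulate_brownian_path(n, seed=0):
    """Simulate a Brownian path on the dyadic grid with 2**n steps."""
    rng = random.Random(seed)
    step_sd = math.sqrt(2.0 ** -n)
    path = [0.0]
    for _ in range(2 ** n):
        path.append(path[-1] + rng.gauss(0.0, step_sd))
    return path


if __name__ == "__main__":
    linear = [k / 2 ** 10 for k in range(2 ** 10 + 1)]
    print(gladyshev_estimate(linear))                      # smooth path
    print(gladyshev_estimate(simulate_brownian_path(12)))  # Brownian path
```

Under this normalization, a smooth path such as x(t) = t yields an estimate of 1, while a simulated Brownian path yields a value close to 1/2. This agrees with the definition in terms of the pth variation.
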
Nonetheless, we show that the condition of our main result is satisfied for the typical sample paths of fractional Brownian motion with drift, and we provide almost-sure convergence rates for the corresponding Gladyshev estimates. In this chapter, we also discuss the connections between our roughness exponent and Besov regularity and weighted quadratic variation. Since the Gladyshev estimators are not scale-invariant, we construct several scale-invariant modifications of our estimator. Finally, we extend our results to the case where the pth variation of x is defined over a sequence of unequally spaced partitions. Chapter 3 considers the problem of estimating the roughness exponent of the volatility process in a stochastic volatility model that arises as a nonlinear function of a fractional Brownian motion with drift. To this end, we establish a new estimator based on the Gladyshev estimator that estimates the roughness exponent of a continuous function x, but based on the observations of its antiderivative y. We identify conditions on the underlying trajectory x under which our estimates converge in a strictly pathwise sense. Then, we verify that these conditions are satisfied by almost every sample path of fractional Brownian motion with drift. As a consequence, we obtain strong consistency of our estimator in the context of a large class of rough volatility models. Numerical simulations are implemented to show that our estimation procedure performs well after passing to a scale-invariant modification of our estimator. Chapter 4 highlights the rationale of constructing the estimator from the Gladyshev estimator. In this chapter, we study the problem of reconstructing the Faber–Schauder coefficients of a continuous function from discrete observations of its antiderivative. This problem arises in the task of estimating the roughness exponent of the volatility process of financial assets but is also of independent interest.
Our approach starts with formulating this problem through piecewise quadratic spline interpolation. We then provide a closed-form solution and an in-depth error analysis between the actual and approximated Faber–Schauder coefficients. These results lead to some surprising observations, which also throw new light on the classical topic of quadratic spline interpolation itself: they show that the well-known instabilities of this method can be located exclusively within the final generation of estimated Faber–Schauder coefficients, which suffer from non-locality and strong dependence on the initial value and the given data. By contrast, all other Faber–Schauder coefficients depend only locally on the data, are independent of the initial value, and admit uniform error bounds. We thus conclude that a robust and well-behaved estimator for our problem can be obtained by simply dropping the final-generation coefficients from the estimated Faber–Schauder coefficients. Chapter 5 studies the Wiener–Young Φ-variation of classical fractal functions with a critical degree of roughness. In this case, the functions have vanishing pth variation for all p>q but are also of infinite pth variation for p<q for some q≥1. We partially resolve this apparent puzzle by showing that these functions have finite, nonzero, and linear Wiener–Young Φ-variation along the sequence of certain partitions. For instance, functions of bounded variation admit vanishing pth variation for any p>1. On the other hand, Weierstrass and Takagi–van der Waerden functions have vanishing pth variation for p>1 but are also nowhere differentiable and hence not of bounded variation. As a result, power variation and the roughness exponent fail to distinguish the degrees of roughness of these functions.
However, we can individuate these functions by showing that the Weierstrass and Takagi–van der Waerden functions admit a nontrivial and linear Φ-variation along the sequence of b-adic partitions, where $\Phi_q(x) = x/(-\log x)^{1/2}$. Moreover, for q>1, we further develop a probabilistic approach so as to identify functions in the Takagi class that have linear and nontrivial Φ_q-variation for a prescribed Φ_q. Furthermore, for each fixed q>1, the collection of functions Φ_q forms a wide class of increasing functions that are regularly varying at zero with an index of regular variation q.

Item: Projection Geometric Methods for Linear and Nonlinear Filtering Problems (University of Waterloo, 2024-08-28) Ahmed, Ashraf

In this thesis, we review the infinite-dimensional space containing the solution of a broad range of stochastic filtering problems, and we outline the substantive differences between the foundations of finite-dimensional information geometry and Pistone’s extension to infinite dimensions. We characterize these differences with respect to the geometric structures needed for projection theorems: a dually flat affine manifold preserving the affine and convex geometry of the set of all probability measures with the same support; the notion of orthogonal complement between the different tangent representations, which is key for the generalized Pythagorean theorem; and the notion of exponential and mixture parallel transport needed for projecting a point onto a submanifold.
We also explore the projection method proposed by Brigo and Pistone for reducing the dimensionality of infinite-dimensional measure-valued evolution equations from the infinite-dimensional space in which they are written, that is, the infinite-dimensional statistical manifold of Pistone, onto a finite-dimensional exponential subfamily using a local generalized projection theorem that is a non-parametric analog of the generalized projection theorem proposed by Amari. Also, using standard arguments, we explore the projection idea in the discrete state space, with a focus on building intuition and using computational examples to understand properties of the projection method. We establish two novel results regarding the impact of the boundary and of choosing a subfamily that does not contain the initial condition of the problem. We demonstrate that, when the evolution process approaches the boundary of the space, the projection method fails completely due to the classical boundary problem relating to the vanishing of the tangent spaces at the boundary. We also show the impact of choosing a subfamily to project onto that does not contain the initial condition of the problem: in certain directions, the approximation by projection deviates from the true value because it solves a different differential equation than if we were to start from within the low-dimensional manifold. Using computational experiments, we also study the importance of having the sufficient statistics of the exponential subfamily lie in the span of the left eigenfunctions of the infinitesimal generator of the process we wish to project.

Item: Efficient Bayesian Computation with Applications in Neuroscience and Meteorology (University of Waterloo, 2024-08-23) Chen, Meixi

Hierarchical models are important tools for analyzing data in many disciplines. Efficiency and scalability in model inference have become increasingly important areas of research due to the rapid growth of data.
Traditionally, parameters in a hierarchical model are estimated by deriving closed-form estimates or Monte Carlo sampling. Since the former approach is only possible for simpler models with conjugate priors, the latter, Markov Chain Monte Carlo (MCMC) methods in particular, has become the standard approach for inference without a closed form. However, MCMC requires substantial computational resources when sampling from hierarchical models with complex structures, highlighting the need for more computationally efficient inference methods. In this thesis, we study the design of Bayesian inference to improve computational efficiency, with a focus on a class of hierarchical models known as *latent Gaussian models*. The background of hierarchical modelling and Bayesian inference is introduced in Chapter 1. In Chapter 2, we present a fast and scalable approximate inference method for a widely used model in meteorological data analysis. The model features a likelihood layer of the generalized extreme value (GEV) distribution and a latent layer integrating spatial information via Gaussian process (GP) priors on the GEV parameters, hence the name GEV-GP model. The computational bottleneck is caused by the high number of spatial locations being studied, which corresponds to the dimensionality of the GPs. We present an inference procedure based on the Laplace approximation to the likelihood followed by a Normal approximation to the posterior of interest. By combining the above approach with a sparsity-inducing spatial covariance approximation technique, we demonstrate through simulations that it accurately estimates the Bayesian predictive distribution of extreme weather events, scales to several thousand spatial locations, and is several orders of magnitude faster than MCMC. We also present a case study in forecasting extreme snowfall across Canada.
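
To make the idea of a Laplace (Normal) approximation concrete, here is a minimal sketch on a toy one-dimensional model. It is not the GEV-GP procedure of the thesis: the Poisson likelihood with a Normal prior on the log-rate, the function name `laplace_approximation`, and all parameter names are assumptions chosen only to keep the example short.

```python
import math


def laplace_approximation(y, prior_sd, tol=1e-12, max_iter=100):
    """Laplace approximation for a toy model (an assumption, not GEV-GP).

    Model: y ~ Poisson(exp(theta)), theta ~ Normal(0, prior_sd**2).
    Returns the posterior mode and the variance of the Normal
    approximation centred at that mode.
    """
    prior_prec = 1.0 / prior_sd ** 2
    # Start to the right of the mode so Newton's method descends monotonically.
    theta = math.log(y + 1.0)
    for _ in range(max_iter):
        rate = math.exp(theta)
        grad = y - rate - prior_prec * theta   # derivative of the log posterior
        hess = -rate - prior_prec              # second derivative, always negative
        step = grad / hess
        theta -= step                          # Newton update
        if abs(step) < tol:
            break
    # Variance is the inverse of the negative curvature at the mode.
    variance = 1.0 / (math.exp(theta) + prior_prec)
    return theta, variance


if __name__ == "__main__":
    mode, variance = laplace_approximation(y=7, prior_sd=2.0)
    print(mode, variance)
```

The approximation replaces the exact posterior with a Normal distribution centred at the mode, with variance taken from the curvature there. Only a few Newton steps are needed, compared with long sampling runs.
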
Building on the approximate inference scheme discussed in Chapter 2, Chapter 3 introduces a new modelling framework for capturing the correlation structure in high-dimensional neuronal data, known as *spike trains*. We propose a novel continuous-time multi-neuron latent factor model based on the biological mechanism of spike generation, where the underlying neuronal activities are represented by a multivariate Markov process. To the best of our knowledge, this is the first multivariate spike-train model in a continuous-time setting to study interactions between neural spike trains. A computationally tractable Bayesian inference procedure is proposed to address the challenges in estimating high-dimensional latent parameters. We show that the proposed model and inference method can accurately recover underlying neuronal interactions when applied to a variety of simulated datasets. Application of our model to experimental data reveals that the correlation structure of spike trains in rats' orbitofrontal cortex predicts outcomes following different cues. While Chapter 3 restricts modelling to Markov processes for the latent dynamics of spike trains, Chapter 4 presents an efficient inference method for non-Markov stationary GPs with noisy observations. While computations for such models typically scale as $\mathcal{O}(n^3)$ in the number of observations $n$, our method utilizes a "superfast" Toeplitz system solver which reduces computational complexity to $\mathcal{O}(n \log^2 n)$. We demonstrate that our method dramatically improves the scalability of Gaussian Process Factor Analysis (GPFA), which is commonly used for extracting low-dimensional representations of high-dimensional neural data. We extend GPFA to accommodate Poisson count observations and design a superfast MCMC inference algorithm for the extended GPFA.
The accuracy and speed of our inference algorithms are illustrated through simulation studies.

Item: Robust Decision-Making in Finance and Insurance (University of Waterloo, 2024-08-22) Zhang, Yuanyuan

Traditional finance models assume a decision maker (DM) has a single view on the stochastic dynamics governing the price process but, in practice, the DM may be uncertain about the true probabilistic model that governs the occurrence of different states. If only risk is present, the DM fully relies on a single probabilistic model P. When the DM is ambiguous, the DM holds different views on the precise distributions of the price dynamics. This type of model uncertainty due to the multiple probabilistic views is called ambiguity. In the presence of model ambiguity, Maccheroni et al. (2013) propose a novel robust mean-variance model which is referred to as the mean-variance-variance (M-V-V) criterion in this thesis. The M-V-V model is an analogue of the Arrow-Pratt approximation to the well-known smooth ambiguity model, but it offers a more tractable structure and meanwhile separates the modeling of ambiguity, ambiguity aversion, and risk aversion. In Chapters 3 and 4, we study the dynamic portfolio optimization and the dynamic reinsurance problem under the M-V-V criterion and derive the equilibrium strategies in light of the issue of time inconsistency. We find the equilibrium strategies share many properties with the ones from smooth ambiguity, but the time horizon appears inconsistently in the objective function of the M-V-V criterion, in turn causing the equilibrium strategies to be non-monotonic with respect to the risk aversion. To resolve this issue, we further propose a mean-variance-standard deviation (M-V-SD) criterion. The corresponding equilibrium investment strategy exhibits the appealing feature of limited stock market participation, a well-documented stylized fact in empirical studies.
The corresponding equilibrium reinsurance strategy also displays the property of restricted insurance retention. Chapter 5 analyzes optimal longevity risk transfers, focusing on differing buyer and seller risk aversions using a Stackelberg game framework. We compare static contracts, which offer long-term protection with fixed terms, to dynamic contracts, which provide short-term coverage with variable terms. Our numerical analysis with real-life mortality data shows that risk-averse buyers prefer static contracts, leading to higher welfare gains and flexible market conditions, while less risk-averse buyers favor dynamic contracts. Ambiguity, modeled as information asymmetry, reduces welfare gains and market flexibility but does not change contract preferences. These findings explain key empirical facts and offer insights into the longevity-linked capital market. Among the remaining chapters, Chapter 1 introduces the background literature and main motivations of this thesis. Chapter 2 covers the mathematical preliminaries for the subsequent chapters. The core analysis and findings are presented in the following chapters. Finally, Chapter 6 concludes the thesis and suggests potential directions for future research.

Item: Estimation Methods with Recurrent Causal Events (University of Waterloo, 2024-08-21) Zhang, Wenling

This dissertation presents a comprehensive exploration of the causal effects of treatment strategies on recurrent events within complex longitudinal settings. Utilizing a series of advanced statistical methodologies, this work focuses on addressing challenges in causal inference when faced with the complexities related to various treatment strategies, recurrent outcomes and time-varying covariates that are confounded or censored. The first chapter lays the groundwork by introducing two real-life datasets that provide a practical context for investigating recurrent causal events.
In this chapter, we establish the foundation of essential concepts and terminologies. We give an overview of conventional causal estimands and various estimation methods in non-recurrent event settings, providing the necessary tools and knowledge base for effective causal analysis in the more intricate longitudinal studies with recurrent event outcomes discussed in subsequent chapters. Chapter two extends the traditional time-fixed measure of marginal odds ratios (MORs) to a more complex, causal longitudinal setting. The novel Aggregated Marginal Odds Ratio (AMOR) is introduced to manage scenarios where treatment varies over time and the outcome recurs. Through Monte Carlo simulations, we demonstrate that AMOR can be estimated with low bias and stable variance, when employing appropriate stabilized weight models, for both absorbing and non-absorbing treatment settings. With the 1997 National Longitudinal Study of Youth dataset, we investigate the causal effect of youth smoking on their recurrent enrollment and dropout from school, with the proposed AMOR estimator. In the third chapter, the focus shifts to the causal effect of static treatment on recurrent event outcomes with time-varying covariates. We derive the identifying assumptions and employ a variety of estimators for the average causal effect estimation, addressing the issues of time-varying confounding and censoring. We conduct simulations to verify the robustness of these methods against potential model misspecifications. Among the proposed estimators, we conclude that the targeted maximum likelihood (TML) estimator is the appropriate one for complex longitudinal settings. Therefore, we apply targeted maximum likelihood estimation to the Systolic Blood Pressure Intervention Trial (SPRINT) dataset.
Adopting an intention-to-treat analysis, we estimate the average causal effect of intensive versus standard blood pressure lowering therapy on acute kidney injury recurrences for participants surviving the first four years of SPRINT. Chapter four further investigates the average causal effect of time-varying treatments on the recurrence outcome of interest with censoring. Building on the methodologies in the preceding chapter, this chapter explores the singly and doubly robust estimators, especially the TML estimator, in the time-varying treatment context. Then simulation studies are conducted to support the theoretical derivations and validate the robustness of the estimators. The application of the proposed methods to the SPRINT dataset yields some insightful findings. By incorporating participants' medication adherence levels over time as part of the treatment, we are able to investigate various adherence-related questions and to shift from intention-to-treat to per-protocol analysis when estimating causal effects comparing the intensive versus standard blood pressure therapies. The dissertation concludes with a summary of the main findings and a discussion of significant and promising areas for future research in Chapter five. The studies conducted demonstrate the potential of advanced causal inference methods in handling the complexities of longitudinal data in medical and social research, offering valuable insights into how treatment strategies affect the recurrent causal outcomes over time.
This work not only contributes to the theoretical advancements in statistical methodologies but also provides practical implications for the analysis of clinical trials and observational studies involving recurrent events.

Item: Empirical Likelihood Methods for Causal Inference (University of Waterloo, 2024-08-21) Huang, Jingyue

This thesis develops empirical likelihood methods for causal inference, focusing on the estimation and inference of the average treatment effect (ATE) and the causal quantile treatment effect (QTE). Causal inference has been a critical research area for decades, as it is essential for understanding the true impact of interventions, policies, or actions, thereby enabling informed decision-making and providing insights into the mechanisms shaping our world. However, directly comparing responses between treatment and control groups can yield invalid results due to potential confounders in treatment assignments. In Chapter 1, we introduce fundamental concepts in causal inference under the widely adopted potential outcome framework and discuss the challenges in observational studies. We formulate our research problems concerning the estimation and inference of the ATE and review some commonly used methods for ATE estimation. Chapter 2 provides a brief review of traditional empirical likelihood methods, followed by the pseudo-empirical likelihood (PEL) and sample empirical likelihood (SEL) approaches in survey sampling for one-sample problems. In Chapter 3, we propose two inferential procedures for the ATE using a two-sample PEL approach. The first procedure employs estimated propensity scores for the formulation of the PEL function, resulting in a maximum PEL estimator of the ATE equivalent to the inverse probability weighted estimator discussed in the literature. Our focus in this scenario is on the PEL ratio statistic and establishing its theoretical properties.
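
For readers unfamiliar with inverse probability weighting, the following minimal sketch shows one common form of the estimator. The abstract does not say which variant the maximum PEL estimator matches, so the normalized (Hájek-type) form used here is an assumption, the propensity scores are taken as given rather than estimated, and `hajek_ipw_ate` is a hypothetical name.

```python
def hajek_ipw_ate(treatment, outcome, propensity):
    """Normalized inverse probability weighted estimate of the ATE.

    treatment:  0/1 indicators of treatment assignment
    outcome:    observed responses
    propensity: P(treatment = 1 | covariates), assumed known here
    The normalized form is an assumption made for illustration.
    """
    # Treated units are weighted by 1/e, control units by 1/(1 - e).
    w1 = [t / e for t, e in zip(treatment, propensity)]
    w0 = [(1 - t) / (1.0 - e) for t, e in zip(treatment, propensity)]
    if sum(w1) == 0 or sum(w0) == 0:
        raise ValueError("need at least one treated and one control unit")
    mu1 = sum(w * y for w, y in zip(w1, outcome)) / sum(w1)
    mu0 = sum(w * y for w, y in zip(w0, outcome)) / sum(w0)
    return mu1 - mu0


if __name__ == "__main__":
    t = [1, 1, 0, 0]
    y = [3.0, 5.0, 1.0, 2.0]
    e = [0.5, 0.25, 0.5, 0.2]
    print(hajek_ipw_ate(t, y, e))
```

When every propensity score equals 0.5, the weights are constant within each group and the estimate reduces to the plain difference in group means. Unequal scores reweight each group toward the full population.
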
The second procedure incorporates outcome regression models for PEL inference through model-calibration constraints, and the resulting maximum PEL estimator of the ATE is doubly robust. Our main theoretical result in this case is the establishment of the asymptotic distribution of the PEL ratio statistic. We also propose a bootstrap method for constructing PEL ratio confidence intervals for the ATE to bypass the scaling constant which is involved in the asymptotic distribution of the PEL ratio statistic but is very difficult to calculate. Finite sample performances of our proposed methods with comparisons to existing ones are investigated through simulation studies. A real data analysis to examine the ATE of maternal smoking during pregnancy on birth weights using our proposed methods is also presented. In Chapter 4, we explore two SEL-based approaches for the estimation and inference of the ATE. Both involve a traditional two-sample empirical likelihood function with different ways of incorporating propensity scores. The first approach introduces propensity scores-calibrated constraints alongside the standard model-calibration constraints, while the second approach uses propensity scores to form weighted versions of the model-calibration constraints. Both approaches result in doubly robust estimators, and we derive the limiting distributions of the two SEL ratio statistics to facilitate the construction of confidence intervals and hypothesis tests for the ATE. Bootstrap methods for constructing SEL ratio confidence intervals are also discussed for both approaches. We investigate finite sample performances of the methods through simulation studies. While inferences on the ATE are an important problem with many practical applications, analyzing the QTE is equally important as it reveals intervention impacts across different population segments. 
In Chapter 5, we extend the PEL and the two SEL approaches from Chapters 3 and 4, each augmented with model-calibration constraints, to develop doubly robust estimators for the QTE. Two types of model-calibration constraints are proposed: one leveraging multiple imputations of potential outcomes and the other employing direct modeling of indicator functions. We calculate two types of bootstrap-calibrated confidence intervals for each of the six formulations, using point estimators and empirical likelihood ratios, respectively. We also discuss computational challenges and present simulation results. Our proposed approaches support the integration of multiple working models, facilitating the development of multiply robust estimators, distinguishing our methods from existing approaches. Chapter 6 summarizes the contributions of this thesis and outlines some research topics for future work.

Item: Modeling and Bayesian Computations for Capture-Recapture Studies (University of Waterloo, 2024-08-19) Wang, Yiran

Capture-recapture methods are often used for population size estimation, which plays a fundamental role in informing management decisions in ecology and epidemiology. In this thesis, we develop novel approaches to population size estimation that more comprehensively incorporate various sources of statistical uncertainty in the data which are often overlooked. By addressing these uncertainties, our methods provide more accurate and reliable estimates of the parameters of interest. Furthermore, we introduce various techniques to enhance computational efficiency, particularly in the context of Markov Chain Monte Carlo (MCMC) algorithms used for Bayesian inference. In Chapter 2, we delve into the plant-capture method, which is a special case of classical capture-recapture techniques. In this method, decoys referred to as "plants" are introduced into the population to estimate the capture probability.
The method has shown considerable success in estimating population sizes from limited samples in many epidemiological, ecological, and demographic studies. However, previous plant-recapture studies have not systematically accounted for uncertainty in the capture status of each individual plant. To address this issue, we propose a novel modeling framework to formally incorporate uncertainty into the plant-capture model arising from (i) the capture status of plants and (ii) the heterogeneity between multiple survey sites. We present two inference methods and compare their performance through simulation studies. We then apply these methods to estimate the homeless population size in five U.S. cities using the large-scale "S-night" study conducted by the U.S. Census Bureau. In Chapter 3, we look into the uncertainty in compositional data. Understanding population composition is essential in many ecological, evolutionary, conservation, and management contexts. Modern methods like genetic stock identification (GSI) allow for estimating the proportions of individuals from different subpopulations using genetic data. These estimates are ideally obtained through mixture analysis, which can provide standard errors that reflect the uncertainty in population composition accurately. However, traditional methods that rely on historical data often only account for sample-level uncertainty, making them inadequate for estimating population-level uncertainties. To address this issue, we develop a reverse Dirichlet-multinomial model and multiple variance estimators to effectively propagate uncertainties from the sample-level composition to the population level. We extend this approach to genetic mark-recapture scenarios, validate it with simulation studies, and apply it to estimate the escapement of Sockeye Salmon (Oncorhynchus nerka) in the Taku River. 
In Chapter 4, motivated by the long run times of some of the Bayesian computations in this thesis, we shift our focus to the development and evaluation of Bayesian credible intervals. Markov chain Monte Carlo (MCMC) methods are crucial for sampling from posterior distributions in Bayesian analysis. However, slow convergence or mixing can hinder obtaining a large effective sample size due to limited computational resources. This issue is particularly significant when estimating credible interval quantiles, which require more MCMC iterations than posterior means, medians, or variances. Consequently, prematurely stopping MCMC chains can lead to inaccurate credible interval estimates. To mitigate this issue in cases where the posterior distribution is approximately normal, we make a case for the use of parametric quantile estimation for determining credible interval endpoints. This chapter investigates the asymptotic properties of the parametric quantile estimation and compares it with the empirical quantile method to illustrate performance as MCMC chains are prolonged. Furthermore, we apply these techniques to a real-world capture-recapture dataset on Leisler’s bat to compare their performance in a practical scenario. Overall, this thesis contributes to the field of population size estimation by developing innovative statistical methods that improve accuracy and computational efficiency. Our work addresses critical uncertainties and provides practical solutions for ecological and epidemiological applications, demonstrating the broad applicability and impact of advanced capture-recapture methodologies.

Item: Estimation risk and optimal combined portfolio strategies (University of Waterloo, 2024-08-13) Huang, Zhenzhen

The traditional Mean-Variance (MV) framework of Markowitz (1952) has been the foundation of numerous research works for many years, benefiting from its mathematical tractability and intuitive clarity for investors.
However, a significant limitation of this framework is its dependence on the mean vector and covariance matrix of asset returns, which are generally unknown and have to be estimated using historical data. The resulting plug-in portfolio, which uses these estimates instead of the true parameter values, often exhibits poor out-of-sample performance due to estimation risk. A considerable amount of research proposes various sophisticated estimators for these two unknown parameters or introduces portfolio constraints and regularizations. In this thesis, however, we focus on an alternative approach to mitigate estimation risk by utilizing combined portfolios and directly optimizing the expected out-of-sample performance. We review the relevant literature and present essential preliminary discussions in Chapter 1. Building on this, we introduce three distinct perspectives in portfolio selection, each aimed at assessing the efficiency of combined portfolios in managing estimation risk. These perspectives guide the detailed examination of research projects presented in the subsequent three chapters of the thesis. Chapter 2 discusses the Tail Mean-Variance (TMV) portfolio selection with estimation risk. The TMV risk measure has emerged from the actuarial community as a criterion for risk management and portfolio selection, with a focus on extreme losses. The existing literature on portfolio optimization under the TMV criterion relies on the plug-in approach, which introduces estimation risk and leads to significant deterioration in the out-of-sample portfolio performance. To address this issue, we propose a combination of the plug-in and 1/N rules and optimize its expected out-of-sample performance. Our study is based on the Mean-Variance-Standard-deviation (MVS) performance measure, which encompasses the TMV, classical MV, and Mean-Standard-Deviation (MStD) as special cases. 
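
The idea of combining a plug-in rule with the 1/N rule can be sketched as follows. This is only an illustration in the classical MV special case for two assets: the combination coefficient `alpha` is passed in by hand rather than optimized as in the thesis, and the function names, the risk-aversion parameter, and the input values are all assumptions.

```python
def plug_in_mv_weights(mean, cov, risk_aversion):
    """Plug-in mean-variance weights (1/gamma) * inverse(cov) * mean.

    Restricted to two assets so the 2x2 inverse can be written out.
    `mean` and `cov` stand for estimates from historical data.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    inv = [[d / det, -b / det], [-c / det, a / det]]
    return [
        sum(inv[i][j] * mean[j] for j in range(2)) / risk_aversion
        for i in range(2)
    ]


def combine_with_equal_weights(plug_in, alpha):
    """Return alpha * plug-in weights + (1 - alpha) * 1/N weights.

    `alpha` is a hypothetical, user-supplied combination coefficient.
    """
    n = len(plug_in)
    return [alpha * w + (1.0 - alpha) / n for w in plug_in]


if __name__ == "__main__":
    w_plug = plug_in_mv_weights(
        mean=[0.1, 0.2],
        cov=[[0.04, 0.0], [0.0, 0.16]],
        risk_aversion=5.0,
    )
    print(w_plug)
    print(combine_with_equal_weights(w_plug, alpha=0.4))
```

Setting `alpha` to 1 recovers the plug-in portfolio and setting it to 0 recovers the 1/N portfolio. Intermediate values shrink the noisy plug-in weights toward equal weighting.
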
The MStD criterion is particularly relevant to mean-risk portfolio selection when risk is assessed using quantile-based risk measures. Our proposed combined portfolio consistently outperforms the plug-in MVS and 1/N portfolios in both simulated and real-world datasets. Chapter 3 focuses on Environmental, Social, and Governance (ESG) investing with estimation risk taken into account. Recently, there has been a significant increase in the commitment of institutional investors to responsible investment. We explore an ESG constrained framework that integrates the ESG criteria into decision-making processes, aiming to enhance risk-adjusted returns by ensuring that the total ESG score of the portfolio meets a specified target. The optimal ESG portfolio satisfies a three-fund separation. However, similar to the traditional MV portfolio, the practical application of the optimal ESG portfolio often encounters estimation risk. To mitigate estimation risk, we introduce a combined three-fund portfolio comprising components corresponding to the plug-in ESG portfolio, and we derive the optimal combination coefficients under the expected out-of-sample MV utility optimization, incorporating either an inequality or equality constraint on the expected total ESG score of the portfolio. Both simulation and empirical studies indicate that the implementable combined portfolio outperforms the plug-in ESG portfolio. Chapter 4 introduces a novel Winning Probability Weighted (WPW) framework for constructing combined portfolios from any pair of constituent portfolios. This framework is centered around the concept of winning probability, which evaluates the likelihood that one constituent portfolio will outperform another in terms of out-of-sample returns. To ensure comparability, the constituent portfolios are adjusted to align with their long-term risk profiles. 
We utilize machine learning techniques that incorporate financial market factors alongside historical asset returns to estimate the winning probabilities, which are then taken as the combination coefficients for the combined portfolio. Additionally, we optimize the expected out-of-sample MV utility of the combined portfolio to enhance its performance. Extensive empirical studies demonstrate the superiority of the proposed WPW approach over existing analytical methods in terms of certainty equivalent return across various scenarios. Finally, Chapter 5 summarizes the thesis and outlines potential directions for further research.

Item Estimands in Randomized Clinical Trials with Complex Life History Processes (University of Waterloo, 2024-08-09) Bühler, Alexandra

Clinical trials in oncology, cardiovascular disease and many other settings deal with complex outcomes involving multiple endpoints, competing or semi-competing risks, loss to follow-up and cointerventions related to the management and care of patients. These all complicate the design, analysis and interpretation of randomized trials. In such settings, traditional analyses of the time to some event are not sufficient for assessing new treatments. This thesis discusses the issues involved in the specification of estimands within a comprehensive multistate model framework. Intensity-based multistate models are used to (a) conceptualize the event-generating process involving primary and post-randomization outcomes, (b) define and interpret causal estimands based on observable marginal features of the process, and (c) conduct secondary analyses of marginal treatment effects. For a broad range of disease process settings, we investigate how marginal estimands depend on the full intensity-based process. Using large sample theory, factors influencing the limiting values of estimators of treatment effect in generalized linear models for marginal process features are studied.
Rejection rates of a variety of hypothesis tests based on marginal regression models are also examined in terms of the true intensity-based process; based on these findings robustness properties are established. Such numerical investigations give insights into the interpretation and use of marginal estimands in randomized trials. We discuss in detail estimands based on cumulative incidence function regression for semi-competing risks processes and mean function regression for processes involving recurrent and terminal events. Specification of utilities for different disease-related outcomes, rescue interventions and other post-randomization events facilitates synthesis of information on complex disease processes, enabling simple causal treatment comparisons. Derivations are provided of an infinitesimal jackknife variance estimator for utility-based estimands to facilitate robust methods for causal inference in randomized trials.

Item Optimization, model uncertainty, and testing in risk and insurance (University of Waterloo, 2024-07-11) Jiao, Zhanyi

This thesis focuses on three important topics in quantitative risk management and actuarial science: risk optimization, risk sharing, and statistical hypothesis testing in risk. For risk optimization, we concentrate on settings with model uncertainty, where only partial information about the underlying distribution is available. One key highlight, detailed in Chapter 2, is the development of a novel formula named the reverse Expected Shortfall (ES) optimization formula. This formula is derived to better facilitate the calculation of the worst-case mean excess loss under two commonly used model uncertainty sets – moment-based and distance-based (Wasserstein) uncertainty sets. Further exploration reveals that the reverse ES optimization formula is closely related to the Fenchel-Legendre transforms, and our formulas are generalized from ES to optimized certainty equivalents, a popular class of convex risk measures.
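For orientation, the classical ES optimization formula of Rockafellar and Uryasev is recalled below; the reverse formula developed in the thesis is not reproduced here. The notation is an assumption of this note: $X$ is a loss with finite mean, $\alpha \in (0,1)$ is the confidence level, and $(y)_+ = \max(y, 0)$.

```latex
\mathrm{ES}_\alpha(X)
  = \frac{1}{1-\alpha} \int_\alpha^1 \mathrm{VaR}_\beta(X)\,\mathrm{d}\beta
  = \min_{x \in \mathbb{R}} \left\{ x + \frac{1}{1-\alpha}\, \mathbb{E}\big[(X - x)_+\big] \right\}
```

The minimum is attained at $x = \mathrm{VaR}_\alpha(X)$, which is why the term $\mathbb{E}[(X - x)_+]$, a mean excess loss, appears naturally in ES calculations.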
Chapter 3 considers a different approach to derive the closed-form worst-case target semi-variance by including distributional shape information, crucial for finance (symmetry) and insurance (non-negativity) applications. We demonstrate that all results are applicable to robust portfolio selection, where the closed-form formulas greatly simplify the calculations for optimal robust portfolio selections, either through explicit forms or via easily solvable optimization problems. Risk sharing focuses on the redistribution of total risk among agents in a specific way. In contrast to the traditional risk sharing rules, Chapter 4 introduces a new risk sharing framework, anonymized risk sharing, which requires no information on preferences, identities, private operations, and realized losses from the individual agents. We establish an axiomatic theory based on four axioms of fairness and anonymity within the context of anonymized risk sharing. The development of this theory provides a solid foundation for further explorations of the decentralized and digital economy, including peer-to-peer (P2P) insurance, revenue sharing of digital content and blockchain mining pools. Hypothesis testing plays a vital role not only in statistical inference but also in risk management, particularly in the backtesting of risk measures. In Chapter 5, we address the problem of testing conditional mean and conditional variance for non-stationary data using the recently emerging concept of e-statistics. We build e-values and p-values for four types of non-parametric composite hypotheses with specified mean and variance as well as other conditions on the shape of the data-generating distribution. These shape conditions include symmetry, unimodality, and their combination. Using the obtained e-values and p-values, we construct tests via e-processes, also known as testing by betting, as well as some tests based on combining p-values for comparison.
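The idea of testing by betting can be illustrated with a deliberately simple case. The sketch below is not the construction from the thesis (which handles specified mean and variance together with shape constraints); it assumes observations in [0, 1] and tests the null hypothesis that each conditional mean is at most `m`. The function names, the fixed bet size `lam`, and the data are all hypothetical.

```python
def betting_e_process(xs, m, lam):
    """Running wealth of a bettor testing H0: conditional mean <= m, data in [0, 1].

    Each factor 1 + lam * (x - m) is non-negative when 0 <= lam <= 1 / m and has
    expectation at most 1 under H0, so the running product is an e-process.
    """
    if not 0 < m < 1:
        raise ValueError("m must lie strictly between 0 and 1")
    if not 0 <= lam <= 1 / m:
        raise ValueError("lam must lie in [0, 1 / m] to keep wealth non-negative")
    wealth = 1.0
    path = []
    for x in xs:
        if not 0 <= x <= 1:
            raise ValueError("observations must lie in [0, 1]")
        wealth *= 1 + lam * (x - m)
        path.append(wealth)
    return path


def rejects(path, alpha):
    """Reject H0 at level alpha once wealth reaches 1 / alpha (Ville's inequality)."""
    return any(w >= 1 / alpha for w in path)


# Hypothetical data: one stream well above the null mean, one exactly at it.
high = betting_e_process([0.9] * 20, m=0.5, lam=1.0)
flat = betting_e_process([0.5] * 20, m=0.5, lam=1.0)
print(rejects(high, alpha=0.05), rejects(flat, alpha=0.05))
```

Because the rejection rule is valid at any stopping time, the bettor may monitor the wealth continuously, which is one reason e-processes suit non-stationary data observed sequentially.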
To demonstrate the practical application of these methodologies, empirical studies using financial data are conducted under several settings.

Item Design with Sampling Distribution Segments (University of Waterloo, 2024-07-09) Hagar, Luke

In most settings where data-driven decisions are made, these decisions are informed by two-group comparisons. Characteristics – such as median survival times for two cancer treatments, defect rates for two assembly lines, or average satisfaction scores for two consumer products – quantify the impact of each choice available to decision makers. Given estimates for these two characteristics, such comparisons are often made via hypothesis tests. This thesis focuses on sample size determination for hypothesis tests with interval hypotheses, including standard one-sided hypothesis tests, equivalence tests, and noninferiority tests in both frequentist and Bayesian settings. To choose sample sizes for nonstandard hypothesis tests, simulation is used to estimate sampling distributions of e.g., test statistics or posterior summaries corresponding to various sample sizes. These sampling distributions provide context as to which estimated values for the two characteristics are plausible. By considering quantiles of these distributions, one can determine whether a particular sample size satisfies criteria for the operating characteristics of the hypothesis test: power and the type I error rate. It is standard practice to estimate entire sampling distributions for each sample size considered. The computational cost of doing so impedes the adoption of non-simplistic designs. However, only quantiles of the sampling distributions must be estimated to assess operating characteristics. To improve the scalability of simulation-based design, we could focus only on exploring the segments of the sampling distributions near the relevant quantiles. This thesis proposes methods to explore sampling distribution segments for various designs.
These methods are used to determine sample sizes and decision criteria for hypothesis tests with orders of magnitude fewer simulation repetitions. Importantly, this reduction in computational complexity is achieved without compromising the consistency of the simulation results that is guaranteed when estimating entire sampling distributions. In parametric frequentist hypothesis tests, test statistics are often constructed from exact pivotal quantities. To improve sample size determination in the absence of exact pivotal quantities, we first propose a simulation-based method for power curve approximation with such hypothesis tests. This method leverages low-discrepancy sequences of sufficient statistics and root-finding algorithms to prompt unbiased sample size recommendations using sampling distribution segments. We also propose a framework for power curve approximation with Bayesian hypothesis tests. The corresponding methods leverage low-discrepancy sequences of maximum likelihood estimates, normal approximations to the posterior, and root-finding algorithms to explore segments of sampling distributions of posterior probabilities. The resulting sample size recommendations are consistent in that they are suitable when the normal approximations to the posterior and sampling distribution of the maximum likelihood estimator are appropriate. When designing Bayesian hypothesis tests, practitioners may need to specify various prior distributions to generate and analyze data for the sample size calculation. Specifying dependence structures for these priors in multivariate settings is particularly difficult. The challenges with specifying such dependence structures have been exacerbated by recommendations made alongside recent advances with copula-based priors. We prove theoretical results that can be used to help select prior dependence structures that align with one's objectives for posterior analysis. 
We lastly propose a comprehensive method for sample size determination with Bayesian hypothesis tests that considers our recommendations for prior specification. Unlike our framework for power curve approximation, this method recommends probabilistic cutoffs that facilitate decision making while controlling both power and the type I error rate. This scalable approach obtains consistent sample size recommendations by estimating segments of two sampling distributions, one for each operating characteristic. We also extend our design framework to accommodate more complex two-group comparisons that account for additional covariates.

Item Measures for risk, dependence and diversification (University of Waterloo, 2024-06-20) Lin, Liyuan

Two primary tasks in quantitative risk management are measuring risk and managing risk. Risk measures and dependence modeling are important tools for assessing portfolio risk, which have gained much interest in the literature of finance and actuarial science. The assessment of risk further serves to address risk management problems, such as portfolio optimization and risk sharing. Value-at-Risk (VaR) and Expected Shortfall (ES) are the most widely used risk measures in banking and insurance regulation. The Probability Equivalent Level of VaR-ES (PELVE) is a new risk metric designed to bridge VaR and ES. In Chapter 2, we investigate the theoretical properties of PELVE and address the calibration problem of PELVE, that is, to find a distribution model that yields a given PELVE. Joint mixability, dependence of a random vector with a constant sum, is considered an extreme negative dependence as it represents a perfectly diversified portfolio. Chapter 3 explores the relationship between joint mix and some negative dependence notions in statistics. We further show that the negatively dependent joint mix plays a crucial role in solving the multi-marginal optimal transport problem under the uncertainty in the components of risks.
Diversification is a traditional strategy for mitigating portfolio risk. In Chapter 4, we employ an axiomatic approach to introduce a new diversification measurement called the diversification quotient (DQ). DQ exhibits many attractive properties not shared by existing diversification indices in terms of interpretation for dependence, ability to capture common shocks and tail heaviness, as well as efficiency in portfolio optimization. Chapter 5 provides some technical details and illustrations to support Chapter 4. Moreover, the DQs based on VaR and ES have simple formulas for computation. We explore asymptotic behavior of VaR-based DQ and ES-based DQ for large portfolios, the elliptical model, and the multivariate regularly varying (MRV) model in Chapter 6, as well as the portfolio optimization problems for the elliptical and MRV models. Counter-monotonicity, as the converse of comonotonicity, is a natural extreme negative dependence. Chapter 7 conducts a systematic study of pairwise counter-monotonicity. We obtain its stochastic representation, invariance property, interactions with negative association, and equivalence to joint mix within the same Fréchet class. We also show that Pareto-optimal allocations for quantile agents exhibit pairwise counter-monotonicity. This finding contrasts sharply with traditional comonotonic allocations for risk-averse agents, inspiring further investigation into the appearance of pairwise counter-monotonic allocation in risk-sharing problems. In Chapter 8, we address the risk-sharing problem for agents using distortion riskmetrics, who are not necessarily risk-averse or monotone. Our results indicate that Pareto-optimal allocations for inter-quantile difference agents include pairwise counter-monotonicity. Chapter 9 further explores other decision models in risk-sharing that exhibit pairwise counter-monotonicity in optimal allocations.
We introduce a counter-monotonic improvement theorem – a converse result to the widely used comonotonic improvement theorem. Furthermore, we show that pairwise counter-monotonic allocations are Pareto optimal for risk-seeking agents, Bernoulli utility agents, and rank-dependent expected utility agents under certain conditions. Besides the studies of two extreme negative dependencies, we expand our analysis to dependence modeling through Pearson correlation and copula. In Chapter 10, we characterize all dependence structures for a bivariate random vector that preserve its Pearson correlation coefficient under any common marginal transformations. For multivariate cases, we characterize all invariant correlation matrices and explore the application of invariant correlation in sample duplication. Chapter 11 discusses the selection of copulas when marginals are discontinuous. The checkerboard copula is a desirable choice. We show that the checkerboard copula has the largest Shannon entropy and carries the dependence information of the original random vector.

Item A Differentiable Particle Filter for Jump-Diffusion Stochastic Volatility Models (University of Waterloo, 2024-05-28) Ko, Michelle

Stochastic volatility with jumps has emerged as a crucial tool for understanding and modelling the stochastic and intermittently discontinuous nature of many processes in finance. Due to the highly nonlinear structure of these models, their likelihood functions are often unavailable in closed form. A common numerical approach is to reformulate the original model as a state-space model, under which the marginal likelihood of the parameters can be estimated efficiently by integrating the latent variables via particle filtering. A combination of such particle-estimated likelihood and Markov Chain Monte Carlo can be used to sample from parameter posteriors, but imposes a substantial computational burden in multi-dimensional parameter space.
Bayesian normal approximation serves as a more efficient alternative, if the mode and quadrature of the stochastic approximation of the posterior can be obtained via a gradient-based method. This is not immediately possible, however, as the particle-estimated marginal posterior is not differentiable due to (1) the inherent discontinuity of jumps in the model, and (2) the widely used multinomial resampling technique in particle filtering. This thesis presents a novel construction of a particle filter that incorporates a multivariate normal resampler and circumvents the jump-induced discontinuity with a customized proposal density, thereby attaining full differentiability of the marginal posterior estimate. A comprehensive simulation study and application to S&P 500 Index data are provided to investigate the performance of the differentiable particle filter for parameter inference and volatility recovery.

Item Optimization of Policy Evaluation and Policy Improvement Methods in Portfolio Optimization using Quasi-Monte Carlo Methods (University of Waterloo, 2024-05-24) Orok, Gavin

Machine learning involves many challenging integrals that can be estimated using numerical methods. One application of these methods which has been explored in recent work is the estimation of policy gradients for reinforcement learning. That work found that, for many standard continuous control problems, randomized Quasi-Monte Carlo (RQMC) and Array-RQMC, numerical methods that use low-discrepancy point sets, improved the efficiency of both policy evaluation and policy gradient-based policy iteration compared to standard Monte Carlo (MC). We extend this work by investigating the application of these numerical methods to model-free reinforcement learning algorithms in portfolio optimization, which are of interest because they do not rely on complex model assumptions that pose difficulties to other analytical methods.
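The contrast between MC and RQMC can be seen on a one-dimensional integral. The sketch below is a toy illustration, not the reinforcement learning setup or the Array-RQMC method studied in the thesis: it randomizes a base-2 van der Corput sequence with a single uniform shift modulo 1 (a Cranley-Patterson rotation), and the integrand, sample size, and seed are hypothetical choices.

```python
import random


def van_der_corput(i, base=2):
    """i-th point of the van der Corput low-discrepancy sequence in [0, 1)."""
    q, denom = 0.0, 1.0
    while i > 0:
        denom *= base
        i, digit = divmod(i, base)
        q += digit / denom
    return q


def mc_estimate(f, n, rng):
    """Standard Monte Carlo: average f over n independent uniform draws."""
    return sum(f(rng.random()) for _ in range(n)) / n


def rqmc_estimate(f, n, rng):
    """Randomized QMC: shift every low-discrepancy point by one uniform draw, mod 1."""
    shift = rng.random()
    return sum(f((van_der_corput(i) + shift) % 1.0) for i in range(n)) / n


def square(u):
    return u * u  # the integral over [0, 1] is exactly 1/3


rng = random.Random(2024)
mc = mc_estimate(square, 1024, rng)
rqmc = rqmc_estimate(square, 1024, rng)
print(abs(mc - 1 / 3), abs(rqmc - 1 / 3))
```

The random shift keeps the estimator unbiased while preserving the even spacing of the points, which is the property that lets RQMC reduce variance for smooth integrands.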
We find that RQMC significantly outperforms MC under all conditions for policy evaluation and that Array-RQMC outperforms both MC and RQMC in policy iteration with a strategic choice of the reordering function.

Item Edge Estimation and Community Detection in Time-varying Networks (University of Waterloo, 2024-04-30) Jian, Jie

In modern statistics and data science, there is a growing focus on network data that indicate interactions among a group of items in a complex system. Scientists are interested in these data as they can reveal important insights into the latent structure present among the nodes of a network. The emerging family of statistical methods effectively addresses these modeling demands in static networks. However, the evolving nature of network structures over time introduces unique challenges not present in static networks. Specifically, in dynamic networks, we want to characterize their smooth change which also controls the model complexity. To achieve this, we need to impose structural assumptions about the similarity of neighboring networks, and this usually will pose computational challenges. This thesis studies three aspects of the statistical analysis in time-varying network problems. First, to identify the dynamic changes of associations among multivariate random variables, we propose a time-varying Gaussian graphical model with two different regularization methods imposed to characterize the smooth change of neighboring networks. These methods lead to non-trivial optimization problems that we solve by developing efficient computational methods based on the Alternating Direction Method of Multipliers algorithm. Second, given the observed time-varying financial relationships among nodes, such as their trading amounts in dollars, we propose new stochastic block models based on a restricted Tweedie distribution to accommodate non-negative continuous edge weights with a positive probability of zero counts. The model can capture dynamic nodal effects.
We prove that the estimation of the dynamic covariate effects is asymptotically independent of assigned community labels, allowing for an efficient two-step algorithm. Third, when the timestamp of node interactions is accessible, we aim to enhance the modeling of the distribution of survival time of network interactions, especially in the presence of censoring. In addressing this, we employ Cox proportional hazard models to investigate the influence of community structures on the formation of networks. Overall, this thesis provides new methods for modeling and computing time-varying network problems.

Item Sequential Monte Carlo for Applications in Structural Biology, Financial Time Series and Epidemiology (University of Waterloo, 2024-04-25) Hou, Zhaoran

Sequential Monte Carlo (SMC) methods are widely used to draw samples from intractable target distributions. Moreover, they have also been adopted in other computational methods for inference such as the particle Markov chain Monte Carlo methods and the SMC$^2$ methods. In practice, some difficulties arise and hinder the use of SMC-based methods; examples are the degeneracy of the particles in SMC and the intractability of the target distribution. This thesis addresses these challenges across various domains and proposes effective solutions. This thesis introduces SMC and SMC-based methods with specific challenges in three diverse fields. Firstly, we propose an SMC method for sampling protein structures from the highly constrained Boltzmann distribution, a task crucial for studying the Boltzmann distribution of protein structures and estimating atomic contacts in viral proteins such as those of SARS-CoV-2.
Secondly, we present a particle Gibbs sampler incorporating the approximate Bayesian computation strategy for stochastic volatility models with intractable likelihoods, offering a solution for parameter inference in financial data analysis by fitting stochastic volatility models to S&P 500 Index time-series data during the 2008 financial crisis. Finally, we introduce a compartmental model with stochastic transmission dynamics and covariates, facilitating better alignment with real-world data for modeling the spread of COVID-19 in Ontario, for which we employ an SMC$^2$ algorithm incorporating the approximate Bayesian computation strategy.

Item Design and Analysis of Experiments on Networks (University of Waterloo, 2024-04-17) Bui, Trang

In the design and analysis of experiments, it is often assumed that experimental units are independent, in the sense that the treatment assigned to one unit will not affect the potential outcome of another unit. However, this assumption may not hold if the experiment is conducted on a network of experimental units. The treatment assignment of one unit can spread to its neighbors via their network connections. The growing popularity of online experiments conducted on social networks calls for more research on this topic. We investigate the problem of experiments on networks and propose new approaches to both the design and analysis of such experiments.

Item Measurement System Assessment Studies for Multivariate and Functional Data (University of Waterloo, 2024-04-15) Lashkari, Banafsheh

A measurement system analysis involves understanding and quantifying the variability in measurement data attributed to the measurement system. A primary goal of such analyses is to assess the measurement system's impact on the overall variability of the data, determining its suitability for the intended purpose.
While there are established methods for evaluating measurement systems for a single variable, their applicability is limited when dealing with other data types, such as multivariate and functional data. This thesis addresses a critical gap in the literature concerning the assessment of measurement systems when dealing with multivariate and functional observations. The primary objective is to enhance the understanding of measurement system assessment studies, particularly focusing on multivariate measurements and extending to functional data measurements. Chapter 1 serves as an introduction. We review several statistical properties and parameters for assessing the measurement systems. This chapter includes some real-world examples of measurement system assessment problems for multivariate and functional data and elaborates on the challenges involved. We also outline the contents that will be explored in the subsequent chapters. While the literature on measurement system analysis in multivariate and functional data domains is limited, there is also a notable absence of a systematic theoretical investigation for univariate methods. In Chapter 2, we address this gap by conducting a thorough theoretical examination of measurement system assessment estimators for univariate data. The chapter explores various estimation methods for estimating variance components and other essential parameters crucial for measurement system analysis. We provide a comprehensive scrutiny of the statistical properties of these estimators. This foundational understanding serves as the basis for subsequent exploration into the more intricate domains of multivariate and functional data. In Chapter 3, we extend the scope of measurement system assessment to include multivariate data. This chapter involves adapting the definitions of measurement system assessment parameters to multivariate settings. 
We employ transformations that yield summary scalar measures for variance-covariance matrices, with a specific focus on the determinant, trace, and Frobenius norm of the variance-covariance matrix components. Building upon the statistical concepts and properties discussed in Chapter 2, we conduct a targeted review of existing theories related to variance-covariance component estimation. A key emphasis is placed on the statistical properties of estimators introduced for one of the parameters in measurement system assessment—the signal-to-noise ratio. Our investigation includes an exploration of its convergence properties and the construction of approximate confidence intervals. Additionally, we conduct a comparative analysis of the application of three transformations, namely, the determinant, the trace, and the Frobenius norm, based upon their asymptotic properties. In Chapter 4, our exploration takes a significant step forward as we establish a framework for assessing measurement systems tailored to functional data types. This involves extending the definition of parameters used in the evaluation of measurement systems for univariate data by applying bounded operators on covariance kernels. To estimate the measurement system assessment parameters, we first provide methods to estimate the covariance kernel components. Initially, we explore a classical estimation approach without smoothing. Subsequently, we leverage specialized tools in functional data analysis, within the framework of reproducing kernel Hilbert space (RKHS), to obtain smooth estimates of the covariance kernel components. The fifth chapter is devoted to a case study application, where we apply the developed framework to a real-world functional dataset. Specifically, we analyze the surface roughness of printed products in the context of additive manufacturing. 
The comprehensive analysis in Chapter 5 employs statistical methods for univariate and multivariate data types and techniques from functional data analysis. We are in the process of converting the materials in Chapters 2, 3, and 4 to three separate articles for submission.

Item Explorations in Pairwise Measures of Dependence and Pooled Significance (University of Waterloo, 2024-01-22) Salahub, Chris

In the exploration of data sets with many variables, the search for interesting pairs is often the first step of analysis. This search builds a road map of the entirety of data before looking at its details, and can provide indispensable inspiration for deeper investigation. Challenges are present, however, in adjusting results to address the multiple testing problem and choosing a measure with sufficient generality to detect many forms of dependence. This work proposes the measurement of statistical dependence by recursive binning of marginal ranks as a flexible measure of dependence. Simulation studies are used to characterize the null distribution and demonstrate the method’s sensitivity to different data patterns. By splitting bins randomly, the χ² statistic has a null distribution conservatively approximated by the χ² distribution seemingly without a loss of power compared to maximized splitting rules, which have inflated statistic values. The method is demonstrated on real S&P 500 constituent data. To adjust for multiple testing, a new framework and coefficient are devised with appropriate proofs for analyzing pooled p-values based on their tendency to detect concentrated or diffuse evidence. This motivates a pooled p-value based on the χ² quantile function as a way to adjust for multiple testing while controlling the family-wise error rate and fine-tuning for the evidence pattern of interest. Simulation studies suggest this method is similarly powerful to the uniformly most powerful method while being more robust to misspecification.
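A familiar pooled p-value built from the χ² quantile function is Fisher's method, in which each p-value is mapped through the quantile function of the χ² distribution with 2 degrees of freedom (giving −2 ln p) and the sum is compared with a χ² distribution with 2M degrees of freedom. The sketch below implements only this classical case for independent tests; it is not the tunable pooled p-value or the correlation adjustment proposed in the thesis, and the function name and inputs are hypothetical.

```python
import math


def fisher_pooled_p(p_values):
    """Fisher's pooled p-value for M independent p-values.

    The statistic T = -2 * sum(log p_i) is chi-square with 2M degrees of freedom
    under the null. For even degrees of freedom the survival function has the
    closed form exp(-T/2) * sum_{k < M} (T/2)^k / k!, so no external library
    is needed.
    """
    m = len(p_values)
    if m == 0:
        raise ValueError("at least one p-value is required")
    if any(not 0 < p <= 1 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals T / 2
    term, total = 1.0, 1.0
    for k in range(1, m):
        term *= half_stat / k  # (T/2)^k / k!, built up incrementally
        total += term
    return math.exp(-half_stat) * total


print(fisher_pooled_p([0.05, 0.05]))
print(fisher_pooled_p([0.01, 0.5, 0.8]))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check that the pooling is calibrated.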
Both the recursive binning measurement of association and the χ² pooled p-value are then demonstrated for genetic data after a tutorial introducing the relevant genetic concepts. A method of moments adjustment of the χ² pooled p-value to account for correlation between tests is introduced and used with genomic and phenomic data from mice to identify regions of interest. The use of pooled p-values to combine parameter estimates in meta-analysis is also explored, establishing the concepts of evidential intervals and demonstrating their behaviour on simulated data.

Item Joint modeling, variable selection and multiply robust estimation in mediation analysis with multiple mediators (University of Waterloo, 2024-01-10) Wang, Lijia

This thesis explores topics in causal mediation analysis with multiple possibly related mediators. The goal of this thesis is to propose innovative methodologies for joint modeling of multiple uncausally related mediators, selecting mediators from high-dimensional candidates while simplifying their dependency structures and performing multiply robust estimations to uncover causal effects of interest. Causal mediation analysis aims to enhance understanding of the effects of an exposure on an outcome by examining direct and indirect effects. In settings where multiple mediators are involved, the relations among these mediators play an important role. Traditional studies focus on the scenario that the multiple mediators are either related under specified causal structures or independent given baseline covariates. Our studies focus on multiple uncausally related mediators, where the mediators are associated with each other conditional on pre-treatment covariates and treatment but there is no causal ordering among them. In Chapter 2, we begin by reviewing and expanding upon the concept of mediators that are uncausally related, followed by the introduction of causal effects defined under such settings and the associated identification assumptions.
We propose to jointly model the uncausally related mediators using copula functions. An important advantage of employing copula functions in joint modeling is their flexibility: the mediators may have different marginal distributions and be correlated in various ways. Subsequently, we propose methods for estimating causal effects within this framework. In Chapter 3, we center our attention on the sparse mediation phenomenon, where only a handful of true mediators, from a pool of possibly high-dimensional candidates, exhibit nonzero indirect effects. We propose a LASSO-based penalization technique that selects the true mediators by considering their indirect effects. Acknowledging that the selected mediators often still exhibit complex dependency structures, our method also simplifies these structures by selecting non-zero entries within the correlation matrix using a similar penalized estimation technique. To facilitate the correlation structure selection, we transform the correlation matrix selection problem into a standard variable selection problem within the framework of a linear model. Moreover, our proposed method allows the mediator selection and the dependency structure selection processes to be conducted via either a parallel or a sequential approach. Grouped and individual causal effects are defined under such settings, and estimation approaches are discussed. In Chapter 4, we discuss the issue of model misspecification within the context of causal mediation analysis. Following the discussion, we propose two ways of constructing multiply robust estimators. In causal mediation analysis, typically three working models must be specified: the treatment model, the mediator model, and the response model. Both of our multiply robust estimation methods yield consistent estimation of the causal quantities of interest, provided that any two out of the three models are correctly specified.
For each proposed method introduced in Chapters 2, 3 and 4, we provide theoretical results with proofs of consistency and other properties. We also derive large sample properties and investigate finite sample properties via simulations. Each chapter includes an application of the proposed method to a genetic study in psychiatry to investigate DNA methylation loci as mediators on the causal path between childhood trauma and stress reactivity. In Chapter 2, the proposed method estimates the mediation effects of three DNA loci on the Kit ligand gene. Chapter 3 extends this analysis and applies the proposed mediator selection method to the entire DNA methylation dataset, revealing 12 mediating loci, with 10 showing a strong association. We estimate the grouped indirect effect of these 10 loci and the individual effects of the remaining two. In Chapter 4, we employ our multiply robust estimation methods to re-evaluate the mediation effects of these 12 loci, demonstrating enhanced robustness relative to previous findings.