Statistics and Actuarial Science
Permanent URI for this collection: https://uwspace.uwaterloo.ca/handle/10012/9934
This is the collection for the University of Waterloo's Department of Statistics and Actuarial Science.
Research outputs are organized by type (e.g., Master Thesis, Article, Conference Paper).
Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.
Browsing Statistics and Actuarial Science by Issue Date
Now showing 1 - 20 of 382
Item: Coherent Beta Risk Measures for Capital Requirements (University of Waterloo, 1999). Wirch, Julia Lynn.
This thesis compares insurance premium principles with current financial risk paradigms and uses distorted probabilities, a recent development in the premium principle literature, to synthesize the current models for financial risk measures in banking and insurance. This work attempts to broaden the definition of value-at-risk beyond percentile measures. Examples are used to show how the percentile measure fails to give consistent results and how it can be manipulated. A new class of consistent risk measures is investigated.
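As a purely illustrative aside (not taken from the thesis), the contrast between a percentile measure and a distortion-based measure can be sketched numerically. The proportional-hazards transform below is one standard concave distortion; the thesis's beta distortions are another choice.

```python
import numpy as np

def distorted_risk(losses, g):
    """Distortion risk measure rho(X) = integral of g(S(x)) dx for an empirical
    sample: an L-statistic whose weights are driven by the distortion g."""
    x = np.sort(np.asarray(losses, dtype=float))
    n = len(x)
    surv = (n - np.arange(n + 1)) / n          # survival levels: 1, (n-1)/n, ..., 0
    w = g(surv[:-1]) - g(surv[1:])             # weight attached to each order statistic
    return np.sum(w * x)

def var_percentile(losses, alpha=0.95):
    """Plain percentile (VaR) measure, for comparison."""
    return np.quantile(losses, alpha)

rng = np.random.default_rng(0)
losses = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # hypothetical loss sample

identity = lambda u: u           # recovers the sample mean
ph = lambda u: u ** 0.5          # proportional-hazards distortion (concave)

print("mean         :", distorted_risk(losses, identity))
print("PH-distorted :", distorted_risk(losses, ph))
print("95% VaR      :", var_percentile(losses, 0.95))
```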
Item: Modelling Issues in Three-state Progressive Processes (University of Waterloo, 2001). Kopciuk, Karen.
This dissertation focuses on several issues pertaining to three-state progressive stochastic processes. Casting survival data within a three-state framework is an effective way to incorporate intermediate events into an analysis. These events can yield valuable insights into treatment interventions and the natural history of a process, especially when the right censoring is heavy. Exploiting the uni-directional nature of these processes allows for more effective modelling of the types of incomplete data commonly encountered in practice, as well as time-dependent explanatory variables and different time scales. In Chapter 2, we extend the model developed by Frydman (1995) by incorporating explanatory variables and by permitting interval censoring for the time to the terminal event. The resulting model is quite general and combines features of the models proposed by Frydman (1995) and Kim et al. (1993). The decomposition theorem of Gu (1996) is used to show that all of the estimating equations arising from Frydman's log likelihood function are self-consistent. An AIDS data set analyzed by these authors is used to illustrate our regression approach. Estimating the standard errors of our regression model parameters, by adopting a piecewise constant approach for the baseline intensity parameters, is the focus of Chapter 3. We also develop data-driven algorithms which select changepoints for the intervals of support, based on the Akaike and Schwarz Information Criteria. A sensitivity study is conducted to evaluate these algorithms. The AIDS example is considered here once more; standard errors are estimated for several piecewise constant regression models selected by the model criteria. Our results indicate that for both the example and the sensitivity study, the resulting estimated standard errors of certain model parameters can be quite large. Chapter 4 evaluates the goodness-of-link function for the transition intensity between states 2 and 3 in the regression model we introduced in Chapter 2. By embedding this hazard function in a one-parameter family of hazard functions, we can assess its dependence on the specific parametric form adopted. In a simulation study, the goodness-of-link parameter is estimated and its impact on the regression parameters is assessed. The logistic specification of the hazard function from state 2 to state 3 is appropriate for the discrete, parametric-based data sets considered, as well as for the AIDS data. We also investigate the uniqueness and consistency of the maximum likelihood estimates based on our regression model for these AIDS data. In Chapter 5 we consider the possible efficiency gains realized in estimating the survivor function when an intermediate auxiliary variable is incorporated into a time-to-event analysis. Both Markov and hybrid time scale frameworks are adopted in the resulting progressive three-state model. We consider three cases for the amount of information available about the auxiliary variable: the observation is completely unknown, known exactly, or known to be within an interval of time. In the Markov framework, our results suggest that observing subjects at just two time points provides as much information about the survivor function as knowing the exact time of the intermediate event. There was generally a greater loss of efficiency in the hybrid time setting. The final chapter identifies some directions for future research.

Item: Duration Data Analysis in Longitudinal Survey (University of Waterloo, 2003). Boudreau, Christian.
Considerable amounts of event history data are collected through longitudinal surveys. These surveys have many particularities or features that are the result of the dynamic nature of the population under study and of the fact that data collected through longitudinal surveys involve the use of complex survey designs, with clustering and stratification. These particularities include attrition, the seam effect, censoring, left truncation, and complications in variance estimation due to the use of complex survey designs. This thesis focuses on the last two points. Statistical methods based on the stratified Cox proportional hazards model that account for intra-cluster dependence, when the sampling design is uninformative, are proposed. This is achieved using the theory of estimating equations in conjunction with empirical process theory. Issues concerning analytic inference from survey data and the use of weighted versus unweighted procedures are also discussed. The proposed methodology is applied to data from the U.S. Survey of Income and Program Participation (SIPP) and data from the Canadian Survey of Labour and Income Dynamics (SLID). Finally, different statistical methods for handling left-truncated sojourns are explored and compared. These include the conditional partial likelihood and other methods based on the Exponential or the Weibull distributions.
Item: Prediction of recurrent events (University of Waterloo, 2004). Fredette, Marc.
In this thesis, we study issues related to prediction problems, with an emphasis on those arising when recurrent events are involved. The basic concepts of frequentist and Bayesian statistical prediction are defined in the first chapter. In the second chapter, we study frequentist prediction intervals and their associated predictive distributions. We then present an approach based on asymptotically uniform pivotals that is shown to dominate the plug-in approach under certain conditions. The following three chapters consider the prediction of recurrent events. The third chapter presents different prediction models when these events can be modeled using homogeneous Poisson processes. Amongst these models, those using random effects are shown to possess interesting features. In the fourth chapter, the time homogeneity assumption is relaxed and we present prediction models for non-homogeneous Poisson processes. The behavior of these models is then studied for prediction problems with a finite horizon. In the fifth chapter, we apply the concepts discussed previously to a warranty dataset from the automobile industry. Since the number of processes in this dataset is very large, we focus on methods providing computationally rapid prediction intervals. Finally, we discuss possibilities for future research in the last chapter.
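For reference, the plug-in approach that pivotal-based intervals are designed to improve upon can be sketched as follows for a homogeneous Poisson process (a hypothetical warranty-style example, not data from the thesis):

```python
from scipy.stats import poisson

def plugin_prediction_interval(n_obs, t_obs, t_future, level=0.95):
    """Plug-in prediction interval for the number of events of a homogeneous
    Poisson process in a future window, using the MLE of the rate.
    Illustrative only: the plug-in interval ignores the uncertainty in the
    estimated rate, which is the deficiency a pivotal approach addresses."""
    rate_hat = n_obs / t_obs                  # MLE of the event rate
    mean_future = rate_hat * t_future         # plug-in predictive mean
    lo = poisson.ppf((1 - level) / 2, mean_future)
    hi = poisson.ppf(1 - (1 - level) / 2, mean_future)
    return int(lo), int(hi)

# Hypothetical example: 120 claims observed over 2 years,
# predict the claim count over the next year.
print(plugin_prediction_interval(n_obs=120, t_obs=2.0, t_future=1.0))
```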
Item: Integration in Computer Experiments and Bayesian Analysis (University of Waterloo, 2005). Karuri, Stella.
Mathematical models are commonly used in science and industry to simulate complex physical processes. These models are implemented by computer codes which are often complex. For this reason, the codes are also expensive in terms of computation time, and this limits the number of simulations in an experiment. The codes are also deterministic, which means that output from a code has no measurement error.
One modelling approach in dealing with deterministic output from computer experiments is to assume that the output is composed of a drift component and systematic errors, which are stationary Gaussian stochastic processes. A Bayesian approach is desirable as it takes into account all sources of model uncertainty. Apart from prior specification, one of the main challenges in a complete Bayesian model is integration. We take a Bayesian approach with a Jeffreys prior on the model parameters. To integrate over the posterior, we use two approximation techniques on the log-scaled posterior of the correlation parameters. First, we approximate the Jeffreys prior on the untransformed parameters; this enables us to specify a uniform prior on the transformed parameters, which makes Markov chain Monte Carlo (MCMC) simulations run faster. For the second approach, we approximate the posterior with a Normal density.
A large part of the thesis is focused on the problem of integration. Integration is often a goal in computer experiments and, as previously mentioned, is necessary for inference in Bayesian analysis. Sampling strategies are more challenging in computer experiments, particularly when dealing with computationally expensive functions. We focus on the problem of integration by using a sampling approach which we refer to as "GaSP integration". This approach assumes that the integrand over some domain is a Gaussian random variable. It follows that the integral itself is a Gaussian random variable and the Best Linear Unbiased Predictor (BLUP) can be used as an estimator of the integral. We show that the integration estimates from GaSP integration have lower absolute errors. We also develop the Adaptive Sub-region Sampling Integration Algorithm (ASSIA) to improve GaSP integration estimates. The algorithm recursively partitions the integration domain into sub-regions in which GaSP integration can be applied more effectively. As a result of the adaptive partitioning of the integration domain, the adaptive algorithm varies sampling to suit the variation of the integrand. This "strategic sampling" can be used to explore the structure of functions in computer experiments.
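To make the GaSP idea concrete, here is a minimal sketch under simplifying assumptions that are not the thesis's setup (a one-dimensional integrand on [0, 1], a zero-mean Gaussian process prior with a squared-exponential correlation and a fixed length-scale): the BLUP of the integral is then a weighted sum of the observed function values.

```python
import numpy as np
from scipy.special import erf

def gasp_integral(f, x, length_scale=0.2, jitter=1e-10):
    """BLUP of the integral of f over [0, 1] under a zero-mean GP prior on f
    with kernel k(x, x') = exp(-(x - x')**2 / (2 * l**2))."""
    x = np.asarray(x, dtype=float)
    y = np.array([f(xi) for xi in x])
    l = length_scale
    # Kernel matrix of the design points (jitter for numerical stability).
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * l ** 2))
    K += jitter * np.eye(len(x))
    # z_i = integral over [0, 1] of k(u, x_i) du, available in closed form via erf.
    z = l * np.sqrt(np.pi / 2) * (erf((1 - x) / (np.sqrt(2) * l))
                                  + erf(x / (np.sqrt(2) * l)))
    weights = np.linalg.solve(K, z)
    return weights @ y

f = lambda u: np.sin(2 * np.pi * u) + u ** 2      # toy integrand, exact integral = 1/3
design = np.linspace(0.05, 0.95, 10)              # a small space-filling design
print(gasp_integral(f, design), "vs exact", 1 / 3)
```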
Item: Statistical Methods for High Throughput Screening Drug Discovery Data (University of Waterloo, 2005). Wang, Yuanyuan (Marcia).
High Throughput Screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Data on activity against the target are collected for a representative sample of compounds selected from a large library. The goal of drug discovery is to relate the activity of a compound to its chemical structure, which is quantified by various explanatory variables, and hence to identify further active compounds. Often, this application has a very unbalanced class distribution, with a rare active class.
Classification methods are commonly proposed as solutions to this problem. However, in drug discovery, researchers are more interested in ranking compounds by predicted activity than in the classification itself. This feature makes my approach distinct from common classification techniques.
In this thesis, two AIDS data sets from the National Cancer Institute (NCI) are mainly used. Local methods, namely K-nearest neighbours (KNN) and classification and regression trees (CART), perform very well on these data in comparison with linear/logistic regression, neural networks, and Multivariate Adaptive Regression Splines (MARS) models, which assume more smoothness. One reason for the superiority of local methods is the local behaviour of the data. Indeed, I argue that conventional classification criteria such as misclassification rate or deviance tend to select too small a tree or too large a value of k (the number of nearest neighbours). A more local model (bigger tree or smaller k) gives a better performance in terms of drug discovery.
Because off-the-shelf KNN works relatively well, this thesis takes this promising method and makes several novel modifications, which further improve its performance. The choice of k is optimized for each test point to be predicted. The empirically observed superiority of allowing k to vary is investigated. The nature of the problem, ranking of objects rather than estimating the probability of activity, enables the k-varying algorithm to stand out. Similarly, KNN combined with a kernel weight function (weighted KNN) is proposed and demonstrated to be superior to the regular KNN method.
High dimensionality of the explanatory variables is known to cause problems for KNN and many other classifiers. I propose a novel method (subset KNN) of averaging across multiple classifiers based on building classifiers on subspaces (subsets of variables). It improves the performance of KNN for HTS data. When applied to CART, it also performs as well as or even better than the popular methods of bagging and boosting. Part of this improvement is due to the discovery that classifiers based on irrelevant subspaces (unimportant explanatory variables) do little damage when averaged with good classifiers based on relevant subspaces (important variables). This result is particular to the ranking of objects rather than estimating the probability of activity. A theoretical justification is proposed. The thesis also suggests diagnostics for identifying important subsets of variables and hence further reducing the impact of the curse of dimensionality.
In order to have a broader evaluation of these methods, subset KNN and weighted KNN are applied to three other data sets: the NCI AIDS data with Constitutional descriptors, Mutagenicity data with BCUT descriptors, and Mutagenicity data with Constitutional descriptors. The k-varying algorithm as a method for unbalanced data is also applied to the NCI AIDS data with Constitutional descriptors. As a baseline, the performance of KNN on such data sets is reported. Although different methods are best for the different data sets, some of the proposed methods are always amongst the best.
Finally, methods are described for estimating activity rates and error rates in HTS data. By combining auxiliary information about repeat tests of the same compound, likelihood methods can extract interesting information about the magnitudes of the measurement errors made in the assay process. These estimates can be used to assess model performance, which sheds new light on how various models handle the large random or systematic assay errors often present in HTS data.
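As a concrete companion to the weighted KNN idea, the following illustrative sketch (simulated data and a Gaussian weight function are assumptions, not the thesis's exact algorithm) scores compounds by a kernel-weighted average of their neighbours' activities and ranks them rather than classifying them.

```python
import numpy as np

def weighted_knn_scores(X_train, y_train, X_test, k=25, bandwidth=1.0):
    """Rank-oriented weighted KNN: each test compound gets a score equal to a
    Gaussian-kernel-weighted average of the 0/1 activities of its k nearest
    training compounds. Compounds are then ranked by score, not classified."""
    scores = np.empty(len(X_test))
    for i, x in enumerate(X_test):
        d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances
        nn = np.argsort(d)[:k]                         # indices of the k nearest neighbours
        w = np.exp(-(d[nn] / bandwidth) ** 2)          # kernel weights
        scores[i] = np.sum(w * y_train[nn]) / np.sum(w)
    return scores

# Hypothetical screening data: 5 descriptors, roughly 2% active compounds.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(2000, 5))
y_train = (rng.random(2000) < 0.02).astype(float)
X_test = rng.normal(size=(500, 5))

scores = weighted_knn_scores(X_train, y_train, X_test)
ranking = np.argsort(-scores)          # test compounds ordered by predicted activity
print(ranking[:10])                    # the ten highest-priority compounds to assay next
```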
Item: Multiple testing using the posterior probability of half-space: application to gene expression data (University of Waterloo, 2005). Labbe, Aurelie.
We consider the problem of testing the equality of two sample means when the number of tests performed is large. Applying this problem to the context of gene expression data, our goal is to detect a set of genes differentially expressed under two treatments or two biological conditions. A null hypothesis of no difference in the gene expression under the two conditions is constructed. Since such a hypothesis is tested for each gene, it follows that thousands of tests are performed simultaneously, and multiple testing issues then arise. The aim of our research is to make a connection between Bayesian analysis and frequentist theory in the context of multiple comparisons by deriving some properties shared by both p-values and posterior probabilities. The ultimate goal of this work is to use the posterior probability of the one-sided alternative hypothesis (or equivalently, the posterior probability of the half-space) in the same spirit as a p-value. We show, for instance, that such a Bayesian probability can be used as an input to some standard multiple testing procedures controlling the False Discovery Rate.
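A minimal sketch of the idea, under simplifying assumptions not taken from the thesis (known unit variances and a flat prior on each mean difference, so the posterior probability of the half-space is a Normal tail probability), fed into the standard Benjamini-Hochberg step-up procedure:

```python
import numpy as np
from scipy.stats import norm

def posterior_halfspace_prob(xbar1, xbar2, n1, n2, sigma=1.0):
    """Posterior probability that delta = mu1 - mu2 <= 0, assuming known
    variance sigma**2 and a flat prior on delta, so the posterior of delta is
    Normal(xbar1 - xbar2, sigma**2 * (1/n1 + 1/n2))."""
    se = sigma * np.sqrt(1.0 / n1 + 1.0 / n2)
    return norm.cdf(0.0, loc=xbar1 - xbar2, scale=se)

def benjamini_hochberg(p, q=0.05):
    """Standard BH step-up procedure; here the 'p-values' are the posterior
    half-space probabilities, used in the same spirit as p-values."""
    p = np.asarray(p)
    m = len(p)
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k]] = True
    return rejected

# Hypothetical expression data: 1000 genes, 20 arrays per condition, 50 genes shifted.
rng = np.random.default_rng(2)
m, n = 1000, 20
shift = np.r_[np.full(50, 1.0), np.zeros(m - 50)]
x1 = rng.normal(loc=shift[:, None], size=(m, n))
x2 = rng.normal(size=(m, n))
probs = posterior_halfspace_prob(x1.mean(axis=1), x2.mean(axis=1), n, n)
print("genes flagged:", benjamini_hochberg(probs, q=0.05).sum())
```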
Item: Imputation, Estimation and Missing Data in Finance (University of Waterloo, 2006). DiCesare, Giuseppe.
Suppose X is a diffusion process, possibly multivariate, and suppose that there are various segments of the components of X that are missing. This happens, for example, if X is the price of various assets and these prices are only observed at specific discrete trading times. Imputation (or conditional simulation) of the missing pieces of the sample paths of X is discussed in several settings. When X is a Brownian motion, the conditioned process is a tied-down Brownian motion, or Brownian bridge process. In the special case of Gaussian stochastic processes the problem is simplified, since the conditional finite-dimensional distributions of the process are multivariate Normal. For more general diffusion processes, including those with jump components, an acceptance-rejection simulation algorithm is introduced which enables one to sample from the exact conditional distribution without appealing to approximate time-step methods such as the popular Euler or Milstein schemes. The method is referred to as pathwise imputation. Its practical implementation relies only on the basic elements of simulation, while its theoretical justification depends on the pathwise properties of stochastic processes and in particular Girsanov's theorem. The method allows for the complete characterization of the bridge paths of complicated diffusions using only Brownian bridge interpolation. The imputation methods discussed are applied to estimation, variance reduction and exotic option pricing.
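For the Brownian motion special case mentioned above, conditional simulation of a missing segment reduces to drawing a Brownian bridge between the observed endpoints. A small sketch of the drift-free, unit-volatility case with made-up endpoints:

```python
import numpy as np

def brownian_bridge(x_left, x_right, t_left, t_right, n_steps, rng):
    """Simulate a Brownian path on (t_left, t_right) conditioned to equal
    x_left and x_right at the endpoints (a Brownian bridge), by sequential
    conditioning: each new point is Normal with mean pulled linearly toward
    the right endpoint and variance shrinking near the endpoints."""
    times = np.linspace(t_left, t_right, n_steps + 1)
    path = np.empty(n_steps + 1)
    path[0], path[-1] = x_left, x_right
    for i in range(1, n_steps):
        dt = times[i] - times[i - 1]
        remaining = t_right - times[i - 1]
        mean = path[i - 1] + dt * (x_right - path[i - 1]) / remaining
        var = dt * (remaining - dt) / remaining
        path[i] = rng.normal(mean, np.sqrt(var))
    return times, path

# Impute a missing stretch of a (log-)price path observed at t = 0 and t = 1.
rng = np.random.default_rng(3)
times, path = brownian_bridge(x_left=0.0, x_right=0.3, t_left=0.0, t_right=1.0,
                              n_steps=50, rng=rng)
print(path[:5], "...", path[-1])
```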
Item: Measurement Error and Misclassification in Interval-Censored Life History Data (University of Waterloo, 2007-05-04). White, Bethany Joy Giddings.
In practice, data are frequently incomplete in one way or another. It can be a significant challenge to make valid inferences about the parameters of interest in this situation. In this thesis, three problems involving such data are addressed. The first two problems involve interval-censored life history data with mismeasured covariates. Data of this type are incomplete in two ways. First, the exact event times are unknown due to censoring. Second, the true covariate is missing for most, if not all, individuals. This work focuses primarily on the impact of covariate measurement error in progressive multi-state models with data arising from panel (i.e., interval-censored) observation. These types of problems arise frequently in clinical settings (e.g., when disease progression is of interest and patient information is collected during irregularly spaced clinic visits). Two- and three-state models are considered in this thesis. This work is motivated by a research program on psoriatic arthritis (PsA) where the effects of error-prone covariates on rates of disease progression are of interest and patient information is collected at clinic visits (Gladman et al. 1995; Bond et al. 2006). Information regarding the error distributions was available based on results from a separate study conducted to evaluate the reliability of clinical measurements that are used in PsA treatment and follow-up (Gladman et al. 2004). The asymptotic bias of covariate effects obtained ignoring error in covariates is investigated and shown to be substantial in some settings. In a series of simulation studies, the performance of corrected likelihood methods and methods based on a simulation-extrapolation (SIMEX) algorithm (Cook & Stefanski, 1994) was investigated to address covariate measurement error. The methods implemented were shown to result in much smaller empirical biases and empirical coverage probabilities closer to the nominal levels. The third problem considered involves an extreme case of interval censoring known as current status data. Current status data arise when individuals are observed only at a single point in time and it is then determined whether they have experienced the event of interest. To complicate matters, in the problem considered here, an unknown proportion of the population will never experience the event of interest. Again, this type of data is incomplete in two ways. One assessment is made on each individual to determine whether or not an event has occurred. Therefore, the exact event times are unknown for those who will eventually experience the event. In addition, whether or not the individuals will ever experience the event is unknown for those who have not experienced the event by the assessment time. This problem was motivated by a series of orthopedic trials looking at the effect of blood thinners in hip and knee replacement surgeries. These blood thinners can cause a negative serological response in some patients. This response was the outcome of interest and the only available information regarding it was the seroconversion time under current status observation. In this thesis, latent class models with parametric, nonparametric and piecewise constant forms of the seroconversion time distribution are described. They account for the fact that only a proportion of the population will experience the event of interest. Estimators based on an EM algorithm were evaluated via simulation, and the orthopedic surgery data were analyzed based on this methodology.

Item: Estimation and allocation of insurance risk capital (University of Waterloo, 2007-05-15). Kim, Hyun Tae.
Estimating tail risk measures such as Value at Risk (VaR) and Conditional Tail Expectation (CTE) is a vital component in financial and actuarial risk management. The CTE is a preferred risk measure, due to its coherence and widespread acceptance in the actuarial community. In particular, we focus on the estimation of the CTE using both parametric and nonparametric approaches. In the parametric case, the conditional tail expectation and variance are analytically derived for the exponential distribution family and its transformed distributions. For small i.i.d. samples, the exact bootstrap (EB) and the influence function are used as nonparametric methods in estimating the bias and the variance of the empirical CTE. In particular, it is shown that the bias is corrected using the bootstrap for the CTE case. In variance estimation, the influence function of the bootstrapped quantile is derived and can be used to estimate the variance of any bootstrapped L-estimator without simulations, including the VaR and the CTE, via the nonparametric delta method. An industry model is provided by applying the theoretical findings on the bias and the variance of the estimated CTE. Finally, a new capital allocation method is proposed. Inspired by the allocation of the solvency exchange option by Sherris (2006), this method resembles the CTE allocation in its form and properties, but has its own unique features, such as manager-based decomposition. Through a numerical example the proposed allocation is shown to fail the no-undercut axiom, but we argue that this axiom may not be aligned with economic reality.
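As an illustrative aside (an ordinary Monte Carlo bootstrap on simulated losses, not the thesis's exact-bootstrap or influence-function derivations), the empirical CTE and a resampling estimate of its bias might be computed as follows:

```python
import numpy as np

def empirical_cte(losses, alpha=0.95):
    """Empirical Conditional Tail Expectation: the average of the losses that
    exceed the empirical alpha-quantile (VaR)."""
    losses = np.sort(losses)
    var_alpha = np.quantile(losses, alpha)
    return losses[losses >= var_alpha].mean()

def bootstrap_bias(losses, alpha=0.95, n_boot=2000, rng=None):
    """Ordinary Monte Carlo bootstrap estimate of the bias of the empirical CTE
    (the thesis works with the exact bootstrap and influence functions instead)."""
    rng = rng or np.random.default_rng(0)
    point = empirical_cte(losses, alpha)
    boot = [empirical_cte(rng.choice(losses, size=len(losses), replace=True), alpha)
            for _ in range(n_boot)]
    return np.mean(boot) - point          # estimated bias; subtract it to correct

rng = np.random.default_rng(4)
losses = rng.pareto(3.0, size=200) * 100       # small heavy-tailed loss sample
cte = empirical_cte(losses)
bias = bootstrap_bias(losses, rng=rng)
print("empirical CTE:", cte, " bias-corrected:", cte - bias)
```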
Item: Customizing kernels in Support Vector Machines (University of Waterloo, 2007-05-22). Zhang, Zhanyang.
Support Vector Machines have been used for classification and regression analysis. One important component of SVMs is the kernel. Although there are several widely used kernel functions, a carefully designed kernel can help to improve the accuracy of SVMs. We present two methods for customizing kernels: one is combining existing kernels into new kernels; the other is feature selection. We provide a theoretical analysis of the feature spaces of combined kernels. Further, an experiment on a chemical data set showed improvements of a linear-Gaussian combined kernel over single kernels. Though the improvements are not universal, we present a new idea for creating kernels in SVMs.

Item: Efficient Procedure for Valuing American Lookback Put Options (University of Waterloo, 2007-05-22). Wang, Xuyan.
The lookback option is a well-known path-dependent option whose payoff depends on historical extremum prices. This thesis focuses on the binomial pricing of American floating strike lookback put options with payoff at time $t$ (if exercised) given by \[ \max_{k=0, \ldots, t} S_k - S_t, \] where $S_t$ denotes the price of the underlying stock at time $t$. Building upon the idea of Reiner, Babbs, Cheuk and Vorst (RBCV, 1992), who proposed a transformed binomial lattice model for efficient pricing of this class of options, this thesis extends and enhances their binomial recursive algorithm by exploiting additional combinatorial properties of the lattice structure. The proposed algorithm is not only computationally efficient but also significantly reduces the memory requirement. As a result, the proposed algorithm is more than 1000 times faster than the original RBCV algorithm and can compute a binomial lattice with one million time steps in less than two seconds. This algorithm enables us to extrapolate the limiting (American) option value to 4 or 5 decimal places of accuracy in real time.
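The transformed-lattice idea can be illustrated with a plain implementation of the standard transformation that tracks the power j for which the running maximum equals S_t * u**j; this is only a baseline sketch, not the enhanced algorithm of the thesis.

```python
import numpy as np

def american_floating_lookback_put(S0, r, sigma, T, n):
    """Binomial valuation of an American floating-strike lookback put by working
    with j, where the running maximum M_t = S_t * u**j. The exercise value at a
    node is S_t * (u**j - 1), so W(j) = V / S satisfies a one-dimensional recursion."""
    dt = T / n
    u = np.exp(sigma * np.sqrt(dt))
    d = 1.0 / u
    q = (np.exp(r * dt) - d) / (u - d)        # risk-neutral up probability
    disc = np.exp(-r * dt)
    j = np.arange(n + 1)
    W = u ** j - 1.0                          # terminal values of V / S
    for t in range(n - 1, -1, -1):
        j = np.arange(t + 1)
        up = W[np.maximum(j - 1, 0)]          # up move: j -> max(j - 1, 0)
        down = W[j + 1]                       # down move: j -> j + 1
        cont = disc * (q * u * up + (1 - q) * d * down)
        W = np.maximum(u ** j - 1.0, cont)    # American early-exercise check
    return S0 * W[0]

print(american_floating_lookback_put(S0=100.0, r=0.05, sigma=0.3, T=0.5, n=2000))
```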
Item: Flexible Mixed-Effect Modeling of Functional Data, with Applications to Process Monitoring (University of Waterloo, 2007-06-18). Mosesova, Sofia.
High levels of automation in manufacturing industries are leading to data sets of increasing size and dimension. The challenge facing statisticians and field professionals is to develop methodology to help meet this demand. Functional data are one example of high-dimensional data, characterized by observations recorded as a function of some continuous measure, such as time. An application considered in this thesis comes from the automotive industry. It involves a production process in which valve seats are force-fitted by a ram into cylinder heads of automobile engines. For each insertion, the force exerted by the ram is automatically recorded every fraction of a second for about two and a half seconds, generating a force profile. We can think of these profiles as individual functions of time summarized into collections of curves. The focus of this thesis is the analysis of functional process data such as the valve seat insertion example. A number of techniques are set forth. In the first part, two ways to model a single curve are considered: a B-spline fit via linear regression, and a nonlinear model based on differential equations. Each of these approaches is incorporated into a mixed effects model for multiple curves, and multivariate process monitoring techniques are applied to the predicted random effects in order to identify anomalous curves. In the second part, a Bayesian hierarchical model is used to cluster low-dimensional summaries of the curves into meaningful groups. The belief is that the clusters correspond to distinct types of processes (e.g., various types of “good” or “faulty” assembly). New observations can be assigned to one of these by calculating the probabilities of belonging to each cluster. Mahalanobis distances are used to identify new observations not belonging to any of the existing clusters. Synthetic and real data are used to validate the results.

Item: Stochastic Mortality Models with Applications in Financial Risk Management (University of Waterloo, 2007-06-18). Li, Siu Hang.
In product pricing and reserving, actuaries are often required to make predictions of future death rates. In the past, this has been done using deterministic improvement scales that give only a single mortality trajectory. However, it is very likely that future death rates will turn out to be different from the projected ones, so a better assessment of longevity risk is one that consists of both a mean estimate and a measure of uncertainty. Such an assessment can be performed using a stochastic mortality model, which is the core of this thesis.
The Lee-Carter model is one of the most popular stochastic mortality models. While it does an excellent job in mean forecasting, it has been criticized for providing overly narrow prediction intervals that may underestimate uncertainty. This thesis mitigates this problem by relaxing the assumption on the distribution of death counts. We found that the generalization from Poisson to negative binomial is equivalent to allowing gamma heterogeneity within each age-period cell. The proposed extension gives not only a better fit, but also a more conservative prediction interval that may better reflect the uncertainty entailed. The proposed extension is then applied to the construction of mortality improvement scales for Canadian insured lives. Given that the insured-lives data series are too short for a direct Lee-Carter projection, we build an additional relational model that borrows strength from the Canadian population data, which cover a far longer period. The resulting scales include explicit measures of uncertainty. The prediction of the tail of a survival distribution requires special treatment due to the lack of high-quality old-age mortality data. We utilize asymptotic results from modern extreme value theory to extrapolate death probabilities to the advanced ages, and to statistically determine the age at which the life table should be closed. This technique is further integrated with the Lee-Carter model to produce a stochastic analysis of old-age mortality, and a prediction of the highest attained age for various cohorts. The mortality models we consider are further applied to the valuation of mortality-related financial products. In particular, we investigate the no-negative-equity guarantee that is offered in most fixed-repayment lifetime mortgages in Britain. The valuation of such a guarantee requires simultaneous consideration of both longevity risk and house price inflation risk. We found that house price returns can be well described by an ARMA-EGARCH time-series process. Under an ARMA-EGARCH process, however, the Black-Scholes formula no longer applies. We derive our own pricing formula based on the conditional Esscher transformation. Finally, we propose some possible hedging and capital reserving strategies for managing the risks associated with the guarantee.
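For readers unfamiliar with the Lee-Carter structure log m(x, t) = a_x + b_x k_t + e(x, t), a minimal SVD-based fit on simulated data is sketched below (the classical estimation route; the thesis's negative binomial extension replaces the Poisson error assumption and is not implemented here).

```python
import numpy as np

def lee_carter_fit(log_m):
    """Classical SVD fit of the Lee-Carter model log m(x, t) = a_x + b_x * k_t,
    with the usual constraints sum(b_x) = 1 and sum(k_t) = 0."""
    a = log_m.mean(axis=1)                         # age effect: row means
    centered = log_m - a[:, None]
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    b = U[:, 0] / U[:, 0].sum()                    # normalize so b sums to one
    k = s[0] * Vt[0, :] * U[:, 0].sum()            # rescale k accordingly
    return a, b, k

def random_walk_forecast(k, horizon, n_sims=1000, rng=None):
    """Forecast k_t as a random walk with drift, the standard Lee-Carter step,
    returning simulated future paths from which prediction intervals follow."""
    rng = rng or np.random.default_rng(0)
    d = np.diff(k)
    drift, sd = d.mean(), d.std(ddof=1)
    steps = rng.normal(drift, sd, size=(n_sims, horizon)).cumsum(axis=1)
    return k[-1] + steps

# Toy data: 10 ages, 40 years of synthetic log death rates with a downward trend.
rng = np.random.default_rng(5)
ages, years = 10, 40
true_a = np.linspace(-6, -2, ages)
true_b = np.full(ages, 1.0 / ages)
true_k = -np.arange(years) * 0.5
log_m = true_a[:, None] + np.outer(true_b, true_k) + rng.normal(0, 0.02, (ages, years))

a, b, k = lee_carter_fit(log_m)
paths = random_walk_forecast(k, horizon=20)
print("forecast k in 20 years: mean", paths[:, -1].mean(),
      "95% PI", np.percentile(paths[:, -1], [2.5, 97.5]))
```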
Item: The Valuation and Risk Management of a DB Underpin Pension Plan (University of Waterloo, 2007-08-02). Chen, Kai.
Hybrid pension plans offer employees the best features of both defined benefit and defined contribution plans. In this work, we consider the hybrid design offering a defined contribution benefit with a defined benefit guaranteed minimum underpin. This study applies the contingent claims approach to value the defined contribution benefit with a defined benefit guaranteed minimum underpin. The study shows that entry age, utility function parameters and the market price of risk each have a significant effect on the value of retirement benefits. We also consider risk management for this defined benefit underpin pension plan. Assuming fixed interest rates, and assuming that salaries can be treated as a tradable asset, contribution rates are developed for the Entry Age Normal (EAN), Projected Unit Credit (PUC), and Traditional Unit Credit (TUC) funding methods. For the EAN, the contribution rates are constant throughout the service period. However, the hedge parameters for this method are not tradable. For the accruals method, the individual contribution rates are not constant. For both the PUC and TUC, a delta hedge strategy is derived and explained. The analysis is extended to relax the tradability assumption for salaries, using inflation as a partial hedge. Finally, methods for incorporating volatility reduction and risk management are considered.

Item: Topics in Delayed Renewal Risk Models (University of Waterloo, 2007-08-03). Kim, So-Yeun.
The main focus is to extend the analysis of ruin-related quantities, such as the surplus immediately prior to ruin, the deficit at ruin and the ruin probability, to delayed renewal risk models. First, the background for the delayed renewal risk model is introduced and two important equations that are used as frameworks are derived. These equations are extended from the ordinary renewal risk model to the delayed renewal risk model. The first equation is obtained by conditioning on the first drop below the initial surplus level, and the second equation by conditioning on the amount and the time of the first claim. Then, we consider the deficit at ruin in particular among the many random variables associated with ruin, and six main results are derived. We also explore how the Gerber-Shiu expected discounted penalty function can be expressed in closed form when distributional assumptions are given for claim sizes or the time until the first claim. Lastly, we consider a model in which the premium rate is reduced when the surplus level is above a certain threshold value, until it falls below the threshold value. The amount of the reduction in the premium rate can also be viewed as a dividend rate paid out from the original premium rate when the surplus level is above the threshold value. The constant barrier model is considered as a special case, where the premium rate is reduced to 0 when the surplus level reaches a certain threshold value. The dividend amount paid out during the life of the surplus process until ruin, discounted to the beginning of the process, is also considered.

Item: Computation of Multivariate Barrier Crossing Probability, and Its Applications in Finance (University of Waterloo, 2007-09-05). Huh, Joonghee.
In this thesis, we consider computational methods for finding exit probabilities for a class of multivariate stochastic processes. While there is an abundance of results for one-dimensional processes, for multivariate processes one has to rely on approximations or simulation methods. We adopt a Large Deviations approach in order to estimate barrier crossing probabilities of a multivariate Brownian bridge. We use this approach in conjunction with numerical techniques to propose an efficient method of obtaining barrier crossing probabilities of a multivariate Brownian motion. Using numerical examples, we demonstrate that our method works better than other existing methods. We present applications of the proposed method to problems in finance, such as estimating default probabilities of several credit-risky entities and pricing credit default swaps. We also extend our computational method to efficiently estimate a barrier crossing probability of a sum of Geometric Brownian motions. This allows us to perform portfolio selection by maximizing a path-dependent utility function.
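As a baseline against which such approximations compete, a crude Monte Carlo estimate of the probability that a two-dimensional correlated Brownian motion crosses either of two upper barriers before time T is sketched below (a discretized simulation with made-up parameter values, so it slightly understates the continuous-time probability).

```python
import numpy as np

def mc_barrier_crossing_prob(mu, sigma, corr, barriers, T=1.0,
                             n_steps=1000, n_paths=20000, seed=0):
    """Crude Monte Carlo estimate of P(some component of a correlated Brownian
    motion with drift exceeds its barrier at a grid point before time T)."""
    rng = np.random.default_rng(seed)
    dim = len(mu)
    dt = T / n_steps
    chol = np.linalg.cholesky(corr)                   # correlate the increments
    hits = 0
    for _ in range(n_paths):
        z = rng.standard_normal((n_steps, dim)) @ chol.T
        increments = mu * dt + sigma * np.sqrt(dt) * z
        path = np.cumsum(increments, axis=0)
        if np.any(path.max(axis=0) >= barriers):
            hits += 1
    return hits / n_paths

# Hypothetical two-asset example.
p = mc_barrier_crossing_prob(mu=np.array([0.0, 0.0]),
                             sigma=np.array([1.0, 1.0]),
                             corr=np.array([[1.0, 0.5], [0.5, 1.0]]),
                             barriers=np.array([2.0, 2.5]))
print("estimated crossing probability:", p)
```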
Item: Interval Censoring and Longitudinal Survey Data (University of Waterloo, 2007-09-11). Pantoja Galicia, Norberto.
Being able to explore a relationship between two life events is of great interest to scientists from different disciplines. Some issues of particular concern are, for example, the connection between smoking cessation and pregnancy (Thompson and Pantoja-Galicia 2003), the interrelation between entry into marriage for individuals in a consensual union and first pregnancy (Blossfeld and Mills 2003), and the association between job loss and divorce (Charles and Stephens 2004, Huang 2003 and Yeung and Hofferth 1998). Establishing causation in observational studies is seldom possible. Nevertheless, if one of two events tends to precede the other closely in time, a causal interpretation of an association between these events can be more plausible. The role of longitudinal surveys is crucial, then, since they allow sequences of events for individuals to be observed. Thompson and Pantoja-Galicia (2003) discuss in this context several notions of temporal association and ordering, and propose an approach to investigate a possible relationship between two lifetime events. In longitudinal surveys, individuals might be asked questions of particular interest about two specific lifetime events. Therefore the joint distribution might be advantageous for answering questions of particular importance. In follow-up studies, however, it is possible that interval censored data may arise for several reasons. For example, actual dates of events might not have been recorded, or are missing, for a subset of (or all) the sampled population, and can be established only to within specified intervals. Along with the notions of temporal association and ordering, Thompson and Pantoja-Galicia (2003) also discuss the concept of one type of event "triggering" another. In addition, they outline the construction of tests for these temporal relationships. The aim of this thesis is to implement some of these notions using interval censored data from longitudinal complex surveys, and we present some proposed tools that may be used for this purpose. This dissertation is divided into five chapters. The first chapter presents a notion of a temporal relationship along with a formal nonparametric test. The mechanisms of right censoring, interval censoring and left truncation are also overviewed. Issues concerning complex survey designs are discussed at the end of this chapter. For the remaining chapters of the thesis, we note that the corresponding formal nonparametric test requires estimation of a joint density; therefore, in the second chapter a nonparametric approach for bivariate density estimation with interval censored survey data is provided. The third chapter is devoted to modelling shorter-term triggering using complex survey bivariate data. The semiparametric models in Chapter 3 consider both noncensoring and interval censoring situations. The fourth chapter presents some applications using data from the National Population Health Survey and the Survey of Labour and Income Dynamics from Statistics Canada. An overall discussion is included in the fifth chapter, and topics for future research are also addressed in this last chapter.

Item: Statistical Learning in Drug Discovery via Clustering and Mixtures (University of Waterloo, 2007-09-20). Wang, Xu.
In drug discovery, thousands of compounds are assayed to detect activity against a biological target. The goal of drug discovery is to identify compounds that are active against the target (e.g. inhibit a virus). Statistical learning in drug discovery seeks to build a model that uses descriptors characterizing molecular structure to predict biological activity.
However, the characteristics of drug discovery data can make it difficult to model the relationship between molecular descriptors and biological activity. Among these characteristics are the rarity of active compounds, the large volume of compounds tested by high-throughput screening, and the complexity of molecular structure and its relationship to activity. This thesis focuses on the design of statistical learning algorithms/models and their applications to drug discovery. The two main parts of the thesis are an algorithm-based statistical method and a more formal model-based approach. Both approaches can facilitate and accelerate the process of developing new drugs. A unifying theme is the use of unsupervised methods as components of supervised learning algorithms/models. In the first part of the thesis, we explore a sequential screening approach, Cluster Structure-Activity Relationship Analysis (CSARA). Sequential screening integrates High Throughput Screening with mathematical modeling to sequentially select the best compounds. CSARA is a cluster-based and algorithm-driven method. To gain further insight into this method, we use three carefully designed experiments to compare predictive accuracy with Recursive Partitioning, a popular structure-activity relationship analysis method. The experiments show that CSARA outperforms Recursive Partitioning. Comparisons include problems with many descriptor sets and situations in which many descriptors are not important for activity. In the second part of the thesis, we propose and develop constrained mixture discriminant analysis (CMDA), a model-based method. The main idea of CMDA is to model the distribution of the observations given the class label (e.g. active or inactive class) as a constrained mixture distribution, and then use Bayes’ rule to predict the probability of being active for each observation in the testing set. Constraints are used to deal with the otherwise explosive growth of the number of parameters with increasing dimensionality. CMDA is designed to solve several challenges in modeling drug data sets, such as multiple mechanisms, the rare target problem (i.e. imbalanced classes), and the identification of relevant subspaces of descriptors (i.e. variable selection). We focus on the CMDA1 model, in which univariate densities form the building blocks of the mixture components. Due to the unboundedness of the CMDA1 log likelihood function, it is easy for the EM algorithm to converge to degenerate solutions. A special Multi-Step EM algorithm is therefore developed and explored via several experimental comparisons. Using the Multi-Step EM algorithm, the CMDA1 model is compared to model-based clustering discriminant analysis (MclustDA). The CMDA1 model is either superior to or competitive with the MclustDA model, depending on which model generates the data. The CMDA1 model has better performance than the MclustDA model when the data are high-dimensional and unbalanced, an essential feature of the drug discovery problem. An alternate approach to the problem of degeneracy is penalized estimation. By introducing a group of simple penalty functions, we consider penalized maximum likelihood estimation of the CMDA1 and CMDA2 models. This strategy improves the convergence of the conventional EM algorithm and helps avoid degenerate solutions. Extending techniques from Chen et al. (2007), we prove that the PMLEs of the two-dimensional CMDA1 model can be asymptotically consistent.

Item: Analysis of a Threshold Strategy in a Discrete-time Sparre Andersen Model (University of Waterloo, 2007-09-26). Mera, Ana Maria.
In this thesis, it is shown that the application of a threshold on the surplus level of a particular discrete-time delayed Sparre Andersen insurance risk model results in a process that can be analyzed as a doubly infinite Markov chain with finite blocks. Two fundamental cases, encompassing all possible values of the surplus level at the time of the first claim, are explored in detail. Matrix analytic methods are employed to establish a computational algorithm for each case. The resulting procedures are then used to calculate the probability distributions associated with fundamental ruin-related quantities of interest, such as the time of ruin, the surplus immediately prior to ruin, and the deficit at ruin. The ordinary Sparre Andersen model, an important special case of the general model, is considered with varying threshold levels in a numerical illustration.
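As a loosely related toy illustration only (a finite-horizon dynamic-programming recursion for a simple discrete-time risk model with a premium threshold, not the matrix-analytic algorithm or the delayed Sparre Andersen structure of the thesis):

```python
import numpy as np

def finite_time_ruin_prob(u0, horizon, p_claim=0.3, claim_mean=3.0,
                          premium_low=2, premium_high=1, threshold=10, u_max=200):
    """Finite-horizon ruin probability for a toy discrete-time risk model in which
    the per-period premium drops from premium_low to premium_high once the surplus
    reaches the threshold. Each period a claim occurs with probability p_claim and,
    given a claim, its size is geometric with the stated mean. Computed by backward
    dynamic programming on a truncated integer surplus grid."""
    q = 1.0 / claim_mean                       # geometric claim: P(Y = k) = q * (1 - q)**(k - 1)
    sizes = np.arange(1, u_max + 1)
    pmf = p_claim * q * (1 - q) ** (sizes - 1)
    ruin = np.zeros(u_max + 1)                 # P(ruin within 0 steps | surplus u) = 0
    for _ in range(horizon):
        new = np.empty(u_max + 1)
        for u in range(u_max + 1):
            c = premium_low if u < threshold else premium_high
            v = min(u + c, u_max)              # surplus just before the claim (capped)
            prob = (1 - p_claim) * ruin[v]     # no claim this period
            surviving = sizes <= v             # claim small enough to survive
            prob += np.sum(pmf[surviving] * ruin[v - sizes[surviving]])
            prob += p_claim - np.sum(pmf[surviving])   # claim large enough to cause ruin now
            new[u] = prob
        ruin = new
    return ruin[u0]

print(finite_time_ruin_prob(u0=5, horizon=50))
```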