Data Science
Permanent URI for this collection: https://uwspace.uwaterloo.ca/handle/10012/18081
This is the collection for the University of Waterloo's Data Science program.
Recent Submissions
Interpretable Machine Learning (IML) Methods: Classification and Solutions for Transparent Models
(University of Waterloo, 2024-09-18) Ghaffartehrani, Alireza
This thesis explores the realm of machine learning (ML), focusing on techniques for enhancing model interpretability, known as interpretable machine learning (IML). The initial chapter provides a comprehensive overview of various ML models, including supervised, unsupervised, reinforcement, and hybrid learning methods, emphasizing their specific applications across diverse sectors. The second chapter delves into methodologies and the categorization of interpretable models. The research advocates for transparent and understandable IML models, which are particularly crucial in high-stakes decision-making scenarios. By integrating theoretical insights and practical solutions, this work contributes to the growing field of IML, aiming to bridge the gap between complex IML algorithms and their real-world applications.

Performance Evaluation of Women's Volleyball Players Using Bayesian Data Analysis
(University of Waterloo, 2024-08-28) Awosoga, David
Understanding player contribution is an important component of lineup construction, advance scouting, and performance evaluation in volleyball. However, traditional methods utilize oversimplified percentages that fail to acknowledge latent variables and situational nuance. These shortcomings are addressed in this work via a holistic framework mirroring the structure of a story, with each sub-component representing an angle by which player contribution can be investigated. The emphasis here is on modelling player contribution via player presence, and a Bayesian logistic regularized regression with a horseshoe prior and custom model structure is employed. This approach incorporates player roles, lineup matchups, and additional context-specific information into its estimates.
The resultant model outputs are tested against substantive knowledge and posterior predictive checks, with direct application to volleyball coaches, players, and the community at large. Extensions to this analysis consider other factors of player contribution such as player action, event sequences, and player intent, with applications arising from advancements in computer vision, sports science technology, and data acquisition also discussed.

Variance Reduction with Model-based Counterfactual Estimation
(University of Waterloo, 2024-08-21) Shim, Kyu Min
Variance reduction is an important area of research in the realm of online controlled experiments, also known as A/B tests. Reducing outcome variability in an A/B test increases the test’s statistical power and improves the efficiency of the experimentation process. Many variance reduction techniques already exist, and they typically utilize data collected prior to the experiment (pre-experiment data) to reveal complex relationships between the outcome of interest and covariates. These insights are then applied to the data collected during the experiment (in-experiment data) to reduce the outcome variability in the A/B test. However, such methods are heavily reliant on the assumption that pre- and in-experiment data are highly correlated. This is questionable in online settings where trends change quickly due to heterogeneity in user behavior, the rapid development of technology, and the competitive landscape. In these settings, we cannot ignore that fluctuations in other factors may degrade the correlation between pre- and in-experiment data. We propose a two-stage framework for treatment effect estimation that adjusts for differences between pre- and in-experiment data, thereby producing treatment effect estimators with smaller variance than those associated with other variance reduction methods.
Inference is conducted by modeling and estimating the counterfactual outcome of each unit and performing a pairwise comparison. This method of inference is shown to be asymptotically unbiased, with an asymptotic variance that scales with the model’s predictive accuracy. We compare the variance reduction capabilities of the proposed method with several alternatives through simulation studies using both simulated data and real-world data. In doing so, we demonstrate that the proposed method’s variance reduction capabilities are at least as good as (and in some cases orders of magnitude better than) those of existing methods.

Optimal Decumulation for Retirees using Tontines: a Dynamic Neural Network Based Approach
(University of Waterloo, 2023-09-19) Shirazi, Mohammad
We introduce a new approach for optimizing neural networks (NN) using data to solve a stochastic control problem with stochastic constraints. We utilize customized activation functions for the output layers of the NN, enabling training through standard unconstrained optimization techniques. The resulting optimal solution provides a strategy for allocating and withdrawing assets over multiple periods for an individual with a defined contribution (DC) pension plan. The objective function of the control problem focuses on minimizing left-tail risk by considering expected withdrawals (EW) and expected shortfall (ES). Stochastic bound constraints ensure a minimum yearly withdrawal. By comparing our data-driven approach with the numerical results obtained from a computational framework based on the Hamilton-Jacobi-Bellman (HJB) Partial Differential Equation (PDE), we demonstrate that our method is capable of learning a solution that is close to optimal. We show that the proposed framework is capable of incorporating additional stochastic processes, particularly in cases related to the use of tontines.
We illustrate the benefits of using tontines for the decumulation problem and quantify the decrease in risk they bring. We also extend the framework to use more assets and provide test results to show the robustness of the control.

A Robust Neural Network Approach to Optimal Decumulation and Factor Investing in Defined Contribution Pension Plans
(University of Waterloo, 2023-09-18) Chen, Marc
In this thesis, we propose a novel data-driven neural network (NN) optimization framework for solving an optimal stochastic control problem under stochastic constraints. The NN utilizes customized output layer activation functions, which permit training via standard unconstrained optimization. The optimal solution of the two-asset problem yields a multi-period asset allocation and decumulation strategy for a holder of a defined contribution (DC) pension plan. The objective function of the optimal control problem is based on expected wealth withdrawn (EW) and expected shortfall (ES) that directly targets left-tail risk. The stochastic bound constraints enforce a guaranteed minimum withdrawal each year. We demonstrate that the data-driven NN approach is capable of learning a near-optimal solution by benchmarking it against the numerical results from a Hamilton-Jacobi-Bellman (HJB) Partial Differential Equation (PDE) computational framework. The NN framework has the advantage of being able to scale to high dimensional multi-asset problems, which we take advantage of in this work to investigate the effectiveness of various factor investing strategies in improving investment outcomes for the investor.

Algorithmic Behaviours of Adagrad in Underdetermined Linear Regression
(University of Waterloo, 2023-08-24) Rambidis, Andrew
With the high use of over-parameterized data in deep learning, the choice of optimizer in training plays a big role in a model’s ability to generalize well due to the existence of solution selection bias.
We consider the popular adaptive gradient method Adagrad and aim to study its convergence and algorithmic biases in the underdetermined linear regression regime. First, we prove that Adagrad converges in this problem regime. Subsequently, we empirically find that when using sufficiently small step sizes, Adagrad promotes diffuse solutions, in the sense of uniformity among the coordinates of the solution. Additionally, we see empirically and show theoretically that Adagrad’s solution, under the same conditions, exhibits greater diffusion than the solution obtained through gradient descent. This behaviour is unexpected, as conventional data science encourages the utilization of optimizers that attain sparser solutions. This preference arises due to some inherent advantages, such as helping to prevent overfitting and reducing the dimensionality of the data. However, we show that in the application of interpolation, diffuse solutions yield beneficial results when compared to solutions with localization; namely, we experimentally observe the success of diffuse solutions when interpolating a line via the weighted sum of spike-like functions. The thesis concludes with some suggestions for possible extensions of the content in future work.

Enhancing Recommender Systems with Causal Inference Methodologies
(University of Waterloo, 2023-08-22) Huang, Huiqing
In the current era of data deluge, recommender systems (RSs) are widely recognized as one of the most effective tools for information filtering. However, traditional RSs are founded on associational relationships among variables rather than causality, meaning they are unable to determine which factors actually affect user preference. In addition, the algorithm of conventional RS continues to recommend similar items to users, resulting in user aesthetic fatigue and ultimately the loss of customer sources.
Moreover, the generation of recommendations could be biased by the confounding effect, leading to inaccurate results. To tackle this series of challenges, causal inference for recommender systems (CI for RSs) has emerged as a new area of study. In this paper, we present four different propensity score estimation methods, namely hierarchical Poisson factorization (HPF), logistic regression, non-negative matrix factorization (NMF), and neural networks (NNs), and five causal effect estimation methods, namely linear regression, inverse probability weighting (IPW), zero-inflated Poisson (ZIP) regression, zero-inflated Negative Binomial (ZINB) regression, and doubly robust (DR) estimation. Additionally, we propose a new algorithm for parameter estimation based on the concept of alternating gradient descent (AGD). The reliability and precision of the study are evaluated on two distinct categories of datasets. Our research demonstrates that the causal RS can correctly infer causality from user and item characteristics to the final rating with an accuracy of 96%. Moreover, according to the de-confounded and de-biased recommendations, ratings can be increased by an average of 1.6 points (out of 4) for the Yahoo! R3 dataset and 1.2 points (out of 2) for the Restaurant and Consumer data.

Simple Yet Effective Pseudo Relevance Feedback with Rocchio’s Technique and Text Classification
(University of Waterloo, 2022-08-22) Liu, Yuqi
With the continuous growth of the Internet and the availability of large-scale collections, assisting users in locating the information they need becomes a necessity. Generally, an information retrieval system will process an input query and provide a list of ranked results. However, this process could be challenging due to the "vocabulary mismatch" issue between input queries and passages.
A well-known technique to address this issue is called "query expansion", which reformulates the given query by selecting and adding more relevant terms. Relevance feedback, as a form of query expansion, collects users' opinions on candidate passages and expands query terms from relevant ones. Pseudo relevance feedback assumes that the top documents in initial retrieval are relevant and rebuilds queries without any user interactions. In this thesis, we will discuss two implementations of pseudo relevance feedback: decades-old Rocchio's Technique and more recent text classification. As the reader might notice, neither technique is "novel" anymore, e.g., the emergence of Rocchio can even be dated back to the 1960s. Both were proposed and studied before the neural age, when texts were still mostly stored as bag-of-words representations. Today, transformers have been shown to advance information retrieval, and searching with transformer-based dense representations outperforms traditional bag-of-words searching on many challenging and complex ranking tasks. This motivates us to ask the following three research questions:
RQ1: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback with Rocchio's Technique still perform effectively with both sparse and dense representations?
RQ2: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback via text classification still perform effectively with both sparse and dense representations?
RQ3: Does applying pseudo relevance feedback with text classification on top of Rocchio's Technique result in further improvements?
To answer RQ1, we have implemented Rocchio's Technique with sparse representations based on the Anserini and Pyserini toolkits.
Building on a previous implementation of Rocchio's Technique with dense representations in the Pyserini toolkit, we can easily evaluate and compare the impact of Rocchio's Technique on effectiveness with both sparse and dense representations. By applying Rocchio's Technique to MS MARCO Passage and Document TREC Deep Learning topics, we can achieve about a 0.03-0.04 increase in average precision. It’s no surprise that Rocchio's Technique outperforms the BM25 baseline, but it's impressive to find that it is competitive with or even superior to RM3, a more common strong baseline, under most circumstances. Hence, we propose to switch to Rocchio's Technique as a more robust and general baseline in future studies. To our knowledge, pseudo relevance feedback via text classification using both positive and negative labels had not been well studied before our work. To answer RQ2, we have verified the effectiveness of pseudo relevance feedback via text classification with both sparse and dense representations. Three classifiers (LR, SVM, KNN) are trained, and all enhance effectiveness. We also observe that pseudo relevance feedback via text classification with dense representations yields greater improvement than with sparse ones. However, when we compare text classification to Rocchio's Technique, we find that Rocchio's Technique is superior to pseudo relevance feedback via text classification under all circumstances. In RQ3, the success of pseudo relevance feedback via text classification on BM25 + RM3 across four newswire collections in our previous paper motivates us to study the impact of pseudo relevance feedback via text classification on top of another query expansion result, Rocchio's Technique. However, unlike with RM3, we could not observe much difference in the two evaluation metrics after applying pseudo relevance feedback via text classification on top of Rocchio's Technique.
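The Rocchio-style pseudo relevance feedback described in this abstract can be sketched on toy bag-of-words vectors. This is a minimal illustration only, not the Anserini or Pyserini implementation: the function name `rocchio_prf`, the weights `alpha` and `beta`, the cutoff `k`, and the toy vocabulary are assumptions made for the example, and only positive feedback from the top-ranked documents is used.

```python
import numpy as np

def rocchio_prf(query_vec, doc_vecs, scores, k=2, alpha=1.0, beta=0.75):
    """Expand a query with the centroid of the top-k initially retrieved documents.

    alpha, beta, and k are illustrative defaults (assumptions), not values
    taken from the thesis.
    """
    # Pseudo relevance feedback: treat the k highest-scoring documents as relevant.
    top_k = np.argsort(-scores, kind="stable")[:k]
    centroid = doc_vecs[top_k].mean(axis=0)
    # Rocchio update without a negative-feedback term.
    return alpha * query_vec + beta * centroid

# Hypothetical vocabulary: ["neural", "ranking", "retrieval", "volleyball"]
docs = np.array([
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
query = np.array([0.0, 0.0, 1.0, 0.0])  # the query mentions only "retrieval"

initial_scores = docs @ query            # simple dot-product first-stage ranking
expanded = rocchio_prf(query, docs, initial_scores)
reranked_scores = docs @ expanded        # second-stage ranking with the expanded query
```

The expanded query gains weight on "neural" and "ranking", terms that were absent from the original query but occur in the top-ranked documents, which is how the technique addresses vocabulary mismatch.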
This work aims to explore some simple yet effective techniques which might be ignored in light of deep learning transformers. Instead of pursuing "more", we are aiming to find out something "less". We demonstrate the robustness and effectiveness of some "out-of-date" methods in the age of neural networks.

A Particle Filter Method of Inference for Stochastic Differential Equations
(University of Waterloo, 2022-05-31) Subramani, Pranav
Stochastic Differential Equations (SDE) serve as an extremely useful modelling tool in areas including ecology, finance, population dynamics, and physics. Yet, parameter inference for SDEs is notoriously difficult due to the intractability of the likelihood function. A common approach is to approximate the likelihood by way of data augmentation, then integrate over the latent variables using particle filtering techniques. In the Bayesian setting, the particle filter is typically combined with various Markov chain Monte Carlo (MCMC) techniques to sample from the parameter posterior. However, MCMC can be excessive when this posterior is well-approximated by a normal distribution, in which case estimating the posterior mean and variance by stochastic optimization presents a much faster alternative. This thesis explores this latter approach. Specifically, we use a particle filter tailored to SDE models and consider various methods for approximating the gradient and Hessian of the parameter log-posterior. Empirical results for several SDE models are presented.
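The idea of approximating an SDE likelihood by data augmentation and particle filtering can be sketched with a basic bootstrap particle filter. This is a generic textbook variant under stated assumptions, not the tailored filter or the gradient and Hessian approximations developed in the thesis: the Ornstein-Uhlenbeck model, the Gaussian observation noise, the N(0, 1) initial distribution, and all names and parameter values below are chosen purely for illustration.

```python
import numpy as np

def particle_loglik(y, theta, sigma, obs_sd, dt, n_steps, n_particles, rng):
    """Bootstrap particle filter estimate of log p(y | theta) for the assumed model
    dX_t = -theta * X_t dt + sigma dW_t,  y_n = X_{t_n} + N(0, obs_sd^2)."""
    x = rng.normal(0.0, 1.0, size=n_particles)  # assumed initial distribution
    loglik = 0.0
    for y_n in y:
        # Data augmentation: Euler-Maruyama substeps between observation times.
        for _ in range(n_steps):
            x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.normal(size=n_particles)
        # Weight each particle by the Gaussian observation density (in log space).
        logw = -0.5 * ((y_n - x) / obs_sd) ** 2 - np.log(obs_sd * np.sqrt(2.0 * np.pi))
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())  # log of the average unnormalized weight
        # Multinomial resampling to counter weight degeneracy.
        x = x[rng.choice(n_particles, size=n_particles, p=w / w.sum())]
    return loglik

# Simulate synthetic observations from the assumed model.
rng = np.random.default_rng(0)
theta_true, sigma, obs_sd, dt, n_steps = 1.0, 0.5, 0.1, 0.01, 10
state, y = 0.0, []
for _ in range(50):
    for _ in range(n_steps):
        state += -theta_true * state * dt + sigma * np.sqrt(dt) * rng.normal()
    y.append(state + obs_sd * rng.normal())

ll_true = particle_loglik(y, theta_true, sigma, obs_sd, dt, n_steps, 500, rng)
ll_wrong = particle_loglik(y, 50.0, sigma, obs_sd, dt, n_steps, 500, rng)
```

Comparing `ll_true` with `ll_wrong` gives a rough sense of how the noisy likelihood estimate separates plausible from implausible parameter values; an optimizer or MCMC sampler would sit on top of an estimator of this kind.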