Counterfactual Data Augmentation for Regression
| dc.contributor.author | Mohebbi, Hossein | |
| dc.date.accessioned | 2026-01-23T15:50:00Z | |
| dc.date.available | 2026-01-23T15:50:00Z | |
| dc.date.issued | 2026-01-23 | |
| dc.date.submitted | 2026-01-20 | |
| dc.description.abstract | Data-driven modeling in real-world regression tasks often suffers from limited training samples, high collection costs, and noisy observations. While data augmentation has revolutionized fields such as computer vision and natural language processing by leveraging domain-specific symmetries, effective techniques for tabular regression remain elusive. Existing approaches, ranging from geometric interpolation to deep generative models, often fail to preserve the underlying noise structure of the data, leading to the generation of unrealistic samples that can degrade predictive performance. This thesis proposes a novel framework called Counterfactual Residual Data Augmentation (CRDA). Our method is founded on the theoretical principle of Residual Invariance, which posits that once a regressor has modeled the systematic component of the data, the remaining residual noise often remains stable under small perturbations of carefully selected features. We exploit this invariance to synthesize valid counterfactual samples, which are data points with perturbed features but preserved residual noise. We formalize this process through the lens of structural causal models, establishing conditions under which the residual is conditionally independent of specific feature subsets. We provide a practical, model-agnostic algorithm that integrates feature selection heuristics and statistical safety checks to ensure augmentation is applied only when empirically beneficial. Through extensive evaluation across diverse benchmark datasets, we demonstrate that CRDA consistently reduces test error in data-scarce regimes. Specifically, our method reduces the Mean Squared Error (MSE) of Multi-Layer Perceptrons by an average of 22.9% and XGBoost regressors by 6.4%. Furthermore, comparisons against state-of-the-art baselines, including Mixup variants and diffusion-based generative models, reveal that CRDA offers a more robust and statistically grounded remedy for noise-prone, small-sample regression tasks. Finally, we provide a production-ready, open-source implementation of our framework to encourage applications in real-world tabular regression tasks. | |
| dc.identifier.uri | https://hdl.handle.net/10012/22893 | |
| dc.language.iso | en | |
| dc.pending | false | |
| dc.publisher | University of Waterloo | en |
| dc.subject | Regression | |
| dc.subject | Data Augmentation | |
| dc.subject | Machine Learning | |
| dc.subject | Counterfactual Reasoning | |
| dc.title | Counterfactual Data Augmentation for Regression | |
| dc.type | Master Thesis | |
| uws-etd.degree | Master of Mathematics | |
| uws-etd.degree.department | David R. Cheriton School of Computer Science | |
| uws-etd.degree.discipline | Computer Science | |
| uws-etd.degree.grantor | University of Waterloo | en |
| uws-etd.embargo.terms | 0 | |
| uws.contributor.advisor | Poupart, Pascal | |
| uws.contributor.affiliation1 | Faculty of Mathematics | |
| uws.peerReviewStatus | Unreviewed | en |
| uws.published.city | Waterloo | en |
| uws.published.country | Canada | en |
| uws.published.province | Ontario | en |
| uws.scholarLevel | Graduate | en |
| uws.typeOfResource | Text | en |
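
The abstract describes the augmentation step only in words: fit a regressor, perturb selected features slightly, and carry each sample's residual over to the perturbed point. The Python sketch below is one possible reading of that description and is not the thesis's released implementation. The function names (`fit_linear`, `augment_with_residuals`), the `scale` parameter, the Gaussian perturbation, and the least-squares regressor are all assumptions made for illustration, and the feature selection heuristics and statistical safety checks mentioned in the abstract are left out.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; returns a prediction function.

    Stand-in regressor for illustration only (the abstract's method is
    described as model-agnostic).
    """
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef


def augment_with_residuals(X, y, predict, feature_idx, scale=0.05, seed=0):
    """Hypothetical helper: build one counterfactual copy of every sample.

    Assumes the residual is invariant to small changes in the columns
    listed in `feature_idx`; choosing those columns is not handled here.
    """
    rng = np.random.default_rng(seed)
    residuals = y - predict(X)  # noise left over after the systematic fit

    # Perturb only the selected columns, scaled by each column's spread.
    X_cf = X.copy()
    step = scale * X[:, feature_idx].std(axis=0)
    noise = rng.normal(0.0, 1.0, size=(len(X), len(feature_idx)))
    X_cf[:, feature_idx] += noise * step

    # New target = model prediction at the perturbed point + original residual.
    y_cf = predict(X_cf) + residuals
    return np.vstack([X, X_cf]), np.concatenate([y, y_cf])


# Toy data: feature 2 has no effect on y, so it is the one perturbed here.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=20)

predict = fit_linear(X, y)
X_aug, y_aug = augment_with_residuals(X, y, predict, feature_idx=[2])
print(X_aug.shape, y_aug.shape)
```

The augmented set stacks the original samples on top of their counterfactual copies, so a downstream model can be retrained on `X_aug` and `y_aug` directly. Whether doing so helps depends on the invariance assumption holding for the chosen features.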