Bias-Corrected Machine Learning Prediction for Non-random Surveys

Kislenko, Olena

dc.contributor.author	Kislenko, Olena
dc.date.accessioned	2025-08-19T14:18:04Z
dc.date.available	2025-08-19T14:18:04Z
dc.date.issued	2025-08-19
dc.date.submitted	2025-08-14
dc.description.abstract	Tuberculosis (TB) remains a significant global health concern, particularly in low- and middle-income countries, where diagnostic resources are often limited. As computational tools become increasingly accessible, machine learning (ML) presents promising avenues to improve TB screening and diagnosis, particularly when only survey data are available. However, many of these surveys rely on non-probability sampling strategies, such as quota sampling, which can introduce substantial bias into predictive models. If not adequately accounted for, this bias may distort model performance and lead to inaccurate conclusions. In this thesis, we propose a simulation-based bias correction framework that utilizes auxiliary variables to more accurately represent the full population and enhance prediction reliability. Multiple classification methods were applied to real-world TB screening data from Nigeria, India, and Indonesia, and the performance of the methods was assessed both with and without bias correction. The simulation results demonstrate that XGBoost and support vector machines with a radial kernel consistently outperform other models in terms of Matthews correlation coefficient (MCC) and F1 score. However, specificity remains low due to the class imbalance in the data, with actual negative cases being underrepresented. Among the auxiliary variables considered, weight loss produced the strongest predictive improvements, aligning with prior clinical findings. Neural networks, though occasionally competitive, exhibited greater variability across simulation runs and generally lower MCC values compared to the best-performing traditional ML models. Their higher computational demands and sensitivity to sample composition may limit the practicality of neural networks in smaller or imbalanced datasets. 
When the models were applied to real-life data, the relative performance of the predictive models shifted depending on the auxiliary variable used for correction. This variability emphasizes the importance of evaluating multiple corrected models in practical applications, as no single approach is universally optimal across all scenarios. In conclusion, the thesis extends existing methodologies to classification problems and demonstrates their relevance in the diagnosis of TB. Our findings support the integration of bias correction into predictive modelling with non-random samples, especially in global health contexts where data imbalance and limited resources are common.
dc.identifier.uri	https://hdl.handle.net/10012/22195
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.title	Bias-Corrected Machine Learning Prediction for Non-random Surveys
dc.type	Master Thesis
uws-etd.degree	Master of Mathematics
uws-etd.degree.department	Data Science
uws-etd.degree.discipline	Data Science
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0
uws.contributor.advisor	Negeri, Zelalem
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Kislenko_Olena.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format

Collections

Theses
Data Science