Bias-Corrected Machine Learning Prediction for Non-random Surveys

dc.contributor.author: Kislenko, Olena
dc.date.accessioned: 2025-08-19T14:18:04Z
dc.date.available: 2025-08-19T14:18:04Z
dc.date.issued: 2025-08-19
dc.date.submitted: 2025-08-14
dc.description.abstract: Tuberculosis (TB) remains a significant global health concern, particularly in low- and middle-income countries, where diagnostic resources are often limited. As computational tools become increasingly accessible, machine learning (ML) offers promising ways to improve TB screening and diagnosis, particularly when only survey data are available. However, many of these surveys rely on non-probability sampling strategies, such as quota sampling, which can introduce substantial bias into predictive models. If not adequately accounted for, this bias may distort model performance and lead to inaccurate conclusions. In this thesis, we propose a simulation-based bias correction framework that uses auxiliary variables to represent the full population more accurately and to make predictions more reliable. Multiple classification methods were applied to real-world TB screening data from Nigeria, India, and Indonesia, and their performance was assessed both with and without bias correction. The simulation results demonstrate that XGBoost and support vector machines with a radial kernel consistently outperform other models in terms of Matthews correlation coefficient (MCC) and F1 score. However, specificity remains low because of class imbalance in the data, in which actual negative cases are underrepresented. Among the auxiliary variables considered, weight loss produced the strongest predictive improvements, aligning with prior clinical findings. Neural networks, though occasionally competitive, exhibited greater variability across simulation runs and generally lower MCC values than the best-performing traditional ML models. Their higher computational demands and sensitivity to sample composition may limit their practicality for smaller or imbalanced datasets.
When the models were applied to real-world data, their relative performance shifted depending on the auxiliary variable used for correction. This variability emphasizes the importance of evaluating multiple corrected models in practical applications, as no single approach is optimal across all scenarios. In conclusion, the thesis extends existing methodologies to classification problems and demonstrates their relevance to the diagnosis of TB. Our findings support the integration of bias correction into predictive modelling with non-random samples, especially in global health contexts where data imbalance and limited resources are common.
dc.identifier.uri: https://hdl.handle.net/10012/22195
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo [en]
dc.title: Bias-Corrected Machine Learning Prediction for Non-random Surveys
dc.type: Master Thesis
uws-etd.degree: Master of Mathematics
uws-etd.degree.department: Data Science
uws-etd.degree.discipline: Data Science
uws-etd.degree.grantor: University of Waterloo [en]
uws-etd.embargo.terms: 0
uws.contributor.advisor: Negeri, Zelalem
uws.contributor.affiliation1: Faculty of Mathematics
uws.peerReviewStatus: Unreviewed [en]
uws.published.city: Waterloo [en]
uws.published.country: Canada [en]
uws.published.province: Ontario [en]
uws.scholarLevel: Graduate [en]
uws.typeOfResource: Text [en]

Files

Original bundle

Name: Kislenko_Olena.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed upon to submission