dc.contributor.author: Liu, Chenyu
dc.date.accessioned: 2013-01-10 16:19:52 (GMT)
dc.date.available: 2013-01-10 16:19:52 (GMT)
dc.date.issued: 2013-01-10T16:19:52Z
dc.date.submitted: 2012
dc.identifier.uri: http://hdl.handle.net/10012/7188
dc.description.abstract: Medicine and health are information-intensive domains in which the volume of data has been increasing constantly. To make full use of these data, Knowledge Discovery in Databases (KDD) has been developed as a comprehensive pathway for discovering valid and unsuspected patterns and trends that are both understandable and useful to data analysts. The present study aimed to investigate, for the first time, the entire KDD process of developing a classification model for cardiovascular disease (CVD) from a Canadian dataset. The research data source was the Canadian Heart Health Database, which contains 265 easily collected variables and 23,129 instances from ten Canadian provinces. Many practical issues involved in the different steps of the integrated process were addressed, and possible solutions were suggested based on the experimental results. Five specific learning schemes representing five distinct KDD approaches were employed, as they had never been compared with one another. In addition, two improvement approaches, cost-sensitive learning and ensemble learning, were examined. The performance of the developed models was measured in many aspects. The dataset was prepared through data cleaning and missing value imputation. Three pairs of experiments demonstrated that dataset balancing and outlier removal had a positive influence on the classifier, but that variable normalization was not helpful. Three combinations of subset generation method and evaluation function were tested in the variable subset selection phase; the combination of Best-First search and Correlation-based Feature Selection performed comparably and was retained for its other benefits. Among the five learning schemes investigated, the C4.5 decision tree achieved the best performance on the classification of CVD, followed by Multilayer Feed-forward Network, K-Nearest Neighbor, Logistic Regression, and Naïve Bayes.
Cost-sensitive learning, exemplified by the MetaCost algorithm, failed to outperform the single C4.5 decision tree when the cost matrix was varied from 5:1 to 1:7. In contrast, the models developed through ensemble modeling, especially the AdaBoost M1 algorithm, outperformed the other models. Although the best-performing model might be suitable for CVD screening in the general Canadian population, it is not ready for use in practice. I propose some criteria to improve the further evaluation of the model. Finally, I describe some of the limitations of the study and propose potential solutions to address them throughout the KDD process. Such possibilities should be explored in further research. [en]
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: classification learning [en]
dc.subject: KDD process [en]
dc.subject: CVD [en]
dc.subject: Canadian database [en]
dc.title: Investigating the Process of Developing a KDD Model for the Classification of Cases with Cardiovascular Disease Based on a Canadian Database [en]
dc.type: Master Thesis [en]
dc.pending: false [en]
dc.subject.program: Health Studies and Gerontology [en]
uws-etd.degree.department: Health Studies and Gerontology [en]
uws-etd.degree: Master of Science [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]


UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.