Show simple item record

dc.contributor.author: Shamsi, Amir Mehdi
dc.date.accessioned: 2020-09-02 17:45:32 (GMT)
dc.date.available: 2020-09-02 17:45:32 (GMT)
dc.date.issued: 2020-09-02
dc.date.submitted: 2020-08-25
dc.identifier.uri: http://hdl.handle.net/10012/16224
dc.description.abstract: There is a broad range of information available on the Internet, some of which is considered more credible than the rest. People weigh different credibility aspects when evaluating a web page; however, many web users find it difficult to determine the credibility of all types of web pages. An autonomous system that can analyze credibility factors extracted from a web page to estimate the page's credibility could help users make better decisions about the perceived credibility of web information. This research investigated the applicability of several machine learning approaches to the evaluation of web page credibility. First, six credibility categories were identified from peer-reviewed literature. Then, their related credibility features were investigated and automatically extracted from web page content, metadata, or external resources. Three sets of features (i.e., automatically extracted credibility features, bag-of-words features, and a combination of both) were used in classification experiments to compare their impact on the performance of the autonomous credibility estimation model. The Content Credibility Corpus (C3) dataset was used to develop and test the performance of the model developed in this research. XGBoost achieved the best weighted-average F1 score with the extracted credibility features, while the logistic regression classifier performed best with both the bag-of-words features and the combined feature vector. To begin to explore the legitimacy of this approach, a crowdsourcing task was conducted to evaluate how the output of the proposed model aligns with the credibility ratings given by human annotators. Thirty web pages were selected from the C3 dataset to find out how current users' ratings correlate with the ratings that were used as ground truth to train the model.
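The classification experiments described above can be sketched as follows. This is a minimal illustration with synthetic features, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost; the actual C3 credibility features, preprocessing, and model settings from the thesis are not reproduced here:

```python
# Sketch: comparing two classifiers on a feature set by weighted-average F1.
# Features are synthetic stand-ins for the thesis's credibility features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic multi-class data in place of extracted credibility features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("gradient boosting", GradientBoostingClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    score = f1_score(y_te, clf.predict(X_te), average="weighted")
    print(f"{name}: weighted F1 = {score:.3f}")
```

In practice, the same loop would be run once per feature set (extracted features, bag-of-words, and the combination) to produce the comparison reported in the abstract.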
In addition, 30 new web pages were selected to explore how well the algorithm generalizes to classifying new web pages. Participants were asked to rate the credibility of each web page based on a 5-point Likert scale. Sixty-nine crowdsourced participants evaluated the credibility of the 60 web pages for a total of 600 ratings (10 per page). Spearman's correlation between the average credibility scores given by participants and the original scores in the C3 dataset indicates a moderate positive correlation: r = 0.44, p < 0.02. A contingency table was created to compare the scores predicted by the model with the scores given by participants. Overall, the model achieved an accuracy of 80%, which indicates that the proposed model can generalize to new web pages. The model outlined in this thesis outperformed previous work by using a promising set of features, some of which were presented in this research for the first time.
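The agreement analysis between crowd ratings and the C3 ground truth can be sketched with SciPy's `spearmanr`. The ratings below are invented for illustration only; the thesis reports r = 0.44, p < 0.02 over its actual 30-page sample:

```python
# Sketch: Spearman rank correlation between two sets of credibility ratings.
# Both rating lists are fabricated for illustration.
from scipy.stats import spearmanr

crowd_means = [3.2, 4.1, 2.5, 4.8, 3.9, 1.7, 4.4, 2.9, 3.6, 4.0]  # avg crowd scores
c3_scores   = [3.0, 4.5, 2.0, 5.0, 3.5, 2.5, 4.0, 3.5, 3.0, 4.5]  # C3 ground truth

r, p = spearmanr(crowd_means, c3_scores)
print(f"Spearman r = {r:.2f}, p = {p:.3f}")
```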
dc.language.iso: en
dc.publisher: University of Waterloo
dc.subject: web page credibility
dc.subject: machine learning
dc.title: Development of a machine learning-based model to autonomously estimate web page credibility
dc.type: Master Thesis
dc.pending: false
uws-etd.degree.department: Systems Design Engineering
uws-etd.degree.discipline: Systems Design Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.degree: Master of Applied Science
uws.contributor.advisor: Boger, Jennifer
uws.contributor.affiliation1: Faculty of Engineering
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.typeOfResource: Text
uws.peerReviewStatus: Unreviewed
uws.scholarLevel: Graduate



