Show simple item record

dc.contributor.author: Shamsi, Amir Mehdi
dc.date.accessioned: 2020-09-02 17:45:32 (GMT)
dc.date.available: 2020-09-02 17:45:32 (GMT)
dc.date.issued: 2020-09-02
dc.date.submitted: 2020-08-25
dc.identifier.uri: http://hdl.handle.net/10012/16224
dc.description.abstract: There is a broad range of information available on the Internet, some of which is considered more credible than the rest. People weigh different credibility aspects when evaluating a web page; however, many web users find it difficult to determine the credibility of all types of web pages. An autonomous system that can analyze credibility factors extracted from a web page to estimate the page's credibility could help users make better decisions about the perceived credibility of web information. This research investigated the applicability of several machine learning approaches to the evaluation of web page credibility. First, six credibility categories were identified from peer-reviewed literature. Then, their related credibility features were investigated and automatically extracted from web page content, metadata, or external resources. Three sets of features (i.e., automatically extracted credibility features, bag-of-words features, and a combination of both) were used in classification experiments to compare their impact on the performance of the autonomous credibility estimation model. The Content Credibility Corpus (C3) dataset was used to develop and test the performance of the model developed in this research. XGBoost achieved the best weighted-average F1 score with the extracted credibility features, while the logistic regression classifier performed best with both the bag-of-words features and the combined feature vector. To begin to explore the legitimacy of this approach, a crowdsourcing task was conducted to evaluate how the output of the proposed model aligns with the credibility ratings given by human annotators. Thirty web pages were selected from the C3 dataset to find out how current users' ratings correlate with the ratings that were used as ground truth to train the model.
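The classification experiments described above can be sketched as follows. This is a minimal illustration with synthetic features, using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost; the actual C3 credibility features, preprocessing, and model settings from the thesis are not reproduced here:

```python
# Sketch: comparing two classifiers on a feature set by weighted-average F1.
# Features are synthetic stand-ins for the thesis's credibility features.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic multi-class data in place of extracted credibility features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

for name, clf in [("gradient boosting", GradientBoostingClassifier(random_state=0)),
                  ("logistic regression", LogisticRegression(max_iter=1000))]:
    clf.fit(X_tr, y_tr)
    score = f1_score(y_te, clf.predict(X_te), average="weighted")
    print(f"{name}: weighted F1 = {score:.3f}")
```

In practice, the same loop would be run once per feature set (extracted features, bag-of-words, and the combination) to produce the comparison reported in the abstract.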
In addition, 30 new web pages were selected to explore how well the algorithm generalizes to classifying new web pages. Participants were asked to rate the credibility of each web page based on a 5-point Likert scale. Sixty-nine crowdsourced participants evaluated the credibility of the 60 web pages for a total of 600 ratings (10 per page). Spearman's correlation between the average credibility scores given by participants and the original scores in the C3 dataset indicates a moderate positive correlation: r = 0.44, p < 0.02. A contingency table was created to compare the scores predicted by the model with the scores given by participants. Overall, the model achieved an accuracy of 80%, which indicates that the proposed model can generalize to new web pages. The model outlined in this thesis outperformed previous work by using a promising set of features, some of which were presented in this research for the first time.
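The agreement analysis between crowd ratings and the C3 ground truth can be sketched with SciPy's `spearmanr`. The ratings below are invented for illustration only; the thesis reports r = 0.44, p < 0.02 over its actual 30-page sample:

```python
# Sketch: Spearman rank correlation between two sets of credibility ratings.
# Both rating lists are fabricated for illustration.
from scipy.stats import spearmanr

crowd_means = [3.2, 4.1, 2.5, 4.8, 3.9, 1.7, 4.4, 2.9, 3.6, 4.0]  # avg crowd scores
c3_scores   = [3.0, 4.5, 2.0, 5.0, 3.5, 2.5, 4.0, 3.5, 3.0, 4.5]  # C3 ground truth

r, p = spearmanr(crowd_means, c3_scores)
print(f"Spearman r = {r:.2f}, p = {p:.3f}")
```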
dc.language.iso: en
dc.publisher: University of Waterloo
dc.subject: web page credibility
dc.subject: machine learning
dc.title: Development of a machine learning-based model to autonomously estimate web page credibility
dc.type: Master Thesis
dc.pending: false
uws-etd.degree.department: Systems Design Engineering
uws-etd.degree.discipline: Systems Design Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.degree: Master of Applied Science
uws.contributor.advisor: Boger, Jennifer
uws.contributor.affiliation1: Faculty of Engineering
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.typeOfResource: Text
uws.peerReviewStatus: Unreviewed
uws.scholarLevel: Graduate



