A Statistical Analysis of the Aggregation of Crowdsourced Labels

Szepesvari, David

dc.contributor.author	Szepesvari, David
dc.date.accessioned	2015-10-29T19:24:03Z
dc.date.available	2015-10-29T19:24:03Z
dc.date.issued	2015-10-29
dc.date.submitted	2015
dc.description.abstract	Crowdsourcing, due to its inexpensive and timely nature, has become a popular method of collecting data that is difficult for computers to generate. We focus on using this method of human computation to gather labels for classification tasks, to be used for machine learning. However, data gathered this way may be of varying quality, ranging from spam to perfect. We aim to maintain the cost-effective property of crowdsourcing, while also obtaining quality results. Towards a solution, we have multiple workers label the same problem instance, and afterwards aggregate the responses into one label. We study what aggregation method to use, and what guarantees we can provide on its estimates. Different crowdsourcing models call for different techniques; we outline and organize various directions taken in the literature, and focus on the Dawid-Skene model. In this setting each instance has a true label, workers are independent, and the performance of each individual is assumed to be uniform over all instances, in the sense that she has an inherent skill that governs the probability with which she labels correctly. Her skill is unknown to us. Aggregation methods aim to find the true label of each task based solely on the labels the workers reported. We measure the performance of these methods by the probability with which the estimates they output match the true label. In practice, a popular procedure is to run the EM algorithm to find estimates of the skills and labels. However, this method is not directly guaranteed to perform well in our measure. We collect and evaluate theoretical results that bound the error of various aggregation methods, including specific variants of EM. Finally, we prove a guarantee on the error suffered by the maximum likelihood estimator, the global optimum of the function that EM aims to numerically optimize.	en
dc.identifier.uri	http://hdl.handle.net/10012/9841
dc.language.iso	en	en
dc.pending	false
dc.publisher	University of Waterloo
dc.subject	Statistics	en
dc.subject	Machine Learning	en
dc.subject	Maximum likelihood	en
dc.subject	Crowdsourcing	en
dc.subject.program	Computer Science	en
dc.title	A Statistical Analysis of the Aggregation of Crowdsourced Labels	en
dc.type	Master Thesis	en
uws-etd.degree	Master of Mathematics	en
uws-etd.degree.department	Computer Science (David R. Cheriton School of)	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
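
As an illustration of the aggregation setting the abstract describes, here is a minimal sketch that is not taken from the thesis. It assumes binary labels coded as +1/-1 and a one-coin version of the Dawid-Skene model, in which worker j labels correctly with probability `skills[j]` on every instance. It also assumes the skills are known, which the abstract says is not the case in practice, so the weighted rule is only an oracle reference point. The function names and the example numbers are hypothetical.

```python
import math


def majority_vote(labels):
    """Aggregate +1/-1 labels for one instance by unweighted majority.

    Ties are broken in favour of +1 (an arbitrary choice for this sketch).
    """
    return 1 if sum(labels) >= 0 else -1


def weighted_vote(labels, skills):
    """Aggregate +1/-1 labels, weighting each worker by the log-odds of her skill.

    Assumes every skill lies strictly between 0 and 1. With known skills and a
    uniform prior on the true label, this is the maximum a posteriori label
    under the one-coin model assumed here.
    """
    score = sum(math.log(p / (1.0 - p)) * y for y, p in zip(labels, skills))
    return 1 if score >= 0 else -1


# Hypothetical instance: one highly skilled worker disagrees with two
# workers who are only slightly better than random guessing.
skills = [0.95, 0.55, 0.55]
labels = [1, -1, -1]

mv = majority_vote(labels)          # follows the two weak workers
wv = weighted_vote(labels, skills)  # follows the single strong worker
print(mv, wv)
```

On this instance the two rules disagree, which is one way to see why estimating the unknown skills (for example with EM, as the abstract mentions) matters for the quality of the aggregated label.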

Files

Original bundle

Name: Szepesvari_David.pdf
Size: 841.8 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.17 KB
Format: Item-specific license agreed to upon submission

Collections

Theses
Computer Science