Show simple item record

dc.contributor.authorHeidarikhazaei, Alireza
dc.date.accessioned2021-12-20 14:40:11 (GMT)
dc.date.available2021-12-20 14:40:11 (GMT)
dc.date.issued2021-12-20
dc.date.submitted2021-12-14
dc.identifier.urihttp://hdl.handle.net/10012/17779
dc.description.abstractMany errors cannot be detected or repaired without taking into account the underlying structure and dependencies in the dataset. One way of modeling the structure of the data is graphical models. Graphical models combine probability theory and graph theory in order to address one of the key objectives in designing and fitting probabilistic models, which is to capture dependencies among relevant random variables. Structure representation helps to understand the side effect of the errors or it reveals correct interrelationships between data points. Hence, principled representation of structure in prediction and cleaning tasks of dirty data is essential for the quality of downstream analytical results. Existing structured prediction research considers limited structures and configurations, with little attention to the performance limitations and how well the problem can be solved in more general settings where the structure is complex and rich. In this dissertation, I present the following thesis: By leveraging the underlying dependency and structure in machine learning models, we can effectively detect and clean errors via pragmatic structured predictions techniques. To highlight the main contributions: I investigate prediction algorithms and systems on dirty data with a more realistic structure and dependencies to help deploy this type of learning in more pragmatic settings. Specifically, We introduce a few-shot learning framework for error detection that uses structure-based features of data such as denial constraints violations and Bayesian network as co-occurrence feature. I have studied the problem of recovering the latent ground truth labeling of a structured instance. Then, I consider the problem of mining integrity constraints from data and specifically using the sampling methods for extracting approximate denial constraints. Finally, I have introduced an ML framework that uses solitary and structured data features to solve the problem of record fusion.en
dc.language.isoenen
dc.publisherUniversity of Waterlooen
dc.relation.urihttps://github.com/HoloClean/HoloCleanen
dc.subjectdata cleaningen
dc.subjectmachine learningen
dc.subjectgraphical modelsen
dc.subjectstructure predictionen
dc.subjecterror detectionen
dc.subjectdata fusionen
dc.subjectapproximate denial constraintsen
dc.subjectsamplingen
dc.titleStructured Prediction on Dirty Datasetsen
dc.typeDoctoral Thesisen
dc.pendingfalse
uws-etd.degree.departmentDavid R. Cheriton School of Computer Scienceen
uws-etd.degree.disciplineComputer Scienceen
uws-etd.degree.grantorUniversity of Waterlooen
uws-etd.degreeDoctor of Philosophyen
uws-etd.embargo.terms0en
uws.contributor.advisorIlyas, Ihab
uws.contributor.affiliation1Faculty of Mathematicsen
uws.published.cityWaterlooen
uws.published.countryCanadaen
uws.published.provinceOntarioen
uws.typeOfResourceTexten
uws.peerReviewStatusUnrevieweden
uws.scholarLevelGraduateen


Files in this item

Thumbnail

This item appears in the following Collection(s)

Show simple item record


UWSpace

University of Waterloo Library
200 University Avenue West
Waterloo, Ontario, Canada N2L 3G1
519 888 4883

All items in UWSpace are protected by copyright, with all rights reserved.

DSpace software

Service outages