
Please use this identifier to cite or link to this item: http://hdl.handle.net/10012/3891

Title: 'Healthy' Coreference: Applying Coreference Resolution to the Health Education Domain
Authors: Hirtle, David Z.
Keywords: coreference resolution; anaphora; computational linguistics; natural language processing; corpus analysis; health education
Approved Date: 26-Aug-2008
Date Submitted: 2008
Abstract: This thesis investigates coreference and its resolution within the domain of health education. Coreference is the relationship between two linguistic expressions that refer to the same real-world entity, and resolution involves identifying this relationship among sets of referring expressions. Coreference resolution is considered among the most difficult problems in Artificial Intelligence; in some cases, resolution is impossible even for humans. For example, "she" in the sentence "Lynn called Jennifer while she was on vacation" is genuinely ambiguous: the vacationer could be either Lynn or Jennifer.

There are three primary motivations for this thesis. First, health education has never before been studied in this context; so far, the vast majority of coreference research has focused on news. Second, achieving domain-independent resolution is unlikely without understanding the extent to which coreference varies across different genres. Finally, coreference pervades language and is an essential part of coherent discourse. Its effective use is a key component of easy-to-understand health education materials, where readability is paramount.

No suitable corpus of health education materials existed, so our first step was to create one. The comprehensive analysis of this corpus, which required manual annotation of coreference, confirmed our hypothesis that the coreference used in health education differs substantially from that in previously studied domains. This analysis was then used to shape the design of a knowledge-lean algorithm for resolving coreference. This algorithm performed surprisingly well on this corpus, e.g., successfully resolving over 85% of all pronouns when evaluated on unseen data.

Despite the importance of coreferentially annotated corpora, only a handful are known to exist, likely because of the difficulty and cost of reliably annotating coreference. The paucity of genres represented in these existing annotated corpora creates an implicit bias in domain-independent coreference resolution. In an effort to address these issues, we plan to make our health education corpus available to the wider research community, hopefully encouraging a broader focus in the future.
Program: Computer Science
Department: School of Computer Science
Degree: Master of Mathematics
URI: http://hdl.handle.net/10012/3891
Appears in Collections: Electronic Theses and Dissertations (UW); Faculty of Mathematics Theses and Dissertations

Files in This Item:

File: dhirtle-healthy-coreference.pdf
Size: 1.16 MB
Format: Adobe PDF


This item is protected by original copyright

All items in UWSpace are protected by copyright, with all rights reserved.

 
