Information Extraction for Low-Resource Schemas
| dc.contributor.author | Xu, Justin | |
| dc.date.accessioned | 2026-06-04T17:19:56Z | |
| dc.date.available | 2026-06-04T17:19:56Z | |
| dc.date.issued | 2026-06-04 | |
| dc.date.submitted | 2026-05-04 | |
| dc.description.abstract | Information Extraction (IE) is a set of important tasks in the study of creating structured data such as knowledge graphs from unstructured data such as text. The past paradigm of IE focused on models with specialized neural network architectures, usually based on transformer encoders. These models typically focus on a single subtask of IE, follow a single schema of entity and relation types, and are trained via supervised learning on large datasets of annotated texts. Meanwhile, the current paradigm of IE, called Universal IE (UIE), involves large language models which can generalize across IE subtasks and to completely unseen schemas, but which lack other abilities such as entity grounding and calibration. First, we discuss structural consistency, a new measure of robustness in information extraction based on compositionality. We present structural consistency post-training (SCPT) as a data augmentation method to boost structural consistency for a wide range of model architectures. Besides greatly improving robustness, SCPT significantly reduces the amount of labelled data needed to achieve the same level of performance when training specialized IE models. Second, we use reasoning-based data augmentation techniques to gather AdaIE, a very large collection of human-annotated information extraction schemas. We diverge from UIE and align the dataset with a new task we call Guided Information Extraction (GIE). GIE emphasizes the tight grounding and schema-following requirements which have been largely neglected in UIE. Evaluations reveal that state-of-the-art UIE methods can be surpassed by recent commercial large language models (LLMs). Although those LLMs still perform below human level on AdaIE, they are rapidly advancing. Overall, we hope that both works presented will steer the IE research community towards unifying the strengths of the old and new IE paradigms, while casting light on their weaknesses.
| |
| dc.identifier.uri | https://hdl.handle.net/10012/23544 | |
| dc.language.iso | en | |
| dc.pending | false | |
| dc.publisher | University of Waterloo | en |
| dc.relation.uri | https://github.com/xujustinj/t2g-consistency | |
| dc.relation.uri | https://github.com/adanomad/AdaIE | |
| dc.subject | information extraction | |
| dc.subject | knowledge graphs | |
| dc.subject | large language models | |
| dc.subject | relation extraction | |
| dc.subject | entity extraction | |
| dc.subject | data augmentation | |
| dc.subject | consistency training | |
| dc.subject | dataset | |
| dc.subject | natural language processing | |
| dc.title | Information Extraction for Low-Resource Schemas | |
| dc.type | Master Thesis | |
| uws-etd.degree | Master of Mathematics | |
| uws-etd.degree.department | David R. Cheriton School of Computer Science | |
| uws-etd.degree.discipline | Computer Science | |
| uws-etd.degree.grantor | University of Waterloo | en |
| uws-etd.embargo.terms | 0 | |
| uws.comment.hidden | The GitHub code repositories are currently private due to ongoing or anticipated conference submissions, and they may not be the final resting places of the code. I would like them to be hidden until I give notice that the repositories are available (or have moved to another location). However, if that is not possible, I would prefer that the repositories not be included at all. | |
| uws.contributor.advisor | Poupart, Pascal | |
| uws.contributor.affiliation1 | Faculty of Mathematics | |
| uws.peerReviewStatus | Unreviewed | en |
| uws.published.city | Waterloo | en |
| uws.published.country | Canada | en |
| uws.published.province | Ontario | en |
| uws.scholarLevel | Graduate | en |
| uws.typeOfResource | Text | en |