Unsupervised Syntactic Structure Induction in Natural Language Processing

Deshmukh, Anup Anand

dc.contributor.advisor	Li, Ming
dc.contributor.advisor	Lin, Jimmy
dc.contributor.author	Deshmukh, Anup Anand
dc.date.accessioned	2021-09-07T17:20:46Z
dc.date.available	2021-09-07T17:20:46Z
dc.date.issued	2021-09-07
dc.date.submitted	2021-08-25
dc.description.abstract	This work addresses unsupervised chunking as a task for syntactic structure induction, which could help in understanding the linguistic structures of human languages, especially low-resource languages. In chunking, the words of a sentence are grouped into phrases (also known as chunks) in a non-hierarchical fashion. Understanding text fundamentally requires finding noun and verb phrases, which makes unsupervised chunking an important step in several real-world applications. In this thesis, we establish several baselines and discuss our three-step knowledge transfer approach for unsupervised chunking. In the first step, we take advantage of state-of-the-art unsupervised parsers, and in the second, we heuristically induce chunk labels from them. We propose a simple heuristic that requires no supervision from annotated grammar and generates reasonable (albeit noisy) chunks. In the third step, we design a hierarchical recurrent neural network (HRNN) that learns from these pseudo ground-truth labels. The HRNN explicitly models the composition of words into chunks and smooths out the noise from the heuristically induced labels. Our HRNN a) maintains both word-level and phrase-level representations and b) explicitly handles the chunking decisions by providing autoregressiveness at each step. Furthermore, we make a case for exploring self-supervised learning objectives for unsupervised chunking. Finally, we discuss our attempt to transfer knowledge from chunking back to parsing in an unsupervised setting. We conduct comprehensive experiments on three datasets: CoNLL-2000 (English), CoNLL-2003 (German), and the English Web Treebank. Results show that our HRNN improves upon the teacher model (Compound PCFG) in terms of both phrase F1 and tag accuracy, which indicates that it smooths out the noise in the induced chunk labels and accurately captures the chunking patterns.
We evaluate different chunking heuristics and show that maximal left-branching performs the best, reinforcing the observation that left-branching structures indicate closely related words. We also present a rigorous analysis of the HRNN's architecture and discuss the performance of vanilla recurrent neural networks.	en
dc.identifier.uri	http://hdl.handle.net/10012/17347
dc.language.iso	en	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.subject	machine learning	en
dc.subject	natural language processing	en
dc.title	Unsupervised Syntactic Structure Induction in Natural Language Processing	en
dc.type	Master Thesis	en
uws-etd.degree	Master of Mathematics	en
uws-etd.degree.department	David R. Cheriton School of Computer Science	en
uws-etd.degree.discipline	Computer Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	0	en
uws.contributor.advisor	Li, Ming
uws.contributor.advisor	Lin, Jimmy
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en

Files

Original bundle

Name: Deshmukh_Anup-Anand.pdf
Size: 611.39 KB
Format: Adobe Portable Document Format

Collections

Theses
Computer Science