Large Data-to-Text Generation

Sarangian, Varnan

Large Data-to-Text Generation

Files

Sarangian_Varnan.pdf (6.94 MB)

Date

2023-05-16

Authors

Sarangian, Varnan

Advisor

Chenouri, Shoja'eddin

Publisher

University of Waterloo

Abstract

This thesis presents a domain-driven approach to sports game summarization, a specific instance of large data-to-text generation (DTG). We first address the data fidelity issue in the Rotowire dataset by supplementing existing input records and demonstrating larger relative improvements compared to previously proposed purification schemes. As this method further increases the total number of input records, we alternatively formulate this problem as a multimodal problem (i.e. visual data-to-text), discussing potential advantages over purely textual approaches and studying its effectiveness for future expansion. We work exclusively with pre-trained end-to-end transformers throughout, allowing us to evaluate the efficacy of sparse attention and multimodal encoder-decoders in DTG and providing appropriate benchmarks for future work. To automatically evaluate the statistical correctness of generated summaries, we also extend prior work on automatic relation extraction and build an updated pipeline that incorporates low amounts of human-annotated data which are quickly inflated via data augmentation. By formulating this in a ”text-to-text” fashion, we are able to take advantage of LLMs and achieve significantly higher precision and recall than previous methods while tracking three times the number of unique relations. Our updated models are more consistent and reliable by incorporating human-verified data partitions into the training and evaluation process.

URI

http://hdl.handle.net/10012/19451

Collections

Theses
Statistics and Actuarial Science

Full item page

Large Data-to-Text Generation

Files

Date

Authors

Advisor

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Keywords

LC Subject Headings

Citation

URI

Collections