Uncovering the Reliability and Consistency of AI Language Models: A Systematic Study
dc.contributor.author | Khatun, Aisha | |
dc.date.accessioned | 2024-08-22T14:12:00Z | |
dc.date.available | 2024-08-22T14:12:00Z | |
dc.date.issued | 2024-08-22 | |
dc.date.submitted | 2024-08-14 | |
dc.description.abstract | Large Language Models (LLMs) have rapidly advanced, becoming general-purpose assistants and creative partners. Despite their widespread use, LLMs exhibit significant vulnerabilities to prompt variations and struggle with task understanding, leading to inconsistencies and factual inaccuracies in their responses. Traditional Natural Language Processing (NLP) benchmarks often overlook nuances in LLM behavior and reliability. This thesis addresses this gap by curating a dataset spanning six categories: Fact, Conspiracy, Controversy, Misconception, Stereotype, and Fiction. We rigorously define LLMs' factual accuracy, consistency, and robustness to prompt variations using diverse response formats and question variations, and evaluate these properties across 37 models. Our findings reveal LLMs' volatility and unreliability, particularly in the Controversy and Misconception categories, where conflicting training data impedes performance. Additionally, we explore LLMs' ability to generate coherent fictional narratives, probing their capacity to retain and effectively utilize factual information, a critical requirement for creative tasks such as story generation. While LLMs are versatile and widely applicable, their reliability hinges on addressing challenges in prompt understanding and response consistency, underscoring the need for ongoing research to enhance their performance across diverse tasks and applications. | |
dc.identifier.uri | https://hdl.handle.net/10012/20847 | |
dc.language.iso | en | |
dc.pending | false | |
dc.publisher | University of Waterloo | en |
dc.relation.uri | https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP3/5MZWBV | |
dc.relation.uri | https://github.com/tanny411/llm-reliability-and-consistency-evaluation | |
dc.subject | Large Language Model | |
dc.subject | Computational Creativity | |
dc.subject | Story Generation | |
dc.subject | Consistency | |
dc.subject | Robustness | |
dc.subject | LLM | |
dc.subject | AI | |
dc.subject | GPT-3 | |
dc.subject | GPT-4 | |
dc.subject | MCQ | |
dc.subject | Multiple Choice Question | |
dc.subject | Dataset | |
dc.subject | Factual Accuracy | |
dc.subject | Artificial Intelligence | |
dc.subject | NLP | |
dc.subject | Natural Language Processing | |
dc.title | Uncovering the Reliability and Consistency of AI Language Models: A Systematic Study | |
dc.type | Master Thesis | |
uws-etd.degree | Master of Mathematics | |
uws-etd.degree.department | David R. Cheriton School of Computer Science | |
uws-etd.degree.discipline | Computer Science | |
uws-etd.degree.grantor | University of Waterloo | en |
uws-etd.embargo.terms | 0 | |
uws.contributor.advisor | Brown, Dan | |
uws.contributor.affiliation1 | Faculty of Mathematics | |
uws.peerReviewStatus | Unreviewed | en |
uws.published.city | Waterloo | en |
uws.published.country | Canada | en |
uws.published.province | Ontario | en |
uws.scholarLevel | Graduate | en |
uws.typeOfResource | Text | en |