Use of Large Language Models (LLMs) in Qualitative Analysis: Evaluating LLMs as Assistive Coding Agents
| dc.contributor.author | Neeb, Mikayla | |
| dc.date.accessioned | 2026-02-19T14:12:50Z | |
| dc.date.available | 2026-02-19T14:12:50Z | |
| dc.date.issued | 2026-02-19 | |
| dc.date.submitted | 2026-01-16 | |
| dc.description.abstract | Introduction: Large language models (LLMs) are increasingly used to support qualitative research, yet robust methods to evaluate the quality of LLM-generated codes remain underdeveloped. Existing approaches often rely on comparisons to human ground truth or custom evaluative methods, limiting cross-study comparisons. This study examines whether LLMs can function as assistive qualitative coding agents and introduces the CReDS framework as a structured approach to evaluating LLM-generated codes without the need for a comparative codebase. Methods: Two social media datasets were employed as validation sets to systematically develop and test approaches for evaluating LLM-generated inductive codes. Codes were generated using GPT-4o-mini and assessed through an iterative evaluation process. Initial assessment relied on conventional quantitative similarity metrics (e.g., cosine similarity); however, limitations in capturing qualitative distinctions prompted the incorporation of structured human evaluation. This process led to the development of the CReDS framework, comprising Consistency, Relevance, Distinction, and Specificity, as a more comprehensive evaluative method. Targeted exploratory analyses then examined evaluative performance under specific conditions to further probe the evaluation methods developed in this study. Results: LLM-generated codes aligned closely with human codes across both datasets, with overall semantic match rates ranging from 74% to 83%. At the text level, 65% to 95% of inputs had at least one LLM-generated code judged appropriate by human reviewers. CReDS scores revealed strong alignment with human-generated codes, with substantial overlap across all dimensions. However, LLM-generated codes showed reduced specificity, and the CReDS framework exhibited conservative scoring behaviour. Despite these limitations, CReDS effectively surfaced systematic strengths and weaknesses in LLM outputs. Conclusions: These findings indicate that LLMs can reliably support early-stage qualitative coding when used as assistive tools under human oversight. The CReDS framework offers a transparent and scalable method for evaluating LLM-generated codes that aligns with qualitative principles while supporting iterative model development. This study contributes a measurable and scalable foundation for responsible human-AI collaboration in qualitative analysis and highlights directions for refining evaluation frameworks in future work. | |
| dc.identifier.uri | https://hdl.handle.net/10012/22945 | |
| dc.language.iso | en | |
| dc.pending | false | |
| dc.publisher | University of Waterloo | en |
| dc.subject | qualitative research | |
| dc.subject | qualitative coding | |
| dc.subject | evaluation methodology | |
| dc.subject | human-AI collaboration | |
| dc.subject | qualitative analysis | |
| dc.subject | large language models (LLMs) | |
| dc.title | Use of Large Language Models (LLMs) in Qualitative Analysis: Evaluating LLMs as Assistive Coding Agents | |
| dc.type | Master Thesis | |
| uws-etd.degree | Master of Science | |
| uws-etd.degree.department | School of Public Health Sciences | |
| uws-etd.degree.discipline | Public Health Sciences | |
| uws-etd.degree.grantor | University of Waterloo | en |
| uws-etd.embargo.terms | 0 | |
| uws.contributor.advisor | Chen, Helen | |
| uws.contributor.affiliation1 | Faculty of Health | |
| uws.peerReviewStatus | Unreviewed | en |
| uws.published.city | Waterloo | en |
| uws.published.country | Canada | en |
| uws.published.province | Ontario | en |
| uws.scholarLevel | Graduate | en |
| uws.typeOfResource | Text | en |
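
The abstract notes that initial assessment of LLM-generated codes relied on cosine similarity before structured human evaluation was introduced. The sketch below is only a minimal illustration of that kind of semantic-match check; the embedding model (`all-MiniLM-L6-v2` via `sentence-transformers`), the 0.7 match threshold, and the example codes are assumptions for illustration, not details taken from the thesis.

```python
# Illustrative sketch only: matching LLM-generated codes to human codes via
# cosine similarity over sentence embeddings. Model choice and threshold are
# assumptions, not details from the record above.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_human_match(llm_code: str, human_codes: list[str]) -> tuple[str, float]:
    """Return the human code most semantically similar to an LLM-generated code."""
    llm_vec = model.encode(llm_code)
    human_vecs = model.encode(human_codes)
    scores = [cosine_similarity(llm_vec, vec) for vec in human_vecs]
    idx = int(np.argmax(scores))
    return human_codes[idx], scores[idx]

# Hypothetical usage: count an LLM code as a semantic match if similarity >= 0.7
llm_code = "distrust of health authorities"
human_codes = ["mistrust of public health institutions", "vaccine access barriers"]
match, score = best_human_match(llm_code, human_codes)
print(f"best match: {match!r} (cosine similarity = {score:.2f}, match = {score >= 0.7})")
```

As the abstract observes, a purely quantitative check like this can miss qualitative distinctions between codes, which is what motivated the structured human evaluation and the CReDS dimensions.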