Author: Neeb, Mikayla
Dates: 2026-02-19; 2026-02-19; 2026-02-19; 2026-01-16
URI: https://hdl.handle.net/10012/22945

Abstract:

Introduction: Large language models (LLMs) are increasingly used to support qualitative research, yet robust methods to evaluate the quality of LLM-generated codes remain underdeveloped. Existing approaches often rely on comparisons to human ground truth or custom evaluative methods, limiting cross-study comparisons. This study examines whether LLMs can function as assistive qualitative coding agents and introduces the CReDS framework as a structured approach to evaluating LLM-generated codes without the need for a comparative codebase.

Methods: Two social media datasets were employed as validation sets to systematically develop and test approaches for evaluating LLM-generated inductive codes. Codes were generated using GPT-4o-mini and assessed through an iterative evaluation process. Initial assessment relied on conventional quantitative similarity metrics (e.g., cosine similarity); however, limitations in capturing qualitative distinctions prompted the incorporation of structured human evaluation. This process led to the development of the CReDS framework, comprising Consistency, Relevance, Distinction, and Specificity, as a more comprehensive evaluative method. Targeted exploratory analyses then examined evaluative performance under specific conditions, extending the evaluation methods developed in this study.

Results: LLM-generated codes aligned closely with human codes across both datasets, with overall semantic match rates ranging from 74% to 83%. At the text level, between 65% and 95% of inputs had at least one LLM-generated code judged appropriate by human reviewers. CReDS scores revealed strong alignment with human-generated codes, with substantial overlap across all dimensions. However, LLM-generated codes showed reduced specificity, and the CReDS framework exhibited conservative scoring behaviour. Despite these limitations, CReDS effectively surfaced systematic strengths and weaknesses in LLM outputs.

Conclusions: These findings indicate that LLMs can reliably support early-stage qualitative coding when used as assistive tools under human oversight. The CReDS framework offers a transparent and scalable method for evaluating LLM-generated codes that aligns with qualitative principles while supporting iterative model development. This study contributes to a measurable and scalable platform for responsible human-AI collaboration in qualitative analysis and highlights directions for refining evaluation frameworks in future work.

Language: en
Keywords: qualitative research; qualitative coding; evaluation methodology; human-AI collaboration; qualitative analysis; large language models (LLMs)
Title: Use of Large Language Models (LLMs) in Qualitative Analysis: Evaluating LLMs as Assistive Coding Agents
Type: Master Thesis
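
Note on the similarity metric mentioned in the Methods section: the abstract states that cosine similarity was used as an initial quantitative check of how closely LLM-generated codes match human codes. The Python sketch below is a minimal illustration of such a check, not the study's actual pipeline; the embedding model (all-MiniLM-L6-v2 via sentence-transformers), the example code lists, and the 0.6 match threshold are all illustrative assumptions.

# Minimal sketch: semantic matching of LLM-generated codes to human codes
# via cosine similarity. Embedding model, example codes, and threshold are
# assumptions for illustration; the thesis does not specify these here.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

human_codes = ["vaccine hesitancy", "trust in public health officials"]      # hypothetical
llm_codes = ["distrust of health authorities", "concerns about vaccine safety"]  # hypothetical

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
human_emb = model.encode(human_codes)
llm_emb = model.encode(llm_codes)

# Pairwise similarity matrix: rows = LLM codes, columns = human codes.
sims = cosine_similarity(llm_emb, human_emb)

THRESHOLD = 0.6  # illustrative cut-off for counting a "semantic match"
for i, code in enumerate(llm_codes):
    j = sims[i].argmax()
    matched = sims[i, j] >= THRESHOLD
    print(f"{code!r} -> best match {human_codes[j]!r} "
          f"(cosine={sims[i, j]:.2f}, match={matched})")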