Enhancing Large Language Model Fine-Tuning for Classification Using Conditional Mutual Information


Date

2025-04-16

Advisor

Yang, En-Hui

Publisher

University of Waterloo

Abstract

Large language models (LLMs) have achieved impressive advances in recent years, showcasing their versatility and effectiveness in tasks such as natural language understanding, generation, and translation. Despite these advances, the potential of information theory (IT) to further enhance LLM development remains largely unexplored. This thesis aims to bridge this gap by introducing the information-theoretic concept of conditional mutual information (CMI) and applying it to the fine-tuning of LLMs for classification tasks. We explore the promise of CMI in two primary ways: minimizing CMI to optimize a model's standalone performance, and maximizing CMI to improve knowledge distillation (KD) and produce more capable student models. To implement CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, initially developed for image classification, with the modifications LLMs require. Our experiments apply CMI to LLM fine-tuning and knowledge distillation on the GLUE benchmark, a widely used suite of classification tasks for evaluating language models. By minimizing CMI during fine-tuning, we outperform the baseline BERT model on 6 of 8 GLUE classification tasks. Furthermore, we use CMI to maximize information transfer during KD, where a smaller "student" model is trained to mimic the behavior of a larger, more powerful "teacher" model. By maximizing the teacher's CMI, richer semantic information is passed to the student, improving its performance. Our results show that maximizing CMI during KD yields substantial improvements over DistilBERT, a popular distilled version of BERT, on 6 of 8 GLUE classification tasks.
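
As a rough illustration of the CMI quantity the abstract refers to (not the thesis's exact implementation): in CMI-constrained classification frameworks, CMI is typically the conditional mutual information I(X; Yhat | Y) between the input and the model's output given the ground-truth label, which can be estimated on a batch as the label-weighted average KL divergence between each sample's output distribution P(Yhat | X) and the average output distribution of its class. The PyTorch sketch below shows such a batch-level estimator; the function name empirical_cmi, the centroid-based estimator, and the way the term is combined with cross-entropy are illustrative assumptions, not the thesis's code.

import torch
import torch.nn.functional as F

def empirical_cmi(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Batch estimate of I(X; Yhat | Y) from classifier outputs.

    logits: (N, C) raw model outputs for N examples.
    labels: (N,) ground-truth class indices.
    For each class y, the centroid Q_y is the mean of the per-sample output
    distributions P(Yhat | x) with label y; the estimate is the label-weighted
    mean KL divergence between each sample's distribution and its centroid.
    """
    probs = F.softmax(logits, dim=-1)           # P(Yhat | x) per sample
    cmi = logits.new_zeros(())
    n = labels.numel()
    for y in labels.unique():
        p_y = probs[labels == y]                # samples with label y
        q_y = p_y.mean(dim=0, keepdim=True)     # class centroid Q_y
        kl = (p_y * (p_y.clamp_min(1e-12).log()
                     - q_y.clamp_min(1e-12).log())).sum(dim=-1)
        cmi = cmi + kl.sum() / n                # weights by empirical P(Y = y)
    return cmi

# Illustrative fine-tuning objective: cross-entropy plus a weighted CMI penalty
# when minimizing CMI; a teacher-side term of the opposite sign would instead
# encourage larger CMI for the KD setting described above.
# loss = F.cross_entropy(logits, labels) + lam * empirical_cmi(logits, labels)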

Keywords

LLMs, machine learning, NLP, information theory
