Enhancing Large Language Model Fine-Tuning for Classification Using Conditional Mutual Information
Date
2025-04-16
Authors
Advisor
Yang, En-Hui
Publisher
University of Waterloo
Abstract
Large language models (LLMs) have achieved impressive advancements in recent years, showcasing their versatility and effectiveness in tasks such as natural language understanding, generation, and translation. Despite these advancements, the potential of information theory (IT) to further enhance the development of LLMs has yet to be fully explored. This thesis aims to bridge this gap by introducing the information-theoretic concept of Conditional Mutual Information (CMI) and applying it to the fine-tuning of LLMs for classification tasks. We explore the promise of CMI in two primary ways: minimizing CMI to optimize a model's standalone performance, and maximizing CMI to improve knowledge distillation (KD) and produce more capable student models.
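For readers unfamiliar with the quantity, a minimal sketch follows; the notation (X for the input, Y for the ground-truth label, and Ŷ for the model's soft prediction) is assumed here and not quoted from the thesis.

```latex
% Sketch of the conditional mutual information used as a regularizer;
% X: input, Y: ground-truth label, \hat{Y}: the model's soft prediction
% (notation assumed, not quoted from the thesis).
\[
  I(X;\hat{Y}\mid Y)
    \;=\; \mathbb{E}_{Y}\Big[\,
        \mathbb{E}_{X\mid Y}\big[
          D_{\mathrm{KL}}\big( P_{\hat{Y}\mid X} \,\big\|\, P_{\hat{Y}\mid Y} \big)
        \big]\Big].
\]
% A CMI-constrained fine-tuning objective can then be written in
% Lagrangian form, with \lambda > 0 for CMI minimization and the sign
% of the CMI term flipped when the teacher's CMI is to be maximized:
\[
  \min_{\theta}\;
    \mathbb{E}\big[\ell_{\mathrm{CE}}\big(\hat{Y},\,Y\big)\big]
    \;+\; \lambda\, I\big(X;\hat{Y}\mid Y\big).
\]
```

In this reading, minimizing the CMI term pulls each example's output distribution toward its class centroid, while maximizing it keeps the outputs more informative about the input within each class, which is the property exploited for distillation below.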
To implement CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, originally developed for image classification, with the modifications necessary for LLMs. In our experiments, we apply CMI to LLM fine-tuning and knowledge distillation on the GLUE benchmark, a widely used suite of classification tasks for evaluating language models. By minimizing CMI during fine-tuning, we achieve superior performance on 6 out of 8 GLUE classification tasks compared to the baseline BERT model. We further use CMI to maximize information transfer during the KD process, in which a smaller "student" model is trained to mimic the behavior of a larger, more powerful "teacher" model. By maximizing the teacher's CMI, richer semantic information is passed to the student, improving its performance. Our results show that maximizing CMI during KD yields substantial improvements on 6 out of 8 GLUE classification tasks compared to DistilBERT, a popular distilled version of BERT.
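As an illustration only (not code from the thesis), the snippet below shows one way a per-batch estimate of I(X; Ŷ | Y) could be computed from a classifier's softmax outputs and folded into a fine-tuning loss; the function name, the λ weighting, and the PyTorch setting are all assumptions of this sketch.

```python
import torch

def empirical_cmi(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Rough per-batch estimate of I(X; Y_hat | Y) for a classifier.

    probs:  (N, C) softmax outputs, one distribution P(Y_hat | X) per example.
    labels: (N,) ground-truth class indices.
    For each class c, the centroid P(Y_hat | Y = c) is the mean of the
    per-example distributions with that label; the estimate is the
    label-weighted average KL divergence from each example to its centroid.
    """
    n = probs.shape[0]
    cmi = probs.new_zeros(())
    for c in labels.unique():
        p_c = probs[labels == c]                    # outputs for class c
        centroid = p_c.mean(dim=0, keepdim=True)    # estimate of P(Y_hat | Y = c)
        kl = (p_c * (p_c.clamp_min(1e-12).log()
                     - centroid.clamp_min(1e-12).log())).sum(dim=1)
        cmi = cmi + kl.sum() / n                    # weight classes by frequency
    return cmi

# Hypothetical usage inside a fine-tuning step:
#   loss = ce_loss + lam * empirical_cmi(probs, labels)   # minimize CMI
#   loss = ce_loss - lam * empirical_cmi(probs, labels)   # maximize teacher CMI
```

The sign of the λ term is the only difference between the two uses described in the abstract: a positive weight concentrates class-conditional outputs for standalone accuracy, while a negative weight preserves richer per-example information in a teacher before distillation.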
Keywords
LLMs, machine learning, NLP, information theory