Enhancing Large Language Model Fine-Tuning for Classification Using Conditional Mutual Information


Date

2025-04-16

Advisor

Yang, En-Hui

Publisher

University of Waterloo

Abstract

Large language models (LLMs) have achieved impressive advances in recent years, showcasing their versatility and effectiveness in tasks such as natural language understanding, generation, and translation. Despite these advances, the potential of information theory (IT) to further enhance LLM development remains largely unexplored. This thesis aims to bridge this gap by introducing the information-theoretic concept of conditional mutual information (CMI) and applying it to the fine-tuning of LLMs for classification tasks. We explore the promise of CMI in two primary ways: minimizing CMI to optimize a model's standalone performance, and maximizing CMI to improve knowledge distillation (KD) and produce more capable student models. To implement CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, initially developed for image classification, with the modifications LLMs require. Our experiments apply CMI to LLM fine-tuning and knowledge distillation on the GLUE benchmark, a widely used suite of classification tasks for evaluating language models. By minimizing CMI during fine-tuning, we outperform the baseline BERT model on 6 of 8 GLUE classification tasks. Furthermore, we use CMI to maximize information transfer during KD, where a smaller "student" model is trained to mimic the behavior of a larger, more powerful "teacher" model. By maximizing the teacher's CMI, richer semantic information is passed to the student, improving its performance. Our results show that maximizing CMI during KD yields substantial improvements over DistilBERT, a popular distilled version of BERT, on 6 of 8 GLUE classification tasks.
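
As a rough illustration of the CMI quantity the abstract refers to (not the thesis's exact implementation): in CMI-constrained classification frameworks, CMI is typically the conditional mutual information I(X; Yhat | Y) between the input and the model's output given the ground-truth label, which can be estimated on a batch as the label-weighted average KL divergence between each sample's output distribution P(Yhat | X) and the average output distribution of its class. The PyTorch sketch below shows such a batch-level estimator; the function name empirical_cmi, the centroid-based estimator, and the way the term is combined with cross-entropy are illustrative assumptions, not the thesis's code.

import torch
import torch.nn.functional as F

def empirical_cmi(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Batch estimate of I(X; Yhat | Y) from classifier outputs.

    logits: (N, C) raw model outputs for N examples.
    labels: (N,) ground-truth class indices.
    For each class y, the centroid Q_y is the mean of the per-sample output
    distributions P(Yhat | x) with label y; the estimate is the label-weighted
    mean KL divergence between each sample's distribution and its centroid.
    """
    probs = F.softmax(logits, dim=-1)           # P(Yhat | x) per sample
    cmi = logits.new_zeros(())
    n = labels.numel()
    for y in labels.unique():
        p_y = probs[labels == y]                # samples with label y
        q_y = p_y.mean(dim=0, keepdim=True)     # class centroid Q_y
        kl = (p_y * (p_y.clamp_min(1e-12).log()
                     - q_y.clamp_min(1e-12).log())).sum(dim=-1)
        cmi = cmi + kl.sum() / n                # weights by empirical P(Y = y)
    return cmi

# Illustrative fine-tuning objective: cross-entropy plus a weighted CMI penalty
# when minimizing CMI; a teacher-side term of the opposite sign would instead
# encourage larger CMI for the KD setting described above.
# loss = F.cross_entropy(logits, labels) + lam * empirical_cmi(logits, labels)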

Keywords

LLMs, machine learning, NLP, information theory
