Enhancing Large Language Model Fine-Tuning for Classification Using Conditional Mutual Information

dc.contributor.author: Sivakaran, Thanushon
dc.date.accessioned: 2025-04-16T13:52:28Z
dc.date.available: 2025-04-16T13:52:28Z
dc.date.issued: 2025-04-16
dc.date.submitted: 2025-04-15
dc.description.abstract: Large language models (LLMs) have achieved impressive advancements in recent years, showcasing their versatility and effectiveness in tasks such as natural language understanding, generation, and translation. Despite these advancements, the potential of information theory (IT) to further enhance the development of LLMs has yet to be fully explored. This thesis aims to bridge this gap by introducing the information-theoretic concept of Conditional Mutual Information (CMI) and applying it to the fine-tuning of LLMs for classification tasks. We explore the promise of CMI in two primary ways: minimizing CMI to optimize a model's standalone performance, and maximizing CMI to improve knowledge distillation (KD) and produce more capable student models. To implement CMI in LLM fine-tuning, we adapt the recently proposed CMI-constrained deep learning framework, originally developed for image classification, with the modifications necessary for LLMs. Our experiments apply CMI to LLM fine-tuning and knowledge distillation on the GLUE benchmark, a widely used suite of classification tasks for evaluating language models. By minimizing CMI during fine-tuning, we achieve superior performance on 6 of the 8 GLUE classification tasks compared to the baseline BERT model. We further explore the use of CMI to maximize information transfer during KD, where a smaller "student" model is trained to mimic the behavior of a larger, more powerful "teacher" model. By maximizing the teacher's CMI, richer semantic information is passed to the student, improving its performance. Our results show that maximizing CMI during KD yields substantial improvements on 6 of the 8 GLUE classification tasks compared to DistilBERT, a popular distilled version of BERT.
dc.identifier.uri: https://hdl.handle.net/10012/21594
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.subject: LLMs
dc.subject: machine learning
dc.subject: NLP
dc.subject: information theory
dc.title: Enhancing Large Language Model Fine-Tuning for Classification Using Conditional Mutual Information
dc.type: Master Thesis
uws-etd.degree: Master of Applied Science
uws-etd.degree.department: Electrical and Computer Engineering
uws-etd.degree.discipline: Electrical and Computer Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Yang, En-Hui
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text
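
The abstract above refers to Conditional Mutual Information (CMI) both as a quantity to be minimized during fine-tuning and as one to be maximized on the teacher side of knowledge distillation. As a point of reference only, the sketch below gives a standard formulation of CMI for a classifier and one plausible Lagrangian reading of the constrained fine-tuning objective; the symbols X (input), Y (ground-truth label), \hat{Y} (the model's soft output), \theta (model parameters), and the weight \lambda are illustrative assumptions rather than notation taken from the thesis itself.

\[
I\big(X; \hat{Y} \mid Y\big)
  = \mathbb{E}_{Y}\!\left[ D_{\mathrm{KL}}\!\left( P_{X,\hat{Y}\mid Y} \,\big\Vert\, P_{X\mid Y}\, P_{\hat{Y}\mid Y} \right) \right]
\]

% One hedged reading of "minimizing CMI during fine-tuning": trade off the usual
% cross-entropy loss against the CMI term with an assumed multiplier \lambda > 0.
\[
\min_{\theta} \;\; \mathbb{E}\!\left[ \ell_{\mathrm{CE}}\big(\hat{Y}, Y\big) \right]
  \; + \; \lambda \, I\big(X; \hat{Y} \mid Y\big)
\]

Under this reading, a small CMI concentrates the model's output distributions within each class, while the KD setting described in the abstract would instead train the teacher with the sign of the CMI term flipped (maximizing its CMI) before distilling its outputs into the student; the precise objectives used in the thesis may differ.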

Files

Original bundle

Name: Sivakaran_Thanushon.pdf
Size: 3.39 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed to upon submission