Efficient Learning for Large Language Models

dc.contributor.author: Rajabzadeh, Hossein
dc.date.accessioned: 2026-01-20T18:27:29Z
dc.date.available: 2026-01-20T18:27:29Z
dc.date.issued: 2026-01-20
dc.date.submitted: 2026-01-16
dc.description.abstract: Artificial Intelligence (AI) systems have become indispensable across domains such as healthcare, finance, robotics, and scientific discovery. At the heart of this revolution, Large Language Models (LLMs) have emerged as the central paradigm, demonstrating remarkable reasoning, generalization, and multi-domain adaptability. However, their rapid growth in scale introduces severe computational bottlenecks in training, fine-tuning, and inference, limiting accessibility, sustainability, and real-world deployment. This dissertation advances the efficiency of LLMs across all lifecycle stages by introducing a suite of five frameworks that significantly reduce compute, memory, and latency costs with minimal or no loss in accuracy. First, Quantized Dynamic Low-Rank Adaptation (QDyLoRA) enables memory-efficient fine-tuning across multiple LoRA ranks in a single training pass, achieving performance competitive with QLoRA while reducing GPU memory usage by up to 65% and supporting flexible rank selection at inference time. Second, Sorted-LoRA introduces a stochastic-depth-aware fine-tuning framework that co-trains multiple sub-models of varying depths within a single training cycle. On LLaMA2-7B, it produces sub-models up to 40% smaller that retain over 98% of task accuracy, with the largest variant even surpassing the base model by +0.34%. Third, LoRA-Drop accelerates autoregressive inference by dynamically substituting computationally redundant layers with lightweight low-rank modules during decoding. It delivers up to 2.6× faster decoding and a 50% reduction in KV-cache memory with less than 0.5% degradation in accuracy, offering latency-aware adaptability for real-world deployment. Fourth, EchoAtt exploits redundancy in attention maps by sharing attention matrices among similar layers. On TinyLLaMA-1.1B, it achieves 15% faster inference, 25% faster training, and a 4% parameter reduction while improving zero-shot accuracy, showing that structural compression can enhance rather than degrade model generalization. Finally, ECHO-LLaMA introduces cross-layer Key-Value (KV) and Query-Key (QK) sharing to reduce redundant attention computation. This approach achieves up to 77% higher tokens-per-second throughput during training, 16% higher Model FLOPs Utilization (MFU), and 7% higher test-time throughput, while preserving language modeling performance. On the mechanical-domain RoboEval benchmark, ECHO-CodeLLaMA-7B raises average accuracy from 62.15% to 63.01% with only 50% KV sharing, confirming its robustness under domain adaptation. Together, these contributions form a coherent research program on the efficiency of large-scale Transformers. They demonstrate that intelligently exploiting representational redundancy through quantization, low-rank structure, cross-layer sharing, and adaptive computation can yield substantial compute savings with minimal trade-offs.
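
Note: as a purely illustrative sketch, and not code from the thesis, the PyTorch snippet below shows how a LoRA-style adapter can expose a selectable rank at inference time, the kind of flexibility the abstract attributes to QDyLoRA. The class name, shapes, scaling, and initialization here are assumptions, and the 4-bit quantization of the frozen base weights is omitted.

import torch
import torch.nn as nn

class DynamicRankLoRALinear(nn.Module):
    """Hypothetical LoRA-style layer whose low-rank update can be truncated to any rank."""
    def __init__(self, in_features, out_features, max_rank=8, alpha=16.0):
        super().__init__()
        # Frozen base projection (in QDyLoRA this would be the quantized pretrained weight).
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)
        # Trainable low-rank factors: A is (max_rank x in), B is (out x max_rank).
        self.A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, max_rank))
        self.scale = alpha / max_rank
        self.max_rank = max_rank

    def forward(self, x, rank=None):
        # Using only the first r components yields a valid rank-r adapter,
        # so one trained module can serve several ranks at inference time.
        r = self.max_rank if rank is None else min(rank, self.max_rank)
        delta = (x @ self.A[:r].T) @ self.B[:, :r].T
        return self.base(x) + self.scale * delta

# Example: the same trained adapter queried at two different ranks.
layer = DynamicRankLoRALinear(64, 64, max_rank=8)
x = torch.randn(2, 64)
y_full = layer(x, rank=8)   # full-rank update
y_small = layer(x, rank=2)  # cheaper, truncated update

In the framework described by the abstract, the rank used at each training step is also varied so that all truncations are trained jointly in a single pass; that training loop is not shown here.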
dc.identifier.uri: https://hdl.handle.net/10012/22858
dc.language.iso: en
dc.pending: false
dc.publisher: University of Waterloo
dc.title: Efficient Learning for Large Language Models
dc.type: Doctoral Thesis
uws-etd.degree: Doctor of Philosophy
uws-etd.degree.department: Mechanical and Mechatronics Engineering
uws-etd.degree.discipline: Mechanical Engineering
uws-etd.degree.grantor: University of Waterloo
uws-etd.embargo.terms: 0
uws.contributor.advisor: Kwon, Hyock Ju
uws.contributor.affiliation1: Faculty of Engineering
uws.peerReviewStatus: Unreviewed
uws.published.city: Waterloo
uws.published.country: Canada
uws.published.province: Ontario
uws.scholarLevel: Graduate
uws.typeOfResource: Text

Files

Original bundle

Name: Rajabzadeh_Hossein.pdf
Size: 9.66 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 6.4 KB
Format: Item-specific license agreed to upon submission
