Efficient Learning for Large Language Models

Advisor

Kwon, Hyock Ju

Publisher

University of Waterloo

Abstract

Artificial Intelligence (AI) systems have become indispensable across domains such as healthcare, finance, robotics, and scientific discovery. At the heart of this revolution, Large Language Models (LLMs) have emerged as the central paradigm, demonstrating remarkable reasoning, generalization, and multi-domain adaptability. However, their exponential growth in scale introduces severe computational bottlenecks in training, fine-tuning, and inference, limiting accessibility, sustainability, and real-world deployment. This dissertation advances the efficiency of LLMs across all lifecycle stages by introducing a suite of five frameworks that significantly reduce compute, memory, and latency costs with minimal or no loss in accuracy. First, Quantized Dynamic Low-Rank Adaptation (QDyLoRA) enables memory-efficient fine-tuning across multiple LoRA ranks in a single training pass, achieving performance competitive with QLoRA while reducing GPU memory usage by up to 65% and supporting flexible rank selection at inference time. Second, Sorted-LoRA introduces a stochastic depth-aware fine-tuning framework that co-trains multiple sub-models of varying depths within a single cycle. On LLaMA2-7B, it produces sub-models up to 40% smaller that retain over 98% task accuracy, with the largest variant even surpassing the base model by +0.34%. Third, LoRA-Drop accelerates autoregressive inference by dynamically substituting computationally redundant layers with lightweight low-rank modules during decoding. It delivers up to 2.6× faster decoding and a 50% reduction in KV-cache memory with less than 0.5% degradation in accuracy, offering latency-aware adaptability for real-world deployment. Fourth, EchoAtt exploits redundancy in attention maps by sharing attention matrices among similar layers. On TinyLLaMA-1.1B, it achieves 15% faster inference, 25% faster training, and a 4% parameter reduction while improving zero-shot accuracy, highlighting that structural compression can enhance rather than degrade model generalization. Finally, ECHO-LLaMA introduces cross-layer Key-Value (KV) and Query-Key (QK) sharing to reduce redundant attention computation. This approach achieves up to 77% higher token-per-second throughput during training, 16% higher Model FLOPs Utilization (MFU), and 7% higher test-time throughput, while preserving language modeling performance. On the mechanical-domain RoboEval benchmark, ECHO-CodeLLaMA-7B boosts average accuracy from 62.15% to 63.01% with only 50% KV sharing, confirming its robustness in domain adaptation. Together, these contributions form a coherent research program on the efficiency of large-scale Transformers. They demonstrate that intelligently exploiting representational redundancy, whether through quantization, low-rank structure, cross-layer sharing, or adaptive computation, can yield substantial compute savings with minimal trade-offs.
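
As an illustration of the rank-dynamic fine-tuning idea behind QDyLoRA, the following PyTorch sketch samples the effective LoRA rank at each training step so that a single fine-tuning pass yields adapters usable at several ranks. The class name, rank grid, and scaling factor are illustrative assumptions, and the 4-bit quantization of the frozen base weights is omitted; this is a minimal sketch, not the dissertation's implementation.

# Minimal sketch of rank-dynamic LoRA fine-tuning (names and rank grid are illustrative).
import random
import torch
import torch.nn as nn

class DynamicRankLoRALinear(nn.Module):
    """A frozen linear layer plus a LoRA adapter whose effective rank is
    chosen per step, so truncations of one trained adapter serve as
    smaller-rank adapters at inference time."""
    def __init__(self, base: nn.Linear, max_rank=64, ranks=(8, 16, 32, 64), alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # base weights stay frozen (quantization omitted here)
        self.ranks, self.alpha, self.active_rank = ranks, alpha, max_rank
        self.lora_A = nn.Parameter(0.01 * torch.randn(max_rank, base.in_features))  # down-projection
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, max_rank))        # up-projection, zero init

    def sample_rank(self):
        self.active_rank = random.choice(self.ranks)   # one rank per training step

    def forward(self, x):
        r = self.active_rank
        # Only the leading r rows/columns of A and B participate at this step.
        delta = (x @ self.lora_A[:r].T) @ self.lora_B[:, :r].T
        return self.base(x) + (self.alpha / r) * delta

# Example: train with randomly sampled ranks, then fix any supported rank for inference.
layer = DynamicRankLoRALinear(nn.Linear(4096, 4096))
layer.sample_rank()
out = layer(torch.randn(2, 4096))
layer.active_rank = 16                                 # rank selected at inference time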
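
Similarly, the depth-aware co-training behind Sorted-LoRA can be sketched as sampling a sub-model depth per training step, so that truncated layer stacks are optimized alongside the full model within one cycle. The toy model size, depth grid, and plain next-token loss below are assumptions for illustration; causal masking and the LoRA adapters themselves are omitted.

# Minimal sketch of depth-aware co-training of sub-models (sizes and depth grid are assumptions).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthSampledLM(nn.Module):
    """A toy Transformer language model whose truncated layer stacks are
    trained to be usable sub-models in their own right."""
    def __init__(self, n_layers=12, d_model=256, n_heads=4, vocab=32000):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, ids, depth=None):
        depth = depth if depth is not None else len(self.layers)
        h = self.embed(ids)
        for block in self.layers[:depth]:              # run only the first `depth` blocks
            h = block(h)
        return self.lm_head(h)

# One training step: sample a depth so shallow and full sub-models share weights
# and are optimized together within a single fine-tuning cycle.
model = DepthSampledLM()
ids = torch.randint(0, 32000, (2, 16))
depth = random.choice([6, 9, 12])
logits = model(ids, depth=depth)
loss = F.cross_entropy(logits[:, :-1].reshape(-1, 32000), ids[:, 1:].reshape(-1))
loss.backward()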
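
Finally, the cross-layer sharing used by EchoAtt and ECHO-LLaMA can be illustrated by a "producer" attention layer that computes K/V once and "consumer" layers that reuse it, shrinking both the KV cache and the projection compute. The producer/consumer naming and the two-layer grouping below are assumptions for the sketch, not the dissertation's exact architecture.

# Minimal sketch of cross-layer KV sharing (producer/consumer split is an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVSelfAttention(nn.Module):
    """Self-attention that either computes its own K/V ("producer") or reuses
    K/V handed down from an earlier layer ("consumer"), so one KV-cache entry
    serves a whole group of layers."""
    def __init__(self, d_model=512, n_heads=8, owns_kv=True):
        super().__init__()
        self.n_heads, self.d_head, self.owns_kv = n_heads, d_model // n_heads, owns_kv
        self.q_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)
        if owns_kv:
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)

    def _heads(self, t):                               # (B, S, D) -> (B, H, S, Dh)
        b, s, _ = t.shape
        return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, shared_kv=None):
        q = self._heads(self.q_proj(x))
        if self.owns_kv:
            k, v = self._heads(self.k_proj(x)), self._heads(self.v_proj(x))
        else:
            k, v = shared_kv                           # reuse the producer layer's K/V
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(x.shape[0], x.shape[1], -1)
        return self.o_proj(out), (k, v)

# Example: a two-layer group where the second layer skips its K/V projections entirely.
x = torch.randn(2, 16, 512)
producer = SharedKVSelfAttention(owns_kv=True)
consumer = SharedKVSelfAttention(owns_kv=False)
h, kv = producer(x)
h, _ = consumer(h, shared_kv=kv)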
