Studying Transformer Behavior Under Markovian Input
Advisor: Ghadimi, Saeed
Publisher: University of Waterloo
Abstract
Transformers have achieved remarkable success in modeling sequential data, yet a principled theoretical understanding of their behavior remains limited. A recent framework has analyzed transformers through the lens of first-order Markov chains, providing four theorems that characterize the loss landscape and the conditions under which global minima and bad local minima arise. However, the theoretical analysis in that work was restricted to first-order processes, leaving open questions about the behavior of transformers on higher-order Markovian data.
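To make the first-order setting concrete, the sketch below samples the kind of binary first-order Markov chain that the referenced framework feeds to a transformer; the switch probabilities p and q, the function name, and the specific values are illustrative assumptions, not parameters taken from that work or from this thesis.

```python
import numpy as np

def sample_first_order_chain(p, q, length, rng=None):
    """Sample a binary first-order Markov chain with switch
    probabilities p = P(next=1 | current=0) and q = P(next=0 | current=1)."""
    rng = np.random.default_rng(rng)
    x = np.empty(length, dtype=np.int64)
    x[0] = rng.integers(2)  # uniform initial state
    for t in range(1, length):
        if x[t - 1] == 0:
            x[t] = rng.random() < p   # switch 0 -> 1 with probability p
        else:
            x[t] = rng.random() >= q  # stay at 1 with probability 1 - q
    return x

# Example: a length-20 sequence with illustrative switch probabilities.
print(sample_first_order_chain(p=0.2, q=0.7, length=20, rng=0))
```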
This thesis extends the theoretical framework to second-order Markov chains. Specifically, all four theorems originally established for first-order chains are formally proven for second-order chains, thereby broadening the mathematical foundation for analyzing transformers on sequential data. We further show that these theorems hold for another transformer architecture, the attention-only transformer, under both first- and second-order Markovian input. Finally, experimental evaluations demonstrate that the empirical learning dynamics for second-order chains align closely with the simplified second-order model proposed in prior work, confirming that the theoretical predictions hold in practice.
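As an illustration of the second-order setting, the following sketch samples a binary chain whose next symbol depends on the two preceding symbols. The transition table and helper name are hypothetical and only indicate the shape of the training data, not the exact experimental configuration used in the thesis.

```python
import numpy as np

def sample_second_order_chain(trans, length, rng=None):
    """Sample a binary second-order Markov chain.

    trans[(a, b)] is the probability that the next symbol is 1
    given that the two most recent symbols are (a, b)."""
    rng = np.random.default_rng(rng)
    x = list(rng.integers(2, size=2))  # uniform start for the first two symbols
    for _ in range(length - 2):
        p_one = trans[(x[-2], x[-1])]
        x.append(int(rng.random() < p_one))
    return np.array(x)

# Illustrative transition table: the next symbol tends to repeat x[t-2],
# a dependency a purely first-order model cannot capture.
trans = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.8, (1, 1): 0.9}
print(sample_second_order_chain(trans, length=32, rng=0))
```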
By closing the gap between first-order theory and second-order behavior, this study contributes to a deeper understanding of transformers’ sequential modeling capabilities. The findings highlight the conditions under which transformers correctly capture second-order dependencies and provide new insights into their limitations and potential extensions for higher-order processes.