Multi-Resolution and Asymmetric Implementation of Attention in Transformers
Date
2022-04-29
Authors
Chaudhry, Zaid
Advisor
Poupart, Pascal
Publisher
University of Waterloo
Abstract
Transformers are the state of the art for machine translation and grammar error correction. One of the most important components of a transformer is the attention layer, but attention requires significant computational power. We suggest a new way of looking at the "mixing" mechanism of tokens through a multi-resolution implementation of attention, which maintains inference quality while improving training and inference speed, thus getting the best of both worlds. This approximation can be applied in a symmetric or asymmetric manner, both within and across attention layers. We also suggest an interesting alternative to the softmax layer in attention, and we analyze several other hyperparameters in detail. For example, our experiments indicate that the attention layers can be asymmetric with respect to the number of heads while still achieving similar results; in many cases, reducing the number of heads improves inference results. We also explore the role of the weighting matrices for the query, key, and value vectors, and show that in the case of self-attention, the absence of these matrices causes the attention layers to collapse to an identity matrix.
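The collapse described above can be illustrated with a minimal sketch (not the thesis code; it assumes random, roughly uncorrelated token embeddings): with the query/key/value projections removed, the self-attention scores become softmax(X Xᵀ / √d). Each row's diagonal entry xᵢ·xᵢ is its largest dot product, so the softmax concentrates mass on the diagonal and the attention matrix drifts toward the identity.

import torch

torch.manual_seed(0)
n_tokens, d_model = 8, 64
X = torch.randn(n_tokens, d_model)      # hypothetical token embeddings

scores = X @ X.T / d_model ** 0.5       # no W_Q, W_K: Q = K = X
attn = torch.softmax(scores, dim=-1)    # row-stochastic attention matrix

print(torch.diag(attn))                 # diagonal entries dominate (close to 1)
output = attn @ X                       # no W_V: output stays close to X itself
print(torch.allclose(output, X, atol=1e-1))

Under these assumptions the diagonal score is on the order of √d while off-diagonal scores are near zero, so the softmax output is nearly the identity matrix and the layer passes its input through almost unchanged.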
Keywords
approximation, Attention, Machine Translation