Multi-Resolution and Asymmetric Implementation of Attention in Transformers

Date

2022-04-29

Authors

Chaudhry, Zaid

Advisor

Poupart, Pascal

Publisher

University of Waterloo

Abstract

Transformers are the state of the art for machine translation and grammatical error correction. Among their most important components are the attention layers, but these require significant computational power. We propose a new way of looking at the token “mixing” mechanism through a multi-resolution implementation of attention, which preserves inference quality while improving training and inference speed, thus getting the best of both worlds. This approximation can be applied in a symmetric or asymmetric manner within and across attention layers. We also suggest an alternative to the softmax layer in attention, and analyze several other hyperparameters in detail. For example, our experiments indicate that the number of heads can vary across attention layers while still achieving similar results; in many cases, reducing the number of heads improves inference results. We also explore the role of the weight matrices for the query, key, and value vectors, and show that in self-attention their absence causes the attention layers to collapse to an identity matrix.
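
For reference, below is a minimal NumPy sketch of standard scaled dot-product self-attention, showing where the query/key/value weight matrices and the softmax discussed in the abstract enter the computation. The names and dimensions are illustrative assumptions; this is the baseline mechanism, not the thesis's multi-resolution or asymmetric variants.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Standard single-head self-attention.

    X: (seq_len, d_model) token representations.
    W_q, W_k, W_v: (d_model, d_head) query/key/value weight matrices.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)        # row-stochastic token "mixing" matrix
    return weights @ V

# Illustrative shapes only (hypothetical values).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 16
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (8, 16)
```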

Keywords

approximation, Attention, Machine Translation
