Scaling Laws for Compute Optimal Biosignal Transformers

Date

2024-08-20

Advisor

Tripp, Bryan

Publisher

University of Waterloo

Abstract

Scaling laws that predict the optimal balance between the number of model parameters and the number of training tokens for a fixed compute budget have recently been developed for language transformers, allowing model developers to allocate their compute budgets so as to achieve optimal performance. This thesis develops such scaling laws for the Biosignal Transformer trained separately on accelerometer data and on EEG data, applying two methods previously used to derive scaling laws for language transformer models: the iso-FLOP curve method and the parametric loss function method. The Biosignal Transformer is a transformer model designed specifically for tasks that take biosignals such as EEG, ECG, and accelerometer data as input; for example, it can be trained to detect or classify seizures from EEG signals. It is of particular interest because it is designed to use unsupervised pre-training on large unlabelled biosignal datasets to improve performance on downstream tasks with smaller labelled fine-tuning datasets. This work develops scaling laws that minimize unsupervised pre-training loss for a fixed compute budget. Results show that the developed scaling laws successfully predict a balance between the number of parameters and the number of training tokens that minimizes pre-training loss at compute budgets five times larger than those used to fit them. Researchers who intend to scale up the Biosignal Transformer should use these scaling laws to attain optimal pre-training loss from a given compute budget when applying unsupervised pre-training.
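To illustrate the kind of prediction the parametric loss function method produces, the sketch below (not taken from the thesis) fits nothing and instead assumes a Chinchilla-style form L(N, D) = E + A/N^alpha + B/D^beta together with the common approximation C ≈ 6ND for training compute; all constants are hypothetical placeholders rather than fitted values, and the function names are mine. Given a fixed budget C, it numerically finds the parameter count N and token count D that minimize the predicted pre-training loss.

# Minimal sketch, assuming a Chinchilla-style parametric loss and C ≈ 6*N*D.
# Constants E, A, B, alpha, beta are illustrative placeholders, not values
# reported in the thesis.
import numpy as np
from scipy.optimize import minimize_scalar

E, A, B, alpha, beta = 1.7, 400.0, 1800.0, 0.34, 0.28  # hypothetical fit

def predicted_loss(N, D):
    # Predicted pre-training loss for N parameters and D training tokens.
    return E + A / N**alpha + B / D**beta

def compute_optimal_split(C):
    # For a fixed compute budget C (FLOPs), minimize predicted_loss over N
    # with D constrained by D = C / (6 * N); search over log N for stability.
    res = minimize_scalar(
        lambda logN: predicted_loss(np.exp(logN), C / (6.0 * np.exp(logN))),
        bounds=(np.log(1e6), np.log(1e12)),
        method="bounded",
    )
    N_opt = float(np.exp(res.x))
    return N_opt, C / (6.0 * N_opt)

N_opt, D_opt = compute_optimal_split(1e19)
print(f"N* ≈ {N_opt:.3g} parameters, D* ≈ {D_opt:.3g} tokens")

The iso-FLOP curve method reaches a similar allocation empirically: it trains several model sizes at each of a few fixed compute budgets, fits the loss-versus-size curve at each budget, and extrapolates the locus of minima to larger budgets.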

Keywords

biosignal, compute optimal, unsupervised pre-training, scaling law, transformer
