Bidirectional TopK Sparsification for Distributed Learning
Date
2022-05-27
Authors
Zou, William
Advisor
De Sterck, Hans
Liu, Jun
Publisher
University of Waterloo
Abstract
Training large neural networks requires a large amount of time. To speed up the
process, distributed training is often used. One of the largest bottlenecks in distributed
training is communicating gradients across different nodes. Different gradient compression
techniques have been proposed to alleviate the communication bottleneck, including
topK gradient sparsification, which truncates the gradient to the top K components before
sending it to other nodes.
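For concreteness, the following is a minimal sketch (not taken from the thesis) of how a topK compression operator might look: it keeps only the K largest-magnitude components of the gradient and zeros out the rest before communication. The function name and the use of NumPy are assumptions made purely for illustration.

    import numpy as np

    def topk_sparsify(grad, k):
        """Illustrative sketch: keep the k largest-magnitude components of
        the gradient and zero out the rest before sending it to other nodes."""
        flat = grad.ravel()
        if k >= flat.size:
            return grad.copy()
        # indices of the k entries with the largest absolute value
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        sparse = np.zeros_like(flat)
        sparse[idx] = flat[idx]
        return sparse.reshape(grad.shape)
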
Some authors have adapted topK gradient sparsification to the parameter-server framework
by applying topK compression in both the worker-to-server and server-to-worker directions,
rather than only the worker-to-server direction. Current intuition
and analysis suggest that adding this extra compression degrades the convergence of the model. We provide a simple counterexample where iterating with bidirectional topK
SGD achieves better convergence than iterating with unidirectional topK SGD. We explain
this example with the theoretical framework developed by Alistarh et al., remove a critical
assumption the authors made in their non-convex convergence analysis of topK SGD,
and show that bidirectional topK SGD can achieve the same convergence bound
as unidirectional topK SGD under assumptions that are potentially easier to satisfy.
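To illustrate the bidirectional setup discussed above, here is a minimal sketch of one synchronous parameter-server round, assuming workers send topK-compressed gradients to the server and the server compresses the aggregated gradient again with topK before broadcasting the update back. All names (topk, bidirectional_round) are hypothetical and not taken from the thesis; error-feedback and other practical refinements are omitted.

    import numpy as np

    def topk(v, k):
        # Hypothetical helper: zero all but the k largest-magnitude entries of v.
        out = np.zeros_like(v)
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
        return out

    def bidirectional_round(params, local_grads, k, lr):
        # Workers compress their gradients before sending them to the server
        # (worker-to-server compression).
        compressed = [topk(g, k) for g in local_grads]
        averaged = sum(compressed) / len(compressed)
        # The server compresses the aggregate again before broadcasting it
        # (server-to-worker compression), so every worker applies the same
        # sparse update.
        update = topk(averaged, k)
        return params - lr * update
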
We experimentally evaluate unidirectional topK SGD against bidirectional topK SGD and
show that, with careful tuning, models trained with bidirectional topK SGD perform
as well as models trained with unidirectional topK SGD. Finally, we provide empirical
evidence that the amount of communication saved by adding server-to-worker topK
compression grows almost linearly with the number of workers.
Keywords
gradient compression, distributed learning, analysis of stochastic gradient descent