Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

29 Sep 2021  ·  Pengcheng Li, Yixin Guo, Yawen Zhang, Qinggang Zhou ·

State-of-the-art deep learning algorithms rely on distributed training to tackle the increasing model sizes and volumes of training data. In mini-batch Stochastic Gradient Descent (SGD), workers must halt forward/backward propagation and wait for gradients to be synchronized across all workers before processing the next batch. This synchronous execution model exposes the overhead of gradient communication when a large number of workers participate in a distributed training system. To this end, we propose a new SGD algorithm with delayed averaging, namely DaSGD, which fully parallelizes SGD with forward/backward propagation to hide 100% of the gradient communication. By adjusting the gradient update scheme, the algorithm uses hardware resources more efficiently and reduces its reliance on high-throughput interconnects. Both the theoretical analysis and the experimental results in this paper show that its convergence rate of $O(1/\sqrt{K})$ matches that of mini-batch SGD. An analytical model further shows that DaSGD enables linear performance scalability with the cluster size.
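
The core idea of delayed averaging, that workers keep updating with their local gradients while the all-reduce of an earlier step's gradients completes in the background, can be illustrated with a toy simulation. The sketch below is not the paper's exact DaSGD update rule; the quadratic toy loss, the worker count, the delay of two steps, and the "retract and replace" correction are illustrative assumptions chosen to show how a delayed averaged gradient can be folded in without blocking forward/backward computation.

```python
import numpy as np

# Minimal single-process simulation of delayed gradient averaging.
# Worker count, delay, learning rate, and the quadratic toy loss are
# illustrative assumptions, not values or choices from the paper.
NUM_WORKERS = 4
DELAY = 2          # averaged gradients arrive DELAY steps late
LR = 0.1
STEPS = 50

rng = np.random.default_rng(0)
target = rng.normal(size=8)                          # optimum of the toy loss
models = [np.zeros(8) for _ in range(NUM_WORKERS)]   # one model replica per worker
pending = []                                         # (step, per-worker gradients) awaiting averaging

def local_gradient(w):
    """Gradient of 0.5 * ||w - target||^2 plus noise, standing in for a mini-batch gradient."""
    return (w - target) + 0.01 * rng.normal(size=w.shape)

for step in range(STEPS):
    # 1) Each worker computes its gradient and updates locally right away,
    #    so forward/backward never waits on communication.
    grads = [local_gradient(m) for m in models]
    for k in range(NUM_WORKERS):
        models[k] -= LR * grads[k]
    pending.append((step, grads))

    # 2) The all-reduce launched DELAY steps ago "completes" now; apply the
    #    delayed averaged gradient and retract the stale local-only term.
    if pending and pending[0][0] == step - DELAY:
        _, old_grads = pending.pop(0)
        avg = np.mean(old_grads, axis=0)
        for k in range(NUM_WORKERS):
            models[k] -= LR * (avg - old_grads[k])   # swap the old local gradient for the average

loss = np.mean([0.5 * np.sum((m - target) ** 2) for m in models])
print(f"final average loss across replicas: {loss:.4f}")
```

In a real distributed setting, step 2 would be a non-blocking all-reduce issued at step `t` and consumed at step `t + DELAY`, which is what lets the gradient communication overlap entirely with compute.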

