Decentralized Asynchronous Gradient Sharing for Bandwidth-Efficient Collaborative Model Training
Keywords:
Decentralized Training, Asynchronous SGD, Gradient Sparsification, Gossip Protocol, Bandwidth Efficiency, Distributed Deep Learning, Peer-to-Peer Training.
Abstract
Centralized parameter-server topologies for distributed model training suffer from two problems: a communication bottleneck at the aggregation point, and synchronization barriers at which fast workers must wait for stragglers. Decentralized training over a peer-to-peer topology removes the central aggregation point, but asynchronous updates introduce stale gradients and gossip-based parameter sharing incurs excessive communication overhead. This work proposes DAGrad, a decentralized asynchronous gradient sharing system for bandwidth-efficient collaborative training, built on three components: (i) gossip-based partial gradient exchange, in which pairs of peers exchange only the top 1% of gradient entries by magnitude; (ii) an age-weighted update strategy that penalizes stale gradients; and (iii) dynamic peer selection, which prioritizes exchanges with peers whose gradients are maximally complementary to a worker's own. Experiments with ResNet-50/ImageNet and BERT-base/GLUE in 32- and 128-worker setups show that DAGrad reduces inter-worker communication bandwidth to 29% of that of synchronous dense training while reaching 91.9% accuracy (within 0.2% of synchronous dense training), and that it scales to 128 workers with 87% parallel efficiency.
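The three components can be illustrated with a minimal Python sketch. It is not the DAGrad implementation: the abstract does not specify the staleness penalty or the complementarity measure, so the inverse-age weight `1 / (1 + decay * age)` and the "fewest shared top-k indices" rule below are assumptions chosen for illustration, and all function names are hypothetical.

```python
import heapq
import math


def top_k_sparsify(grad, fraction=0.01):
    """Keep only the largest-magnitude entries of a flat gradient.

    Returns a dict mapping index -> value for the retained entries.
    """
    k = max(1, math.ceil(len(grad) * fraction))
    top = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return {i: grad[i] for i in top}


def staleness_weight(age, decay=1.0):
    """Assumed penalty: the weight shrinks as the gradient gets older.

    `age` is the number of local steps since the gradient was computed.
    """
    return 1.0 / (1.0 + decay * age)


def apply_sparse_update(params, sparse_grad, age, lr=0.1):
    """Apply a peer's sparse gradient in place, scaled by its staleness weight."""
    w = staleness_weight(age)
    for i, g in sparse_grad.items():
        params[i] -= lr * w * g
    return params


def pick_complementary_peer(own_sparse, peer_sparse_by_id):
    """Assumed complementarity rule: prefer the peer whose retained
    indices overlap least with our own (ties broken by peer id)."""
    own = set(own_sparse)
    return min(
        peer_sparse_by_id,
        key=lambda pid: (len(own & set(peer_sparse_by_id[pid])), pid),
    )
```

In a gossip round under these assumptions, a worker would sparsify its local gradient with `top_k_sparsify`, choose a partner with `pick_complementary_peer`, and apply the received entries with `apply_sparse_update`, passing the age of the received gradient so that older contributions move the parameters less.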




