Scalable Sparse Model Training Via Computationally Frugal Gradient Approximation Techniques
Keywords: Sparse Gradient Training, Gradient Compression, Top-K Sparsification, Error Feedback, Distributed Training, Memory Efficiency, Scalable Deep Learning.

Abstract
Training large neural networks with dense, full-precision gradients demands prohibitive memory and compute, so it does not scale on commodity hardware or on clusters whose interconnects have narrow links. Sparse gradient methods aim to make training scalable by transmitting and aggregating only the most significant gradient entries, yet they either lose accuracy at high sparsity levels or require impractically large error buffers. This paper proposes FruGrad, a computationally frugal gradient approximation framework for scalable sparse model training. FruGrad consists of three components: (i) an adaptive top-K gradient selector with momentum-corrected error feedback; (ii) a structured block-sparsity mask whose structure matches the hardware memory mapping and thus enables cache-efficient aggregation; and (iii) a sparsity schedule that increases gradient compression as training progresses. Extensive experiments on ResNet-50 (ImageNet), BERT-base (GLUE), and GPT-2 (WikiText-103) show that FruGrad achieves up to a 2.7x speedup, reduces peak memory to 44% of that of dense training, and retains 99.1% of baseline model accuracy at 94% gradient sparsity.
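To make the idea of top-K selection with error feedback concrete, the following is a minimal sketch of the generic technique, not FruGrad's actual selector: it omits the momentum correction, the adaptive choice of K, and the block-structured mask. The function name `topk_with_error_feedback` and its parameters are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def topk_with_error_feedback(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[float], List[float]]:
    """Hypothetical helper: one step of plain top-K sparsification.

    Returns (sparse_grad, new_residual). Assumes a flat gradient vector.
    """
    if len(grad) != len(residual):
        raise ValueError("grad and residual must have the same length")

    # Add back the entries dropped in earlier steps (the error buffer).
    corrected = [g + r for g, r in zip(grad, residual)]

    # Clamp k to a valid range, then pick the k largest-magnitude entries.
    k = max(0, min(k, len(corrected)))
    order = sorted(range(len(corrected)),
                   key=lambda i: abs(corrected[i]), reverse=True)
    keep = set(order[:k])

    # Only the kept entries would be transmitted and aggregated.
    sparse = [v if i in keep else 0.0 for i, v in enumerate(corrected)]
    # Everything else is carried over to the next step.
    new_residual = [0.0 if i in keep else v for i, v in enumerate(corrected)]
    return sparse, new_residual


# Two consecutive steps on a toy 4-element gradient.
residual = [0.0] * 4
sparse, residual = topk_with_error_feedback([0.5, -2.0, 0.1, 1.5], residual, k=2)
sparse2, residual = topk_with_error_feedback([0.5, 0.0, 0.0, 0.0], residual, k=1)
```

In the second step the first coordinate is selected only because the value dropped in the first step was added back from the residual, which illustrates how error feedback delays small gradient entries rather than discarding them. The residual has the same size as the gradient, which is the error-buffer memory cost the abstract refers to.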




