Variance Reduction in SGD by Distributed Importance Sampling
Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers searches for the most informative examples in parallel while a single worker updates the model on examples selected by importance sampling. The model is thus updated using an unbiased estimate of the gradient, which also has minimum variance when the sampling proposal is proportional to the L2-norm of the gradient. We show experimentally that this method reduces gradient variance even in a context where the cost of synchronization across machines cannot be ignored, and where the factors for importance sampling are not updated instantly across the training set.
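The importance-sampling claim in the abstract can be illustrated on a single machine with a small sketch. Everything below is an assumption made for illustration, not the paper's implementation: the least-squares model, the synthetic data, and the hypothetical helper names (`per_example_grads`, `proposal_from_grad_norms`, `is_gradient_estimate`, `single_sample_variance`). The distributed search for informative examples and the stale importance factors mentioned in the abstract are not modeled. Each example i is drawn with probability q_i proportional to its gradient norm and reweighted by 1 / (N q_i), which keeps the estimate unbiased.

```python
import numpy as np


def per_example_grads(w, X, y):
    """Per-example gradients of the squared loss 0.5 * (x.w - y)^2 (toy model)."""
    residual = X @ w - y
    return residual[:, None] * X  # shape (N, d)


def proposal_from_grad_norms(grads):
    """Sampling proposal q_i proportional to the L2 norm of each gradient.

    Assumes no gradient is exactly zero, so every q_i is positive.
    """
    norms = np.linalg.norm(grads, axis=1)
    return norms / norms.sum()


def is_gradient_estimate(grads, q, batch_size, rng):
    """Importance-sampled minibatch estimate of the mean gradient."""
    n = grads.shape[0]
    idx = rng.choice(n, size=batch_size, p=q)
    weights = 1.0 / (n * q[idx])  # importance weights keep the estimate unbiased
    return (weights[:, None] * grads[idx]).mean(axis=0)


def single_sample_variance(grads, q):
    """Exact trace of the covariance of the one-sample estimator g_i / (N q_i)."""
    n = grads.shape[0]
    mean_grad = grads.mean(axis=0)
    second_moment = np.sum(np.sum(grads ** 2, axis=1) / q) / n ** 2
    return second_moment - mean_grad @ mean_grad


rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))   # synthetic inputs (assumption)
y = rng.normal(size=n)        # synthetic targets (assumption)
w = np.zeros(d)

grads = per_example_grads(w, X, y)
q_uniform = np.full(n, 1.0 / n)
q_is = proposal_from_grad_norms(grads)

# Exact expectation of the one-sample estimator under q_is.
expected_is = (q_is[:, None] * grads / (n * q_is[:, None])).sum(axis=0)

var_uniform = single_sample_variance(grads, q_uniform)
var_is = single_sample_variance(grads, q_is)
estimate = is_gradient_estimate(grads, q_is, batch_size=32, rng=rng)

print("variance trace, uniform sampling:   ", var_uniform)
print("variance trace, importance sampling:", var_is)
```

In this toy setting the expectation of the reweighted estimator equals the full-batch mean gradient, and the variance trace under the gradient-norm proposal is no larger than under uniform sampling, which follows from the Cauchy-Schwarz inequality. The variances here are computed in closed form over the whole dataset, which is only practical for a small example like this one.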
Forward citations
Cited by 5 Pith papers
- Disagreement-Regularized Importance Sampling for Adversarial Label Corruption
  DR-IS selects low-contamination subsets via bounded rank-disagreement in proxy ensembles under an ε-contamination model, with O(√(log(N/δ)/K)) concentration rates that certify separation when the expectation gap Δ' is...
- Batch Loss Score for Dynamic Data Pruning
  BLS approximates per-sample loss importance via EMA of batch losses, enabling simple and effective dynamic pruning of 20-50% samples losslessly across many datasets and models.
- Cost-Aware Learning
  Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.
- Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
  Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...
- Submodular Batch Selection for Training Deep Neural Networks
  A greedy submodular maximization method for mini-batch selection in DNN training yields better generalization than SGD on standard datasets.