Variance Reduction in SGD by Distributed Importance Sampling

Aaron Courville; Alex Lamb; Chinnadhurai Sankar; Guillaume Alain; Yoshua Bengio

arxiv: 1511.06481 · v7 · pith:GOCZ47QHnew · submitted 2015-11-20 · 📊 stat.ML · cs.LG

Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, Yoshua Bengio

classification 📊 stat.ML cs.LG

keywords sampling, gradient, importance, variance, across, examples, informative, learning

Humans are able to accelerate their learning by selecting training materials that are the most informative and at the appropriate level of difficulty. We propose a framework for distributing deep learning in which one set of workers search for the most informative examples in parallel while a single worker updates the model on examples selected by importance sampling. This leads the model to update using an unbiased estimate of the gradient which also has minimum variance when the sampling proposal is proportional to the L2-norm of the gradient. We show experimentally that this method reduces gradient variance even in a context where the cost of synchronization across machines cannot be ignored, and where the factors for importance sampling are not updated instantly across the training set.
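The abstract's claim about an unbiased, lower-variance gradient estimate can be illustrated with a minimal sketch. It is not the authors' implementation: the per-example gradients, the function names, and the use of the trace of the covariance as the variance measure are all assumptions made for illustration. An example i is drawn with probability q_i and its gradient is rescaled by 1 / (N q_i), which keeps the expectation equal to the average gradient for any proposal with q_i > 0.

```python
import math
import random

# Hypothetical per-example gradients (N = 4 examples, 2 parameters).
# These values are made up for illustration only.
grads = [(0.1, 0.0), (0.0, 0.2), (3.0, 4.0), (1.0, 0.0)]


def l2_norm(g):
    """Euclidean (L2) norm of one gradient vector."""
    return math.sqrt(sum(x * x for x in g))


def true_gradient(grads):
    """Average gradient over all examples (the quantity SGD estimates)."""
    n, dim = len(grads), len(grads[0])
    return tuple(sum(g[d] for g in grads) / n for d in range(dim))


def uniform_proposal(grads):
    """Ordinary SGD: every example is equally likely."""
    return [1.0 / len(grads)] * len(grads)


def norm_proportional_proposal(grads):
    """Proposal proportional to each gradient's L2 norm (assumes all norms > 0)."""
    norms = [l2_norm(g) for g in grads]
    total = sum(norms)
    return [n / total for n in norms]


def estimator_mean_and_variance(grads, q):
    """Exact mean and total variance (trace of the covariance) of the
    single-sample estimator g_i / (N * q_i), with i drawn with probability q_i."""
    n, dim = len(grads), len(grads[0])
    mean = [0.0] * dim
    second_moment = 0.0
    for g, qi in zip(grads, q):
        w = 1.0 / (n * qi)  # importance weight
        for d in range(dim):
            mean[d] += qi * w * g[d]
        second_moment += qi * (w * l2_norm(g)) ** 2
    variance = second_moment - sum(m * m for m in mean)
    return tuple(mean), variance


def sample_estimate(grads, q, rng):
    """Draw one example from q and return its reweighted gradient."""
    i = rng.choices(range(len(grads)), weights=q, k=1)[0]
    w = 1.0 / (len(grads) * q[i])
    return tuple(w * x for x in grads[i])


target = true_gradient(grads)
mean_unif, var_unif = estimator_mean_and_variance(grads, uniform_proposal(grads))
mean_prop, var_prop = estimator_mean_and_variance(
    grads, norm_proportional_proposal(grads)
)
one_draw = sample_estimate(grads, norm_proportional_proposal(grads), random.Random(0))
```

In this toy setting both proposals give an estimator whose mean equals `target`, while `var_prop` is smaller than `var_unif`. The sketch computes per-example gradient norms exactly and instantly; the abstract notes that in the distributed setting these factors are not updated instantly across the training set.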

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Disagreement-Regularized Importance Sampling for Adversarial Label Corruption
cs.LG 2026-05 unverdicted novelty 7.0

DR-IS selects low-contamination subsets via bounded rank-disagreement in proxy ensembles under an ε-contamination model, with O(√(log(N/δ)/K)) concentration rates that certify separation when the expectation gap Δ' is...

Batch Loss Score for Dynamic Data Pruning
cs.LG 2026-04 unverdicted novelty 7.0

BLS approximates per-sample loss importance via an EMA of batch losses, enabling simple and effective dynamic pruning of 20-50% of samples without loss of accuracy across many datasets and models.

Cost-Aware Learning
cs.LG 2026-04 unverdicted novelty 6.0

Cost-aware SGD achieves target error with lower total sampling cost than standard methods, and Cost-Aware GRPO reduces token usage by up to 30% in LLM reinforcement learning while matching baseline performance.

Data Warmup: Complexity-Aware Curricula for Efficient Diffusion Training
cs.LG 2026-04 conditional novelty 6.0

Data Warmup accelerates diffusion training on ImageNet by scheduling images from low to high complexity via a foreground-based metric and temperature-controlled sampler, improving FID and IS scores faster than uniform...

Submodular Batch Selection for Training Deep Neural Networks
cs.LG 2019-06 unverdicted novelty 5.0

A greedy submodular maximization method for mini-batch selection in DNN training yields better generalization than SGD on standard datasets.