Submodular Batch Selection for Training Deep Neural Networks
Pith reviewed 2026-05-25 19:28 UTC · model grok-4.3
The pith
Submodular batch selection improves generalization of deep neural networks over standard stochastic gradient descent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a novel submodular formulation for mini-batch selection that captures both the informativeness of individual samples and the diversity of the selected subset. They design an efficient greedy algorithm for this NP-hard problem and show through extensive experiments that deep models trained with this strategy generalize better than those trained with standard SGD or a popular baseline sampling strategy, across various learning rates, batch sizes, and distance metrics.
What carries the argument
A submodular function that scores candidate batches by jointly rewarding sample informativeness and subset diversity, solved approximately by a greedy algorithm.
If this is right
- Models reach higher test accuracy than with random batch selection in SGD.
- The improvement appears across different learning rates and batch sizes.
- The gains persist when different distance metrics are used to compare samples.
- The greedy algorithm supplies a practical way to solve the otherwise intractable selection problem.
Where Pith is reading between the lines
- The same selection rule might let a network reach a target accuracy level after fewer total gradient steps.
- Because the method is defined only in terms of a scoring function on samples, it could be tried on other iterative training loops that also use batches, such as in reinforcement learning or language model fine-tuning.
- If the greedy step remains fast on very large datasets, the approach could become a standard preprocessing stage before each epoch rather than a one-time choice.
Load-bearing premise
The chosen submodular function actually captures both the informativeness of individual samples and the diversity of the selected subset.
What would settle it
Train identical network architectures on identical data using the submodular batch selector versus ordinary random selection and check whether final test accuracy is higher for the submodular method.
Figures
read the original abstract
Mini-batch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a mini-batch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and diversity of the whole subset. We design an efficient, greedy algorithm which can give high-quality solutions to this NP-hard combinatorial optimization problem. Our extensive experiments on standard datasets show that the deep models trained using the proposed batch selection strategy provide better generalization than Stochastic Gradient Descent as well as a popular baseline sampling strategy across different learning rates, batch sizes, and distance metrics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a submodular function maximization approach for mini-batch selection during DNN training. The formulation is designed to jointly capture per-sample informativeness and subset diversity; an efficient greedy algorithm is presented for the resulting NP-hard problem. Extensive experiments on standard datasets are reported to show that models trained with the proposed selection strategy achieve better generalization than vanilla SGD and a popular baseline sampler, with the advantage holding across learning rates, batch sizes, and distance metrics.
Significance. If the empirical claims are robust, the work would supply a principled, optimization-based alternative to uniform random sampling that could improve training outcomes without increasing computational cost. The submodular framing is attractive because it inherits standard greedy approximation guarantees, which is a concrete strength relative to purely heuristic batch-selection heuristics.
major comments (2)
- [Experiments section] Experiments section (and abstract): the central claim of consistent outperformance in generalization is presented without any report of multiple random seeds, error bars, variance across runs, or statistical significance tests. Because DNN training is highly stochastic, single-run accuracy differences across LR/batch-size settings cannot reliably support the claim that the submodular strategy is superior.
- [Section 3] Section 3 (submodular formulation): the weakest modeling assumption—that the chosen submodular objective simultaneously encodes both informativeness and diversity—is asserted but not supported by any ablation that isolates the contribution of each term or compares against variants that omit one component.
minor comments (2)
- [Section 3] Notation for the submodular function and the distance metric used inside it should be introduced once and used consistently; several symbols appear to be redefined in passing.
- [Abstract] The abstract states 'extensive experiments' yet supplies no concrete dataset names, model architectures, or evaluation metrics; these details belong in the abstract or a dedicated experimental summary paragraph.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate the changes planned for the revised manuscript.
read point-by-point responses
-
Referee: [Experiments section] Experiments section (and abstract): the central claim of consistent outperformance in generalization is presented without any report of multiple random seeds, error bars, variance across runs, or statistical significance tests. Because DNN training is highly stochastic, single-run accuracy differences across LR/batch-size settings cannot reliably support the claim that the submodular strategy is superior.
Authors: We agree that the absence of multiple random seeds, error bars, and statistical tests weakens the reliability of the generalization claims, given the stochasticity of DNN training. In the revision we will rerun the main experiments (across the reported learning rates and batch sizes) with at least five independent random seeds, report mean test accuracies together with standard deviations, and include paired t-tests or Wilcoxon tests to assess statistical significance of the observed differences. revision: yes
-
Referee: [Section 3] Section 3 (submodular formulation): the weakest modeling assumption—that the chosen submodular objective simultaneously encodes both informativeness and diversity—is asserted but not supported by any ablation that isolates the contribution of each term or compares against variants that omit one component.
Authors: The objective is explicitly defined as the sum of a coverage (diversity) term and a modular (informativeness) term; this additive construction is the standard mechanism by which submodular functions jointly encode both properties. Nevertheless, we acknowledge that an explicit ablation isolating each component would strengthen the modeling justification. We will add this ablation study in the revised Section 3, comparing the full objective against the diversity-only and informativeness-only variants on the same datasets. revision: yes
Circularity Check
No significant circularity; method and claims are independently formulated and empirically validated.
full rationale
The paper presents a submodular objective that combines per-sample informativeness with subset diversity, solved via a standard greedy algorithm for the NP-hard problem. No equations or claims reduce by construction to fitted parameters renamed as predictions, self-definitions, or load-bearing self-citations. The derivation chain (formulation → algorithm → experiments) is self-contained against external benchmarks; generalization improvements are asserted via direct comparison to SGD and a baseline on standard datasets, without circular reduction. This is the expected honest non-finding for an applied methods paper whose core contribution is a new objective rather than a derived identity.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Submodular functions admit efficient greedy maximization with approximation guarantees.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
F(S) = ∑ λ1U(xi)+λ2R(xi)+λ3MC(xi)+λ4FM(xi) ... Lemma 1. The score function F(.) ... is submodular. ... Theorem 1. F(S) ≥ (1−1/e)F(S*)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We formulate batch selection as solving a submodular optimization problem
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Variance Reduction in SGD by Distributed Importance Sampling
Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[2]
Katyusha: The first direct acceleration of stochastic gradient methods
Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research , 18(1):8194--8244, 2017
work page 2017
-
[3]
Subset replay based continual learning for scalable improvement of autonomous systems
Pratik Prabhanjan Brahma and Adrienne Othon. Subset replay based continual learning for scalable improvement of autonomous systems. In CVPR Workshops , pages 1179--11798. IEEE, 2018
work page 2018
-
[4]
Adaptive batch mode active learning
Shayok Chakraborty, Vineeth Balasubramanian, and Sethuraman Panchanathan. Adaptive batch mode active learning. IEEE TNNLS , 26(8):1747--1760, 2015
work page 2015
-
[5]
Active bias: Training more accurate neural networks by emphasizing high variance samples
Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems , pages 1002--1012, 2017
work page 2017
-
[6]
Algorithms for subset selection in linear regression
Abhimanyu Das and David Kempe. Algorithms for subset selection in linear regression. In ACM STOC , pages 45--54. ACM, 2008
work page 2008
-
[7]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR , pages 770--778, 2016
work page 2016
-
[8]
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[9]
Accelerating stochastic gradient descent using predictive variance reduction
Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS , pages 315--323, 2013
work page 2013
-
[10]
Biased Importance Sampling for Deep Neural Network Training
Angelos Katharopoulos and Fran c ois Fleuret. Biased importance sampling for deep neural network training. arXiv preprint arXiv:1706.00043 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[11]
Not all samples are created equal: Deep learning with importance sampling
Angelos Katharopoulos and Fran c ois Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv:1803.00942 , 2018
-
[12]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[13]
Learning multiple layers of features from tiny images
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009
work page 2009
-
[14]
Efficient mini-batch training for stochastic optimization
Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining , pages 661--670. ACM, 2014
work page 2014
-
[15]
Fast DPP Sampling for Nystr\"om with Application to Kernel Methods
Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast dpp sampling for nystr " om with application to kernel methods. arXiv preprint arXiv:1603.06052 , 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[16]
A class of submodular functions for document summarization
Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In ACL , pages 510--520. Association for Computational Linguistics, 2011
work page 2011
-
[17]
Online batch selection for faster training of neural networks
Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. ICLR Workshops , 2015
work page 2015
-
[18]
Accelerated greedy algorithms for maximizing submodular set functions
Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques , pages 234--243. Springer, 1978
work page 1978
-
[19]
Distributed submodular maximization: Identifying representative elements in massive data
Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems , pages 2049--2057, 2013
work page 2049
-
[20]
Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondr \'a k, and Andreas Krause. Lazier than lazy greedy. In AAAI , 2015
work page 2015
-
[21]
An analysis of approximations for maximizing submodular set functions—i
George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming , 14(1):265--294, 1978
work page 1978
-
[22]
Reading digits in natural images with unsupervised feature learning
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop , page 5, 2011
work page 2011
-
[23]
Automatic differentiation in pytorch
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W , 2017
work page 2017
-
[24]
Greedy sensor selection: Leveraging submodularity
Manohar Shamaiah, Siddhartha Banerjee, and Haris Vikalo. Greedy sensor selection: Leveraging submodularity. In 49th IEEE conference on decision and control (CDC) , pages 2572--2577. IEEE, 2010
work page 2010
-
[25]
Submodular importance sampling for neural network training
Krishna Kant Singh and Vineeth N Balasubramanian. Submodular importance sampling for neural network training. Master's thesis, Indian Institute of Technology Hyderabad, 2018
work page 2018
-
[26]
Self-Paced Learning with Adaptive Deep Visual Embeddings
Vithursan Thangarasa and Graham W Taylor. Self-paced learning with adaptive deep visual embeddings. arXiv preprint arXiv:1807.09200 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[27]
Submodular subset selection for large-scale speech training data
Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris Bartels, and Jeff Bilmes. Submodular subset selection for large-scale speech training data. In ICASSP , pages 3311--3315. IEEE, 2014
work page 2014
-
[28]
Submodularity in data subset selection and active learning
Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In ICML , pages 1954--1963, 2015
work page 1954
-
[29]
Determinantal Point Processes for Mini-Batch Diversification
Cheng Zhang, Hedvig Kjellstrom, and Stephan Mandt. Determinantal point processes for mini-batch diversification. arXiv preprint arXiv:1705.00607 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[30]
Active Mini-Batch Sampling using Repulsive Point Processes
Cheng Zhang, Cengiz \"O ztireli, Stephan Mandt, and Giampiero Salvi. Active mini-batch sampling using repulsive point processes. arXiv:1804.02772 , 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[31]
Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling
Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080 , 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[32]
Stochastic optimization with importance sampling for regularized loss minimization
Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML , pages 1--9, 2015
work page 2015
-
[33]
Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity
Tianyi Zhou and Jeff Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In ICLR , 2018
work page 2018
-
[34]
" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.