pith. sign in

arxiv: 1906.08771 · v1 · pith:TS33M47Ynew · submitted 2019-06-20 · 💻 cs.LG · stat.ML

Submodular Batch Selection for Training Deep Neural Networks

Pith reviewed 2026-05-25 19:28 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords submodular optimizationbatch selectionmini-batch gradient descentdeep neural networksgeneralizationinformativenessdiversitygreedy algorithm
0
0 comments X

The pith

Submodular batch selection improves generalization of deep neural networks over standard stochastic gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a way to choose mini-batches for neural network training by maximizing a submodular function that scores both how informative each individual sample is and how diverse the whole batch is. It supplies an efficient greedy algorithm to find good batches even though the exact optimization problem is NP-hard. Experiments on standard datasets show that networks trained with these batches reach higher test accuracy than networks trained with random batches under stochastic gradient descent, and the gains appear across multiple learning rates, batch sizes, and distance measures between samples. A reader would care because almost all modern deep learning relies on random mini-batch selection, so replacing that step with a better rule could raise final model quality without any change to the network architecture or the optimizer itself.

Core claim

The authors introduce a novel submodular formulation for mini-batch selection that captures both the informativeness of individual samples and the diversity of the selected subset. They design an efficient greedy algorithm for this NP-hard problem and show through extensive experiments that deep models trained with this strategy generalize better than those trained with standard SGD or a popular baseline sampling strategy, across various learning rates, batch sizes, and distance metrics.

What carries the argument

A submodular function that scores candidate batches by jointly rewarding sample informativeness and subset diversity, solved approximately by a greedy algorithm.

If this is right

  • Models reach higher test accuracy than with random batch selection in SGD.
  • The improvement appears across different learning rates and batch sizes.
  • The gains persist when different distance metrics are used to compare samples.
  • The greedy algorithm supplies a practical way to solve the otherwise intractable selection problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection rule might let a network reach a target accuracy level after fewer total gradient steps.
  • Because the method is defined only in terms of a scoring function on samples, it could be tried on other iterative training loops that also use batches, such as in reinforcement learning or language model fine-tuning.
  • If the greedy step remains fast on very large datasets, the approach could become a standard preprocessing stage before each epoch rather than a one-time choice.

Load-bearing premise

The chosen submodular function actually captures both the informativeness of individual samples and the diversity of the selected subset.

What would settle it

Train identical network architectures on identical data using the submodular batch selector versus ordinary random selection and check whether final test accuracy is higher for the submodular method.

Figures

Figures reproduced from arXiv: 1906.08771 by K J Joseph, Krishnakant Singh, Vamshi Teja R, Vineeth N Balasubramanian.

Figure 1
Figure 1. Figure 1: Comparison of the proposed SMDL, with SGD and loss based sampling scheme on SVHN, CIFAR-10 and CIFAR-100 datasets. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The figure shows the ablation results on SVHN dataset. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: The figure shows the ablation results on SVHN dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: The graph brings out the importance of each of the term in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: The figure shows how the training error and loss varies over epochs on different datasets. It is very interesting to note that on all [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
read the original abstract

Mini-batch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a mini-batch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and diversity of the whole subset. We design an efficient, greedy algorithm which can give high-quality solutions to this NP-hard combinatorial optimization problem. Our extensive experiments on standard datasets show that the deep models trained using the proposed batch selection strategy provide better generalization than Stochastic Gradient Descent as well as a popular baseline sampling strategy across different learning rates, batch sizes, and distance metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a submodular function maximization approach for mini-batch selection during DNN training. The formulation is designed to jointly capture per-sample informativeness and subset diversity; an efficient greedy algorithm is presented for the resulting NP-hard problem. Extensive experiments on standard datasets are reported to show that models trained with the proposed selection strategy achieve better generalization than vanilla SGD and a popular baseline sampler, with the advantage holding across learning rates, batch sizes, and distance metrics.

Significance. If the empirical claims are robust, the work would supply a principled, optimization-based alternative to uniform random sampling that could improve training outcomes without increasing computational cost. The submodular framing is attractive because it inherits standard greedy approximation guarantees, which is a concrete strength relative to purely heuristic batch-selection heuristics.

major comments (2)
  1. [Experiments section] Experiments section (and abstract): the central claim of consistent outperformance in generalization is presented without any report of multiple random seeds, error bars, variance across runs, or statistical significance tests. Because DNN training is highly stochastic, single-run accuracy differences across LR/batch-size settings cannot reliably support the claim that the submodular strategy is superior.
  2. [Section 3] Section 3 (submodular formulation): the weakest modeling assumption—that the chosen submodular objective simultaneously encodes both informativeness and diversity—is asserted but not supported by any ablation that isolates the contribution of each term or compares against variants that omit one component.
minor comments (2)
  1. [Section 3] Notation for the submodular function and the distance metric used inside it should be introduced once and used consistently; several symbols appear to be redefined in passing.
  2. [Abstract] The abstract states 'extensive experiments' yet supplies no concrete dataset names, model architectures, or evaluation metrics; these details belong in the abstract or a dedicated experimental summary paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate the changes planned for the revised manuscript.

read point-by-point responses
  1. Referee: [Experiments section] Experiments section (and abstract): the central claim of consistent outperformance in generalization is presented without any report of multiple random seeds, error bars, variance across runs, or statistical significance tests. Because DNN training is highly stochastic, single-run accuracy differences across LR/batch-size settings cannot reliably support the claim that the submodular strategy is superior.

    Authors: We agree that the absence of multiple random seeds, error bars, and statistical tests weakens the reliability of the generalization claims, given the stochasticity of DNN training. In the revision we will rerun the main experiments (across the reported learning rates and batch sizes) with at least five independent random seeds, report mean test accuracies together with standard deviations, and include paired t-tests or Wilcoxon tests to assess statistical significance of the observed differences. revision: yes

  2. Referee: [Section 3] Section 3 (submodular formulation): the weakest modeling assumption—that the chosen submodular objective simultaneously encodes both informativeness and diversity—is asserted but not supported by any ablation that isolates the contribution of each term or compares against variants that omit one component.

    Authors: The objective is explicitly defined as the sum of a coverage (diversity) term and a modular (informativeness) term; this additive construction is the standard mechanism by which submodular functions jointly encode both properties. Nevertheless, we acknowledge that an explicit ablation isolating each component would strengthen the modeling justification. We will add this ablation study in the revised Section 3, comparing the full objective against the diversity-only and informativeness-only variants on the same datasets. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method and claims are independently formulated and empirically validated.

full rationale

The paper presents a submodular objective that combines per-sample informativeness with subset diversity, solved via a standard greedy algorithm for the NP-hard problem. No equations or claims reduce by construction to fitted parameters renamed as predictions, self-definitions, or load-bearing self-citations. The derivation chain (formulation → algorithm → experiments) is self-contained against external benchmarks; generalization improvements are asserted via direct comparison to SGD and a baseline on standard datasets, without circular reduction. This is the expected honest non-finding for an applied methods paper whose core contribution is a new objective rather than a derived identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach assumes standard properties of submodular functions allow effective capture of sample properties and that the greedy algorithm yields high-quality solutions for the NP-hard problem.

axioms (1)
  • standard math Submodular functions admit efficient greedy maximization with approximation guarantees.
    Invoked implicitly to justify the efficient algorithm design.

pith-pipeline@v0.9.0 · 5633 in / 926 out tokens · 18597 ms · 2026-05-25T19:28:00.518415+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 9 internal anchors

  1. [1]

    Variance Reduction in SGD by Distributed Importance Sampling

    Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. Variance reduction in sgd by distributed importance sampling. arXiv preprint arXiv:1511.06481 , 2015

  2. [2]

    Katyusha: The first direct acceleration of stochastic gradient methods

    Zeyuan Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. The Journal of Machine Learning Research , 18(1):8194--8244, 2017

  3. [3]

    Subset replay based continual learning for scalable improvement of autonomous systems

    Pratik Prabhanjan Brahma and Adrienne Othon. Subset replay based continual learning for scalable improvement of autonomous systems. In CVPR Workshops , pages 1179--11798. IEEE, 2018

  4. [4]

    Adaptive batch mode active learning

    Shayok Chakraborty, Vineeth Balasubramanian, and Sethuraman Panchanathan. Adaptive batch mode active learning. IEEE TNNLS , 26(8):1747--1760, 2015

  5. [5]

    Active bias: Training more accurate neural networks by emphasizing high variance samples

    Haw-Shiuan Chang, Erik Learned-Miller, and Andrew McCallum. Active bias: Training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems , pages 1002--1012, 2017

  6. [6]

    Algorithms for subset selection in linear regression

    Abhimanyu Das and David Kempe. Algorithms for subset selection in linear regression. In ACM STOC , pages 45--54. ACM, 2008

  7. [7]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR , pages 770--778, 2016

  8. [8]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015

  9. [9]

    Accelerating stochastic gradient descent using predictive variance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS , pages 315--323, 2013

  10. [10]

    Biased Importance Sampling for Deep Neural Network Training

    Angelos Katharopoulos and Fran c ois Fleuret. Biased importance sampling for deep neural network training. arXiv preprint arXiv:1706.00043 , 2017

  11. [11]

    Not all samples are created equal: Deep learning with importance sampling

    Angelos Katharopoulos and Fran c ois Fleuret. Not all samples are created equal: Deep learning with importance sampling. arXiv:1803.00942 , 2018

  12. [12]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014

  13. [13]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009

  14. [14]

    Efficient mini-batch training for stochastic optimization

    Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining , pages 661--670. ACM, 2014

  15. [15]

    Fast DPP Sampling for Nystr\"om with Application to Kernel Methods

    Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast dpp sampling for nystr " om with application to kernel methods. arXiv preprint arXiv:1603.06052 , 2016

  16. [16]

    A class of submodular functions for document summarization

    Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In ACL , pages 510--520. Association for Computational Linguistics, 2011

  17. [17]

    Online batch selection for faster training of neural networks

    Ilya Loshchilov and Frank Hutter. Online batch selection for faster training of neural networks. ICLR Workshops , 2015

  18. [18]

    Accelerated greedy algorithms for maximizing submodular set functions

    Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques , pages 234--243. Springer, 1978

  19. [19]

    Distributed submodular maximization: Identifying representative elements in massive data

    Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems , pages 2049--2057, 2013

  20. [20]

    Lazier than lazy greedy

    Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondr \'a k, and Andreas Krause. Lazier than lazy greedy. In AAAI , 2015

  21. [21]

    An analysis of approximations for maximizing submodular set functions—i

    George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming , 14(1):265--294, 1978

  22. [22]

    Reading digits in natural images with unsupervised feature learning

    Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop , page 5, 2011

  23. [23]

    Automatic differentiation in pytorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W , 2017

  24. [24]

    Greedy sensor selection: Leveraging submodularity

    Manohar Shamaiah, Siddhartha Banerjee, and Haris Vikalo. Greedy sensor selection: Leveraging submodularity. In 49th IEEE conference on decision and control (CDC) , pages 2572--2577. IEEE, 2010

  25. [25]

    Submodular importance sampling for neural network training

    Krishna Kant Singh and Vineeth N Balasubramanian. Submodular importance sampling for neural network training. Master's thesis, Indian Institute of Technology Hyderabad, 2018

  26. [26]

    Self-Paced Learning with Adaptive Deep Visual Embeddings

    Vithursan Thangarasa and Graham W Taylor. Self-paced learning with adaptive deep visual embeddings. arXiv preprint arXiv:1807.09200 , 2018

  27. [27]

    Submodular subset selection for large-scale speech training data

    Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris Bartels, and Jeff Bilmes. Submodular subset selection for large-scale speech training data. In ICASSP , pages 3311--3315. IEEE, 2014

  28. [28]

    Submodularity in data subset selection and active learning

    Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in data subset selection and active learning. In ICML , pages 1954--1963, 2015

  29. [29]

    Determinantal Point Processes for Mini-Batch Diversification

    Cheng Zhang, Hedvig Kjellstrom, and Stephan Mandt. Determinantal point processes for mini-batch diversification. arXiv preprint arXiv:1705.00607 , 2017

  30. [30]

    Active Mini-Batch Sampling using Repulsive Point Processes

    Cheng Zhang, Cengiz \"O ztireli, Stephan Mandt, and Giampiero Salvi. Active mini-batch sampling using repulsive point processes. arXiv:1804.02772 , 2018

  31. [31]

    Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling

    Peilin Zhao and Tong Zhang. Accelerating minibatch stochastic gradient descent using stratified sampling. arXiv preprint arXiv:1405.3080 , 2014

  32. [32]

    Stochastic optimization with importance sampling for regularized loss minimization

    Peilin Zhao and Tong Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML , pages 1--9, 2015

  33. [33]

    Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity

    Tianyi Zhou and Jeff Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In ICLR , 2018

  34. [34]

    write newline

    " write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...