Measuring the Data Efficiency of Deep Learning Methods
Pith reviewed 2026-05-25 09:52 UTC · model grok-4.3
The pith
HiGSFA outperforms standard CNNs on MNIST classification when both are trained on only 50 or 200 examples per class.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When both families of models are trained on 50 or 200 samples per class, HiGSFA achieves higher MNIST classification accuracy than standard CNNs; CNNs are superior at larger subset sizes and on the Omniglot tasks examined. The comparison therefore supplies concrete cases in which greedy, locally optimal bottom-up learning is at least as powerful as global gradient-based learning.
What carries the argument
The new experimental protocol that fixes training-set sizes, applies identical evaluation splits, and enforces comparable hyperparameter search effort across CNN and HiGSFA models.
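One way to read "fixes training-set sizes" and "identical evaluation splits" is that each per-class subset is drawn once and then handed to both models. The sketch below illustrates that idea in Python under stated assumptions: the function names `sample_per_class` and `build_size_ladder`, the seeding scheme, and the toy label vector are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def sample_per_class(labels, n_per_class, seed):
    """Return sorted indices holding exactly n_per_class examples of each label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # private RNG so the draw is reproducible
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(pool)} examples")
        chosen.extend(rng.sample(pool, n_per_class))
    return sorted(chosen)


def build_size_ladder(labels, sizes, seed):
    """Map each per-class size to one fixed index subset, reused by every model."""
    return {n: sample_per_class(labels, n, seed + n) for n in sizes}


# Toy stand-in for a label vector: 10 classes, 300 examples each (hypothetical).
toy_labels = [i % 10 for i in range(3000)]
ladder = build_size_ladder(toy_labels, sizes=[50, 200], seed=0)
```

Because the draw depends only on the labels, the size, and the seed, a CNN run and a HiGSFA run that call `build_size_ladder` with the same arguments train on the same examples. This sketch draws each size independently, so the 50-sample subset is not guaranteed to be contained in the 200-sample subset; whether the paper nests its subsets is not stated in this review.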
If this is right
- For MNIST classification, the advantage of bottom-up hierarchical construction appears only at the two smallest training sets tested (50 and 200 examples per class).
- Transfer-learning performance follows a different size-dependent pattern than pure classification on the same data.
- Any claim that end-to-end gradient methods are universally superior must be qualified by training-set size.
- The protocol itself can be reused to test additional architectures on the same size ladder.
Where Pith is reading between the lines
- Running the identical protocol on CIFAR-10 or ImageNet subsets would reveal whether the low-data crossover is dataset-specific.
- Hybrid models that insert slow-feature layers inside a CNN might combine the strengths observed at different data regimes.
- If the ordering persists under stricter regularization or different optimizers, it would constrain theories that attribute all low-data gains to global optimization.
Load-bearing premise
The protocol produces an unbiased head-to-head comparison because hyperparameter tuning effort and implementation details do not systematically favor either algorithm.
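A minimal way to make tuning effort auditable is to give both methods the same number of search trials and to keep the full trial log. The following Python sketch assumes random search with a fixed budget; the paper's actual tuning procedure is not described in this review, and the hyperparameter names, search spaces, budget, and scoring functions below are hypothetical placeholders.

```python
import random


def random_search(train_and_score, space, n_trials, seed):
    """Evaluate exactly n_trials random configurations; return best and full log."""
    rng = random.Random(seed)
    log = []
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        log.append((train_and_score(config), config))
    best_score, best_config = max(log, key=lambda item: item[0])
    return best_score, best_config, log


BUDGET = 20  # same number of trials for both methods (hypothetical value)

# Hypothetical search spaces; the names are illustrative, not the paper's.
cnn_space = {"learning_rate": [0.1, 0.01, 0.001], "dropout": [0.0, 0.25, 0.5]}
higsfa_space = {"output_dim": [20, 40, 80], "expansion_degree": [1, 2, 3]}


def dummy_cnn_score(config):
    # Placeholder objective, NOT a real validation accuracy.
    return -abs(config["learning_rate"] - 0.01) - abs(config["dropout"] - 0.25)


def dummy_higsfa_score(config):
    # Placeholder objective, NOT a real validation accuracy.
    return -abs(config["output_dim"] - 40) / 100 - abs(config["expansion_degree"] - 2)


cnn_best = random_search(dummy_cnn_score, cnn_space, BUDGET, seed=1)
higsfa_best = random_search(dummy_higsfa_score, higsfa_space, BUDGET, seed=1)
```

In a real study the dummy scoring functions would be replaced by training on the fixed subset and scoring on a held-out validation set. Publishing the returned log for each method would let a reader check that neither side received more trials than the other.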
What would settle it
An independent re-run of the 50-sample MNIST experiment in which a carefully tuned CNN reaches higher accuracy than the reported HiGSFA figure would falsify the central ordering.
Original abstract
In this paper, we propose a new experimental protocol and use it to benchmark the data efficiency --- performance as a function of training set size --- of two deep learning algorithms, convolutional neural networks (CNNs) and hierarchical information-preserving graph-based slow feature analysis (HiGSFA), for tasks in classification and transfer learning scenarios. The algorithms are trained on different-sized subsets of the MNIST and Omniglot data sets. HiGSFA outperforms standard CNN networks when the models are trained on 50 and 200 samples per class for MNIST classification. In other cases, the CNNs perform better. The results suggest that there are cases where greedy, locally optimal bottom-up learning is equally or more powerful than global gradient-based learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a new experimental protocol for measuring data efficiency (performance vs. training set size) and applies it to compare convolutional neural networks (CNNs) against hierarchical information-preserving graph-based slow feature analysis (HiGSFA) on classification and transfer learning tasks using subsets of MNIST and Omniglot. It reports that HiGSFA outperforms standard CNNs at 50 and 200 samples per class on MNIST classification while CNNs are superior in other regimes, and interprets this as evidence that greedy bottom-up learning can match or exceed global gradient-based optimization in some low-data settings.
Significance. If the comparison is shown to be unbiased, the result would be significant because it supplies concrete empirical counter-examples to the prevailing view that end-to-end gradient descent is always superior for deep architectures; it thereby opens a line of inquiry into when and why locally optimal hierarchical methods can be more data-efficient.
major comments (2)
- [Experimental protocol] Experimental protocol section: the manuscript states that both algorithms were evaluated under the proposed protocol but supplies no quantitative information on hyperparameter search budget, number of trials, validation-set size, or stopping criteria used for CNNs versus HiGSFA. Because the central claim rests on HiGSFA outperforming CNNs at 50/200 samples per class, the absence of evidence that tuning effort was equivalent leaves open the possibility that the reported gap reflects unequal optimization rather than intrinsic algorithmic properties.
- [Results] Results section (tables/figures reporting MNIST accuracies): no error bars, standard deviations across random seeds, or statistical significance tests are provided for the performance numbers. Without these, it is impossible to determine whether the claimed superiority of HiGSFA at the two smallest sample sizes is robust or could be explained by run-to-run variance.
minor comments (2)
- [Experimental protocol] The data-splitting procedure (how the N-sample-per-class subsets were drawn and whether the same splits were used for both methods) is not described in sufficient detail to allow exact reproduction.
- [Methods] The abstract and results text refer to “standard CNN networks” without specifying the exact architectures, regularization strengths, or optimization settings employed; a table listing these choices would improve clarity.
Simulated Author's Rebuttal
We thank the referee for these constructive comments on the experimental protocol and results presentation. We address each point below and commit to revisions that strengthen the transparency and robustness of the reported comparisons.
- Referee: [Experimental protocol] Experimental protocol section: the manuscript states that both algorithms were evaluated under the proposed protocol but supplies no quantitative information on hyperparameter search budget, number of trials, validation-set size, or stopping criteria used for CNNs versus HiGSFA. Because the central claim rests on HiGSFA outperforming CNNs at 50/200 samples per class, the absence of evidence that tuning effort was equivalent leaves open the possibility that the reported gap reflects unequal optimization rather than intrinsic algorithmic properties.
  Authors: We agree that the manuscript lacks sufficient quantitative detail on hyperparameter tuning to fully demonstrate equivalence of optimization effort. In the revised version we will add an explicit subsection (or table) in the Experimental protocol section that reports, for each method: the hyperparameter ranges explored, the total number of configurations evaluated, the size of the validation set used for selection, and the stopping criteria. This addition will allow readers to assess whether the tuning budgets were comparable and will directly address the concern that the observed advantage at small sample sizes could stem from unequal optimization rather than algorithmic differences.
  Revision: yes
- Referee: [Results] Results section (tables/figures reporting MNIST accuracies): no error bars, standard deviations across random seeds, or statistical significance tests are provided for the performance numbers. Without these, it is impossible to determine whether the claimed superiority of HiGSFA at the two smallest sample sizes is robust or could be explained by run-to-run variance.
  Authors: The referee correctly notes the absence of variability measures. We will revise the Results section and associated figures/tables to report means and standard deviations computed over multiple independent runs that differ in random seed (both for data subset sampling and for any stochastic components of training). Where appropriate we will also include pairwise statistical significance tests. These additions will make it possible to evaluate whether the HiGSFA advantage at 50 and 200 samples per class is statistically reliable.
  Revision: yes
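The statistics promised in the rebuttal can be sketched as follows. This Python example computes the mean and sample standard deviation across seeds and an exact paired sign-flip permutation test; the choice of test is an assumption (the authors do not say which test they will use), and the accuracy values are made-up placeholders, not results from the paper.

```python
import itertools
import statistics


def summarize(scores):
    """Mean and sample standard deviation over independent runs."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on per-seed paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely
    # to have either sign, so enumerate every sign assignment.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Placeholder per-seed accuracies (NOT from the paper); index i shares seed i.
method_a = [0.90, 0.91, 0.92, 0.93, 0.94]
method_b = [0.88, 0.89, 0.90, 0.91, 0.92]

mean_a, std_a = summarize(method_a)
mean_b, std_b = summarize(method_b)
p_value = paired_sign_flip_pvalue(method_a, method_b)
```

Pairing by seed matters only if both methods really share the same subset draw for each seed. With five seeds the smallest achievable two-sided p-value of this exact test is 2/32 = 0.0625, so more runs would be needed before a conventional 0.05 threshold could be met at all.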
Circularity Check
No circularity: empirical benchmarking with direct held-out measurements
The paper presents an experimental protocol and reports performance numbers obtained by training HiGSFA and CNN models on subsets of MNIST and Omniglot then evaluating on held-out test data. No equations, first-principles derivations, or predictions appear; the central claims are direct empirical measurements rather than quantities that reduce to fitted parameters or self-citations by construction. The protocol description and results therefore stand as self-contained external benchmarks with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MNIST and Omniglot are representative benchmarks for evaluating data efficiency in classification and transfer learning.