Measuring the Data Efficiency of Deep Learning Methods

Alberto N. Escalante-B.; Hlynur Davíð Hlynsson; Laurenz Wiskott

arXiv: 1907.02549 · v1 · pith:2XYEB6NVnew · submitted 2019-07-03 · 💻 cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-25 09:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML

keywords data efficiency, deep learning, convolutional neural networks, slow feature analysis, MNIST, Omniglot, classification, transfer learning

The pith

HiGSFA matches or exceeds CNN performance on MNIST when trained with only 50 or 200 examples per class.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a controlled protocol that trains both convolutional networks and hierarchical slow-feature analysis on successively smaller subsets of MNIST and Omniglot. It measures classification accuracy and transfer-learning performance as a function of training-set size. In the two smallest MNIST regimes the slow-feature method records higher accuracy; in all other tested conditions the convolutional networks are stronger. The pattern indicates that locally optimal bottom-up construction can rival or surpass global gradient descent when data are scarce. Anyone studying sample-efficient learning therefore needs to know whether this ordering reverses under different tuning or architecture choices.

Core claim

When both families of models are trained on 50 and 200 samples per class, HiGSFA achieves higher MNIST classification accuracy than standard CNNs; CNNs are superior at larger subset sizes and on the Omniglot tasks examined. The comparison therefore supplies concrete cases in which greedy, locally optimal bottom-up learning is at least as powerful as global gradient-based learning.

What carries the argument

The new experimental protocol that fixes training-set sizes, applies identical evaluation splits, and enforces comparable hyperparameter search effort across CNN and HiGSFA models.
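
A minimal sketch of how such a size-ladder comparison could be organized, assuming each method is wrapped as a function that trains on a subset and returns test accuracy. The names `Learner` and `data_efficiency_curves` are hypothetical and do not come from the paper; the point illustrated is only that every method sees the same training subsets and the same test set at every size.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types, introduced only for illustration.
Example = Tuple[object, int]  # (input, class label)
# A learner trains on the first argument and returns accuracy on the second.
Learner = Callable[[Sequence[Example], Sequence[Example]], float]


def data_efficiency_curves(
    learners: Dict[str, Learner],
    subsets_by_size: Dict[int, Sequence[Example]],
    test_set: Sequence[Example],
) -> Dict[str, List[Tuple[int, float]]]:
    """Evaluate every learner on identical training subsets and one shared test set.

    Returns, per learner name, a list of (samples per class, test accuracy)
    pairs ordered by increasing subset size.
    """
    curves: Dict[str, List[Tuple[int, float]]] = {name: [] for name in learners}
    for size in sorted(subsets_by_size):
        train = subsets_by_size[size]  # the same subset is handed to every learner
        for name, fit_and_score in learners.items():
            curves[name].append((size, fit_and_score(train, test_set)))
    return curves
```

Wrapping each method behind the same callable signature keeps the comparison logic independent of whether the model is a CNN, HiGSFA, or any later addition to the ladder.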

If this is right

For MNIST classification the advantage of bottom-up hierarchical construction appears only below a few hundred examples per class.
Transfer-learning performance follows a different size-dependent pattern than pure classification on the same data.
Any claim that end-to-end gradient methods are universally superior must be qualified by training-set size.
The protocol itself can be reused to test additional architectures on the same size ladder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Running the identical protocol on CIFAR-10 or ImageNet subsets would reveal whether the low-data crossover is dataset-specific.
Hybrid models that insert slow-feature layers inside a CNN might combine the strengths observed at different data regimes.
If the ordering persists under stricter regularization or different optimizers, it would constrain theories that attribute all low-data gains to global optimization.

Load-bearing premise

The protocol produces an unbiased head-to-head comparison because hyperparameter tuning effort and implementation details do not systematically favor either algorithm.
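
One way to make this premise checkable is to record the tuning effort per method in a small structured ledger. The sketch below is an assumption about what such a record might contain; the field names, the `budgets_comparable` helper, and the default ratio of 2.0 are hypothetical choices, not values reported by the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TuningBudget:
    """Hypothetical record of the tuning effort spent on one method."""
    method: str
    hyperparameter_ranges: Dict[str, Tuple[float, float]]
    configurations_evaluated: int
    validation_examples: int
    stopping_rule: str


def budgets_comparable(a: TuningBudget, b: TuningBudget, max_ratio: float = 2.0) -> bool:
    """Return True if the two budgets look roughly matched.

    The criterion is an illustrative assumption: both methods use the same
    validation-set size, and neither evaluates more than `max_ratio` times
    as many configurations as the other.
    """
    fewer, more = sorted((a.configurations_evaluated, b.configurations_evaluated))
    if fewer <= 0:
        return False
    same_validation = a.validation_examples == b.validation_examples
    return same_validation and more / fewer <= max_ratio
```

A check like this does not prove fairness, since implementation details can still favor one method, but it turns an unstated assumption into something a reader can inspect.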

What would settle it

An independent re-run of the 50-sample MNIST experiment in which a carefully tuned CNN reaches higher accuracy than the reported HiGSFA figure would falsify the central ordering.

Original abstract

In this paper, we propose a new experimental protocol and use it to benchmark the data efficiency --- performance as a function of training set size --- of two deep learning algorithms, convolutional neural networks (CNNs) and hierarchical information-preserving graph-based slow feature analysis (HiGSFA), for tasks in classification and transfer learning scenarios. The algorithms are trained on different-sized subsets of the MNIST and Omniglot data sets. HiGSFA outperforms standard CNN networks when the models are trained on 50 and 200 samples per class for MNIST classification. In other cases, the CNNs perform better. The results suggest that there are cases where greedy, locally optimal bottom-up learning is equally or more powerful than global gradient-based learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's new protocol for data-efficiency curves is useful, but the HiGSFA win on 50- and 200-sample MNIST rests on an unverified assumption of equal hyperparameter effort.

The main takeaway is straightforward: the authors define a protocol that plots accuracy against training-set size and report that HiGSFA beats a standard CNN on MNIST classification at 50 and 200 examples per class, while the CNN wins elsewhere. That single comparison is the concrete result readers will notice first.

The protocol itself is the clearest addition. Measuring performance as a direct function of data volume on the same task gives a practical way to compare methods in low-data regimes, and the paper applies it to both classification and transfer on MNIST and Omniglot. The measurements are empirical and non-circular, which keeps the claim grounded in held-out test performance. The work is honest about its scope and does not overclaim broader implications.

The soft spot is exactly the one flagged in the stress test. The abstract supplies no numbers on hyperparameter search budget, number of trials, or validation procedure for either method. If the CNN received a fixed off-the-shelf architecture while HiGSFA benefited from more targeted tuning, the reported gap at the smallest sizes could be an artifact rather than a property of greedy versus gradient learning. There is also no mention of variance across runs or statistical tests, so the numerical differences lack the usual error bars that would let a reader judge reliability.

This paper is aimed at people who build or select algorithms for small-data settings and want a simple scaling plot to guide choices. A reader already working on data-efficiency benchmarks would get a usable protocol and a data point worth checking. The central claim is interesting enough to deserve referee time, but only if the authors can document that tuning effort was comparable; without that, the result stays too fragile for strong conclusions.

Referee Report

2 major / 2 minor

Summary. The paper introduces a new experimental protocol for measuring data efficiency (performance vs. training set size) and applies it to compare convolutional neural networks (CNNs) against hierarchical information-preserving graph-based slow feature analysis (HiGSFA) on classification and transfer learning tasks using subsets of MNIST and Omniglot. It reports that HiGSFA outperforms standard CNNs at 50 and 200 samples per class on MNIST classification while CNNs are superior in other regimes, and interprets this as evidence that greedy bottom-up learning can match or exceed global gradient-based optimization in some low-data settings.

Significance. If the comparison is shown to be unbiased, the result would be significant because it supplies concrete empirical counter-examples to the prevailing view that end-to-end gradient descent is always superior for deep architectures; it thereby opens a line of inquiry into when and why locally optimal hierarchical methods can be more data-efficient.

major comments (2)

[Experimental protocol] Experimental protocol section: the manuscript states that both algorithms were evaluated under the proposed protocol but supplies no quantitative information on hyperparameter search budget, number of trials, validation-set size, or stopping criteria used for CNNs versus HiGSFA. Because the central claim rests on HiGSFA outperforming CNNs at 50/200 samples per class, the absence of evidence that tuning effort was equivalent leaves open the possibility that the reported gap reflects unequal optimization rather than intrinsic algorithmic properties.
[Results] Results section (tables/figures reporting MNIST accuracies): no error bars, standard deviations across random seeds, or statistical significance tests are provided for the performance numbers. Without these, it is impossible to determine whether the claimed superiority of HiGSFA at the two smallest sample sizes is robust or could be explained by run-to-run variance.
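
As an illustration of the kind of variability reporting requested here, the following sketch summarizes repeated runs as a mean and a sample standard deviation. It assumes the accuracies come from independent runs with different random seeds; the function names are hypothetical and no values from the paper are used.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of accuracies from repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def format_with_error_bar(accuracies: Sequence[float]) -> str:
    """Format repeated-run accuracies as 'mean ± std' for a results table."""
    mean, std = mean_and_std(accuracies)
    return f"{mean:.3f} ± {std:.3f} (n={len(accuracies)})"
```

Reporting the number of runs alongside the spread matters at 50 samples per class, where the particular subset drawn can plausibly move accuracy as much as the choice of method.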

minor comments (2)

[Experimental protocol] The data-splitting procedure (how the N-sample-per-class subsets were drawn and whether the same splits were used for both methods) is not described in sufficient detail to allow exact reproduction.
[Methods] The abstract and results text refer to “standard CNN networks” without specifying the exact architectures, regularization strengths, or optimization settings employed; a table listing these choices would improve clarity.
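
A reproducible description of the splitting step could be as short as the sketch below, which draws exactly N examples per class with a fixed seed so that both methods can be given the same indices. This is an assumed procedure for illustration; the paper's actual sampling scheme is not described in the material reviewed here, and `per_class_subset` is a hypothetical name.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def per_class_subset(labels: Sequence[Hashable], n_per_class: int, seed: int) -> List[int]:
    """Return sorted dataset indices holding exactly n_per_class examples of each label."""
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # private generator, so global random state is untouched
    chosen: List[int] = []
    # Visit classes in a fixed order so the draw depends only on the seed.
    for label in sorted(by_class, key=repr):
        indices = by_class[label]
        if len(indices) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} examples")
        chosen.extend(rng.sample(indices, n_per_class))
    return sorted(chosen)
```

Publishing the seed and the resulting index lists would let a reader rerun either method on exactly the subsets used for the reported numbers.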

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments on the experimental protocol and results presentation. We address each point below and commit to revisions that strengthen the transparency and robustness of the reported comparisons.

Referee: [Experimental protocol] Experimental protocol section: the manuscript states that both algorithms were evaluated under the proposed protocol but supplies no quantitative information on hyperparameter search budget, number of trials, validation-set size, or stopping criteria used for CNNs versus HiGSFA. Because the central claim rests on HiGSFA outperforming CNNs at 50/200 samples per class, the absence of evidence that tuning effort was equivalent leaves open the possibility that the reported gap reflects unequal optimization rather than intrinsic algorithmic properties.

Authors: We agree that the manuscript lacks sufficient quantitative detail on hyperparameter tuning to fully demonstrate equivalence of optimization effort. In the revised version we will add an explicit subsection (or table) in the Experimental protocol section that reports, for each method: the hyperparameter ranges explored, the total number of configurations evaluated, the size of the validation set used for selection, and the stopping criteria. This addition will allow readers to assess whether the tuning budgets were comparable and will directly address the concern that the observed advantage at small sample sizes could stem from unequal optimization rather than algorithmic differences.
Revision: yes

Referee: [Results] Results section (tables/figures reporting MNIST accuracies): no error bars, standard deviations across random seeds, or statistical significance tests are provided for the performance numbers. Without these, it is impossible to determine whether the claimed superiority of HiGSFA at the two smallest sample sizes is robust or could be explained by run-to-run variance.

Authors: The referee correctly notes the absence of variability measures. We will revise the Results section and associated figures/tables to report means and standard deviations computed over multiple independent runs that differ in random seed (both for data subset sampling and for any stochastic components of training). Where appropriate we will also include pairwise statistical significance tests. These additions will make it possible to evaluate whether the HiGSFA advantage at 50 and 200 samples per class is statistically reliable.
Revision: yes
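
The pairwise test mentioned in this response could take several forms. The sketch below shows one option, an exact paired sign-flip permutation test, chosen here because it needs no distributional assumptions and works with a handful of seeds. It assumes run i of both methods used the same training subset, so that differences are properly paired; the function name is hypothetical and the choice of test is not the authors'.

```python
from itertools import product
from typing import Sequence


def paired_sign_flip_p_value(acc_a: Sequence[float], acc_b: Sequence[float]) -> float:
    """Exact two-sided p-value for the mean paired difference.

    Under the null hypothesis that the two methods are interchangeable, each
    paired difference is equally likely to carry either sign, so every one of
    the 2**n sign assignments is enumerated. Cost grows exponentially with n,
    which is acceptable for the small number of seeds typical of such studies.
    """
    if len(acc_a) != len(acc_b) or len(acc_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs))
    tolerance = 1e-12  # guards against floating-point ties being missed
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - tolerance:
            extreme += 1
    return extreme / total
```

With n paired runs the smallest attainable two-sided p-value is 2 / 2**n, so at least six runs are needed before a result can fall below 0.05.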

Circularity Check

0 steps flagged

No circularity: empirical benchmarking with direct held-out measurements

The paper presents an experimental protocol and reports performance numbers obtained by training HiGSFA and CNN models on subsets of MNIST and Omniglot then evaluating on held-out test data. No equations, first-principles derivations, or predictions appear; the central claims are direct empirical measurements rather than quantities that reduce to fitted parameters or self-citations by construction. The protocol description and results therefore stand as self-contained external benchmarks with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmark study that relies on standard machine-learning assumptions about dataset suitability and fair implementation; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)

Domain assumption: MNIST and Omniglot are representative benchmarks for evaluating data efficiency in classification and transfer learning.
The abstract applies the protocol directly to these datasets without further justification.

