arxiv: 2604.09389 · v1 · submitted 2026-04-10 · 💻 cs.LG · cs.CL

Recognition: unknown

Is More Data Worth the Cost? Dataset Scaling Laws in a Tiny Attention-Only Decoder

G\"otz-Henrik Wiegand , Lorena Raichle , Rico St\"adeli , Tomas Hrycej , Bernhard Bermeitinger , Siegfried Handschuh

Authors on Pith no claims yet

Pith reviewed 2026-05-10 16:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords dataset scaling lawsattention-only decoderdata efficiencydiminishing returnslanguage model trainingtransformer performancescaling behavior

0 comments

The pith

A tiny attention-only decoder reaches about 90% of full-data accuracy using only 30% of the training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a minimal attention-only decoder on power-of-two increasing fractions of a fixed dataset to measure performance as a function of data volume alone. Gains appear steadily at first but flatten, so that 30 percent of the examples already delivers roughly 90 percent of the token-level validation accuracy obtained from the complete set. This controlled isolation of dataset size supplies a concrete rule of thumb for deciding when extra data collection stops paying off. The result applies directly to settings where both compute and data are constrained.

Core claim

In the reduced attention-only decoder, validation token accuracy improves smoothly with dataset size yet shows pronounced diminishing returns, such that training on approximately 30 percent of the data suffices to reach about 90 percent of the accuracy achieved with the full training set.

What carries the argument

Progressively larger power-of-two subsets of the training data, which trace the scaling curve while the stripped-down decoder holds other architectural factors fixed.

If this is right

Performance follows predictable scaling-law shapes even in very small, component-isolated models.
The bulk of accuracy gains occurs early, so later additions of data yield progressively smaller improvements.
Compute budgets in restricted environments can be redirected from data collection to other uses once the 30-percent threshold is passed.
The observed diminishing returns supply a practical stopping criterion for dataset construction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same early-saturation pattern may appear in larger models, suggesting that targeted data pruning or quality filtering could substitute for raw volume in many cases.
The method could be reused to compare different data sources or curricula rather than simple random subsets.
Extending the power-of-two schedule to measure exact cost-benefit breakpoints for specific downstream tasks would make the guidance more actionable.

Load-bearing premise

The strongly reduced attention-only decoder isolates pure dataset-size effects without adding confounding behaviors that would not appear in standard full-scale models.

What would settle it

If a standard full-scale Transformer trained on the same power-of-two subsets required substantially more than 30 percent of the data to reach 90 percent of its own maximum accuracy, the isolation claim would be undermined.

Figures

Figures reproduced from arXiv: 2604.09389 by Bernhard Bermeitinger, G\"otz-Henrik Wiegand, Lorena Raichle, Rico St\"adeli, Siegfried Handschuh, Tomas Hrycej.

**Figure 2.** Figure 2: Jensen–Shannon divergence between token distributions of training subsets and the full dataset. The dashed curve shows the first derivative [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 4.** Figure 4: Training dynamics across dataset subset sizes [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Validation accuracy on AllTheNews2.0 dataset vs. subset tokens for fixed training budget 6.3 SCALING-LAW ANALYSIS OF PERFORMANCE LIMITS To place the experimental setting of our attention-only transformer in context, we relate our results on AllTheNews2.0 ( [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Empirical best validation CE loss per subset size compared to the Kaplan scaling-law [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

**Figure 7.** Figure 7: Cross-entropy loss on AllTheNews2.0 dataset under a fixed training budget. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Validation accuracy (top) and cross-entropy loss (bottom) on [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Training Transformer language models is expensive, as performance typically improves with increasing dataset size and computational budget. Although scaling laws describe this trend at large scale, their implications in controlled, smaller-scale settings remain less explored. In this work, we isolate dataset-size effects using a strongly reduced attention-only decoder architecture. By training on progressively larger power-of-two subsets, we observe smooth performance improvements accompanied by clear diminishing returns, consistent with scaling-law behavior. Using only about 30% of the training data is sufficient to reach approximately 90% of the full-data validation token-level accuracy. These results provide actionable insights into dataset scaling in a controlled, component-isolated setting and offer practical guidance for balancing dataset size and computational cost in compute- and data-restricted environments, such as small research labs and exploratory model development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper measures diminishing returns on data in a tiny attention-only decoder and lands on a practical 30/90 rule, but the minimal architecture raises questions about whether capacity rather than data is doing most of the work.

read the letter

The central observation is straightforward: with their reduced decoder, roughly 30% of the training data reaches about 90% of the full-set validation token accuracy, and the gains taper off in the expected way as subsets grow by powers of two. That gives a concrete number for people running small experiments who need to decide when extra data stops being worth the cost. They execute the comparison cleanly enough by holding the model fixed and varying only the data volume, which produces the smooth curve they describe. For labs that cannot afford large-scale runs, this kind of controlled measurement can be directly useful as a rule of thumb. The work stays empirical and does not claim new theory, which keeps the scope honest. The main limitation is that the architecture is deliberately stripped down. In such a low-capacity model, performance is likely bounded by parameter count and depth long before data volume becomes the dominant factor, so the early plateau could be an artifact of under-capacity rather than a general dataset-scaling signal. The abstract gives no model dimensions, layer count, dataset identity, or sampling procedure, which makes it impossible to judge how much the reduction itself shapes the result. If the full paper supplies those details and shows the pattern survives modest increases in model size, the finding becomes more credible; without them the claim stays narrow. This is mainly for practitioners doing exploratory work in constrained settings rather than for theorists or people scaling large models. The empirical execution looks solid on its own terms, so it is worth sending to referees who can check the missing controls and reproducibility details.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that in a strongly reduced attention-only decoder, training on progressively larger power-of-two subsets of data yields smooth performance gains with diminishing returns, such that approximately 30% of the full training data suffices to reach about 90% of the full-data validation token-level accuracy. The work positions this as an empirical isolation of dataset-size effects in a controlled small-scale setting.

Significance. If the central empirical ratio holds after details are supplied, the result supplies concrete, actionable guidance for data-efficient training in compute-limited environments. The direct measurement of held-out token accuracy (rather than a fitted functional form) is a methodological strength that avoids circularity.

major comments (2)

[Abstract] Abstract: the 30%/90% claim is stated without any supporting numbers, error bars, model dimensions (layers, heads, embedding size), dataset identity, tokenization scheme, subset-sampling procedure, or training hyperparameters. These omissions are load-bearing because they prevent assessment of whether the observed ratio is reproducible or architecture-specific.
[Abstract] Abstract: the premise that the 'strongly reduced attention-only decoder' isolates dataset-size effects is asserted without justification or controls. In a minimal-capacity model, performance is plausibly dominated by parameter count and depth rather than data volume, so the early onset of diminishing returns could be an artifact of under-capacity rather than a general dataset scaling law.

minor comments (1)

[Abstract] Abstract: the phrase 'clear diminishing returns, consistent with scaling-law behavior' would benefit from a brief quantitative illustration (e.g., accuracy delta between the 50% and 100% data regimes) to make the consistency claim concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the 30%/90% claim is stated without any supporting numbers, error bars, model dimensions (layers, heads, embedding size), dataset identity, tokenization scheme, subset-sampling procedure, or training hyperparameters. These omissions are load-bearing because they prevent assessment of whether the observed ratio is reproducible or architecture-specific.

Authors: The abstract is intentionally concise as a high-level summary. All requested details—model dimensions, dataset identity, tokenization, subset-sampling method, training hyperparameters, and supporting numbers with variability measures—are provided in the Methods and Experiments sections of the full manuscript. To make the central claim more self-contained, we will revise the abstract to include a brief specification of the model architecture, dataset, and experimental procedure. revision: yes
Referee: [Abstract] Abstract: the premise that the 'strongly reduced attention-only decoder' isolates dataset-size effects is asserted without justification or controls. In a minimal-capacity model, performance is plausibly dominated by parameter count and depth rather than data volume, so the early onset of diminishing returns could be an artifact of under-capacity rather than a general dataset scaling law.

Authors: The strongly reduced attention-only decoder was deliberately chosen to hold architecture, depth, and parameter count fixed while varying only dataset size, thereby isolating the effect of data volume. This controlled design is justified in the introduction as a means to study dataset scaling in a minimal, reproducible setting relevant to compute-limited environments. We agree that capacity constraints may influence the exact point of diminishing returns and will add an expanded justification plus a limitations discussion in the introduction to clarify the scope and generalizability of the findings. revision: partial

Circularity Check

0 steps flagged

No circularity; direct empirical measurements of observed accuracy

full rationale

The paper reports direct training runs on power-of-two data subsets of a fixed tiny attention-only decoder, measuring validation token-level accuracy as an observed outcome. The central claim (approximately 30% data reaching 90% of full-data accuracy) is presented as an empirical result, not as a prediction derived from any equation, fitted parameter, or self-citation chain. No load-bearing step reduces to its own inputs by construction; the work contains no derivations, uniqueness theorems, or ansatzes that could introduce circularity. This is the most common honest finding for purely observational scaling studies.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that power-of-two data subsets are representative and that the reduced architecture does not alter the scaling behavior in ways that invalidate the comparison to full data.

axioms (1)

domain assumption Power-of-two data subsets allow fair isolation of dataset-size effects without selection bias or distribution shift.
Invoked by the choice of progressively larger subsets described in the abstract.

pith-pipeline@v0.9.0 · 5462 in / 1174 out tokens · 30678 ms · 2026-05-10T16:38:31.452967+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

11 extracted references · 10 canonical work pages · 3 internal anchors

[1]

URLhttp://arxiv.org/ abs/2410.13732

doi: 10.5220/0012891000003838. URLhttp://arxiv.org/ abs/2410.13732. arXiv:2410.13732 [cs]. Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, and Jingang Wang. Revisiting Scaling Laws for Language Models: The Role of Data Qual- ity and Training Strategies. InProceedings of the 63rd Annual Meeting of the Association for ...

work page doi:10.5220/0012891000003838
[2]

doi: 10.18653/v1/2025.acl-long.1163

Association for Computational Linguistics. doi: 10.18653/v1/2025.acl-long.1163. URL https://aclanthology.org/2025.acl-long.1163. Danny Hernandez, Tom Brown, Tom Conerly, Nova DasSarma, Dawn Drain, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Tom Henighan, Tristan Hume, Scott Johnston, Ben Mann, Chris Olah, Catherine Olsson, Dario Amodei, Nicholas Jo...

work page doi:10.18653/v1/2025.acl-long.1163 2025
[3]

Computing Research Repository , eprint=

URL http://arxiv.org/abs/2205.10487. arXiv:2205.10487 [cs]. Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan...

work page arXiv
[4]

Training Compute-Optimal Large Language Models

URLhttp: //arxiv.org/abs/2203.15556. arXiv:2203.15556 [cs]. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling Laws for Neural Language Models, January

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Scaling Laws for Neural Language Models

URLhttp://arxiv.org/abs/2001.08361. arXiv:2001.08361 [cs]. J. Lin. Divergence measures based on the Shannon entropy.IEEE Transactions on Information Theory, 37(1):145–151, January

work page internal anchor Pith review Pith/arXiv arXiv 2001
[6]

, year =

ISSN 00189448. doi: 10.1109/18.61115. URLhttp: //ieeexplore.ieee.org/document/61115/. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer Sentinel Mixture Models, September

work page doi:10.1109/18.61115
[7]

Pointer Sentinel Mixture Models

URLhttps://arxiv.org/abs/1609.07843. Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Noua- mane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling Data-Constrained Language Models, June

work page internal anchor Pith review arXiv
[8]

M., B ARAK , B., S CAO, T

URLhttp://arxiv.org/abs/2305.16264. arXiv:2305.16264 [cs]. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Lan- guage Models are Unsupervised Multitask Learners.OpenAI Blog,

work page arXiv
[9]

URLhttp://arxiv.org/ abs/2410.16523

doi: 10.5220/0012893600003838. URLhttp://arxiv.org/ abs/2410.16523. arXiv:2410.16523 [cs]. Andrew Thompson. All the news 2.0, March

work page doi:10.5220/0012893600003838
[10]

URLhttp://arxiv.org/abs/2305. 12816. arXiv:2305.12816 [cs]. 11 Presented as a paper at 3rd DATA-FM workshop @ ICLR 2026, Brazil. A DATASETSTATISTICSAllTheNews2.0 Table 4: Training and validation dataset statistics of articles after cleaning (>500tokens) Statistic Training Dataset Validation Dataset Number of Sequences 131,072 20,000 Sequence Length 1024 t...

work page arXiv 2026
[11]

Loss Val

Subset Level Val. Loss Val. Perplexity Val. Accuracy (%) 27 11.24±0.07 76 320±4973 0.40±0.12 28 9.71±0.10 16 453±1588 2.12±0.41 29 8.68±0.05 5911±268 3.13±0.21 210 8.21±0.02 3683±80 4.00±0.50 211 7.70±0.08 2203±178 5.74±1.20 212 7.36±0.04 1574±63 7.13±0.27 213 7.20±0.06 1336±86 7.80±0.31 214 6.88±0.05 974±47 9.18±0.28 215 6.61±0.05 746±35 10.37±0.19 216 6...

2026