arxiv: 2607.01517 · v1 · pith:YUZTHJN6new · submitted 2026-07-01 · 💻 cs.CL

Parameter Golf: What Really Works?

Pith reviewed 2026-07-03 20:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords language model optimization · bits per byte · model compression · training efficiency · community contest · BPB evaluation · parameter budget · optimization techniques

The pith

Community contest submissions cut language model BPB from 1.2244 to 1.058 despite most single techniques adding under 1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines an open challenge in which teams trained language models whose full artifact had to fit in 16 MB and train in under ten minutes on 8xH100 GPUs, with quality scored by bits-per-byte on unseen text. It processes 2,037 pull requests and 1,430 scored submissions to build a taxonomy of 84 optimization techniques and quantifies each technique's measured effect on BPB. The leaderboard improved 13.6 percent overall, yet almost no individual technique exceeded a 1 percent gain and many contributions shrank once many teams competed. The analysis therefore isolates the small set of methods whose gains persist across different model stacks.

Core claim

The verified leaderboard score dropped from 1.2244 to 1.058 BPB across three phases -- a 13.6% reduction, despite individual techniques rarely improving BPB by more than 1%. We show that most gains in techniques shrink across competitive submissions, isolating the few methods that improve performance across stacks.
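
The headline arithmetic can be checked directly. The sketch below recomputes the relative reduction from the two reported scores and then, purely as an illustration, asks how many gains of exactly 1% each would have to compound to cover the same distance. The multiplicative compounding model and the variable names are assumptions made here, not something the paper states.

```python
import math

start_bpb = 1.2244  # baseline leaderboard score reported in the paper
final_bpb = 1.058   # final verified leaderboard score

# Relative reduction: (start - final) / start
reduction = (start_bpb - final_bpb) / start_bpb
print(f"relative reduction: {reduction:.1%}")

# Illustrative assumption: every step multiplies BPB by (1 - gain_per_step).
gain_per_step = 0.01
steps_needed = math.ceil(
    math.log(final_bpb / start_bpb) / math.log(1 - gain_per_step)
)
print(f"steps of {gain_per_step:.0%} needed: {steps_needed}")
```

Under that assumption the script reports a 13.6% reduction and 15 compounding steps, which shows how a double-digit total can coexist with sub-1% individual gains.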

What carries the argument

The taxonomy of 84 optimization techniques extracted from the 1,430 submissions, together with per-technique contribution measurements to BPB.

If this is right

  • Only a minority of techniques retain their BPB gains once many submissions compete.
  • Overall score improvement can still reach double digits even when nearly every individual technique contributes less than 1 percent.
  • Techniques must be evaluated for cross-stack robustness rather than isolated peak effect.
  • Diminishing returns appear for the majority of common optimizations under tight artifact budgets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Re-running the contest with the same techniques but new participants would test whether the observed shrinkage is due to selection or to genuine interactions.
  • The same measurement approach could be applied to other constrained training settings such as mobile or edge models.
  • Future work could measure pairwise interactions among the few persistent techniques to explain why they combine better than the rest.

Load-bearing premise

The 1,430 clean scored submissions and the derived taxonomy of 84 techniques provide an unbiased and complete basis for attributing BPB improvements to specific optimizations rather than to unmeasured interactions or selection effects in the contest data.

What would settle it

A controlled replication that applies the isolated top techniques to fresh model stacks outside the original contest and checks whether the full 13.6 percent BPB reduction is recovered.

Figures

Figures reproduced from arXiv: 2607.01517 by Prashanna Mani Paudel, Shivanand Venkanna Sheshappanavar.

Figure 1. Parameter Golf: 43 days, 1,810 submissions, BPB improved from 1.2244 to 1.058 under a fixed artifact-size budget. Shaded bands (three improvement phases). Labeled points (README-verified milestones).
Figure 3. Category contribution per phase. Each record’s BPB drop over the previous record is split across its newly added techniques in proportion to their all-data ∆k (negatives clamped to zero; drops with no positive-∆k addition split equally). Bar height is the phase’s total verified improvement; segments are the per-category shares. The weighting inherits ∆k’s adoption-timing bias, so the split is indicative ra…
Figure 4. All-data versus within-frontier ∆k, ordered by within-frontier value. The within-frontier bars carry 95% bootstrap confidence intervals. Adjacent body text: "…negative within the frontier. GPTQ, Brotli, depth recurrence, and the TTT variants, which are all near-universal among strong submissions, fall to small but significantly negative values, confirming that their large all-data ∆k reflects era correlation rather than an in…"
Figure 5. Data collection and score-extraction pipeline.
Figure 6. Top-22 techniques ranked by all-data ∆k across the 1,430 clean scored submissions. ∆k is observational; see Section 3.
Figure 7. Phase 1 techniques ranked by average BPB
Figure 8. Phase 2 techniques ranked by average BPB
Figure 9. Phase 3 techniques ranked by average BPB
Figure 10. Weekly adoption rate per technique category. Each point is the fraction of that week’s PRs using at least
Figure 11. Quantization technique deep dive. Left: average BPB improvement ∆k for all 17 quantization flags; blue bars are beneficial, red are harmful. Right: adoption counts coloured by impact direction. Int8-only and Binary (1-bit) correlate with worse BPB despite wide adoption, while GPTQ, SDClip, and LQER show the strongest gains
Figure 12. Tokenization impact. Left: BPB distributions for the three SentencePiece vocabulary sizes; larger vocabularies shift the median downward (SP8192 median 1.077 vs. SP1024 baseline 1.143). Right: CaseOps/Casefold tokenizer tightens the upper tail and shifts the median from 1.135 to 1.061
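
The attribution rule described in the Figure 3 caption can be sketched in a few lines. This is a minimal reading of the caption, not the authors' code: the function name `split_drop`, the treatment of a technique with no recorded ∆k as zero, and the sign convention (positive ∆k means a BPB improvement) are assumptions, and the numbers are invented for illustration.

```python
def split_drop(bpb_drop, new_techniques, delta_k):
    """Split one record's BPB drop across its newly added techniques.

    Each technique is weighted by its all-data delta_k with negatives
    clamped to zero. If no added technique has a positive delta_k, the
    drop is split equally.
    """
    if not new_techniques:
        return {}
    # Assumption: a technique missing from delta_k gets weight zero.
    weights = {t: max(delta_k.get(t, 0.0), 0.0) for t in new_techniques}
    total = sum(weights.values())
    if total == 0:
        share = bpb_drop / len(new_techniques)
        return {t: share for t in new_techniques}
    return {t: bpb_drop * w / total for t, w in weights.items()}


# Hypothetical values, for illustration only.
delta_k = {"technique_a": 0.03, "technique_b": 0.01, "technique_c": -0.02}
shares = split_drop(0.004, ["technique_a", "technique_b", "technique_c"], delta_k)
fallback = split_drop(0.002, ["technique_c", "technique_d"], delta_k)
print(shares)
print(fallback)
```

In the first call the clamped technique receives no credit and the other two split the drop 3:1. In the second call neither addition has a positive ∆k, so the drop is shared equally. Because the weights come from all-data ∆k, the split inherits whatever bias that quantity carries, as the caption itself notes.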
Original abstract

How far can a language model improve under a strict artifact budget? Parameter Golf posed this question as an open community challenge in which participants trained the best language model, with the complete artifact (training code + compressed weights) required to fit within 16 MB and be trained in under ten minutes on 8xH100 SXM GPUs. Quality was measured in bits-per-byte (BPB), the average number of bits required to encode each byte of unseen text. We analyze 2,037 pull requests and 1,430 clean scored submissions from the contest, build a taxonomy of 84 optimization techniques, and measure each technique's contribution to BPB. The verified leaderboard score dropped from 1.2244 to 1.058 BPB across three phases -- a 13.6% reduction, despite individual techniques rarely improving BPB by more than 1%. We show that most gains in techniques shrink across competitive submissions, isolating the few methods that improve performance across stacks.
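
As a rough illustration of the metric, bits-per-byte can be obtained from a model's summed token-level negative log-likelihood by converting nats to bits and dividing by the byte length of the evaluated text. The contest's exact evaluation script is not described on this page, so the function name, the use of UTF-8 byte counts, and the nats convention below are assumptions.

```python
import math


def bits_per_byte(token_nll_nats, text):
    """Convert summed token negative log-likelihoods (nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Sanity check: a byte-level model that is uniform over 256 values
# should cost 8 bits per byte.
sample = "golf"
uniform_nll = [math.log(256)] * len(sample.encode("utf-8"))
uniform_bpb = bits_per_byte(uniform_nll, sample)
print(uniform_bpb)
```

Normalizing by bytes rather than tokens is what makes scores comparable across submissions that use different tokenizers, which matters here because vocabulary size is one of the techniques being compared.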

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper analyzes 2,037 pull requests and 1,430 scored submissions from the Parameter Golf contest, in which participants optimized language models to fit within a 16 MB artifact budget and train in under 10 minutes on 8xH100 GPUs. It constructs a taxonomy of 84 optimization techniques, attributes BPB improvements to them, and reports that the verified leaderboard score fell from 1.2244 to 1.058 BPB (13.6% reduction) across phases, with most individual techniques contributing <1% and gains shrinking over time, while isolating a small set of methods that improve performance across stacks.

Significance. If the per-technique attributions prove robust, the work supplies a large-scale empirical map of what optimizations matter under tight compute and size constraints, highlighting diminishing returns and cross-stack generalizers. The scale of the contest data and the explicit taxonomy constitute a reproducible resource for the community.

major comments (3)
  1. [Abstract / §4] Abstract and §4 (results): The central claim that the 13.6% BPB reduction can be decomposed into contributions from the 84-technique taxonomy rests on observational contest submissions without reported statistical methods, confidence intervals, or controls for co-occurrence and sequential dependence. Later high-scoring entries are conditioned on earlier ones, so observed deltas cannot be cleanly attributed to individual techniques rather than interactions, survivor bias, or search effort.
  2. [§3] §3 (taxonomy construction): The assignment of the 84 techniques to the 1,430 submissions is described as post-hoc labeling; without pre-specified criteria, inter-annotator reliability metrics, or sensitivity checks to alternative taxonomies, the isolation of 'few methods that improve performance across stacks' risks circularity with the leaderboard ordering itself.
  3. [§4.2] §4.2 (per-technique measurement): No matched-pair ablations, randomized controls, or regression models with interaction terms are mentioned to separate main effects from confounders; the reported shrinkage of gains across phases therefore cannot be distinguished from selection effects in the competitive data.
minor comments (2)
  1. [Abstract] Abstract: 'verified leaderboard score' is used without a definition of the verification procedure or exclusion criteria for the 1,430 clean submissions.
  2. [§2] The manuscript would benefit from an explicit statement of how BPB is computed on the held-out text and whether the same evaluation set was used across all phases.
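
For readers unfamiliar with the interval type behind the referee's first major comment (the Figure 4 caption mentions 95% bootstrap intervals), the following is a minimal percentile-bootstrap sketch. The paper's actual ∆k definition and resampling procedure are not given on this page. Here ∆k is assumed to be mean BPB without a technique minus mean BPB with it, and all scores are invented.

```python
import random
from statistics import mean


def bootstrap_ci(bpb_with, bpb_without, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(bpb_without) - mean(bpb_with)."""
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        with_sample = rng.choices(bpb_with, k=len(bpb_with))
        without_sample = rng.choices(bpb_without, k=len(bpb_without))
        deltas.append(mean(without_sample) - mean(with_sample))
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical BPB scores, for illustration only.
bpb_with = [1.06, 1.07, 1.065, 1.08, 1.075]
bpb_without = [1.12, 1.13, 1.11, 1.14, 1.125]
point = mean(bpb_without) - mean(bpb_with)
lo, hi = bootstrap_ci(bpb_with, bpb_without)
print(point, lo, hi)
```

An interval of this kind quantifies sampling noise only. It does not address the co-occurrence and sequential-dependence concerns raised in the major comments, since resampling confounded data reproduces the confounding.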

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating revisions where we agree changes are warranted while defending the observational nature of the study.

  1. Referee: [Abstract / §4] Abstract and §4 (results): The central claim that the 13.6% BPB reduction can be decomposed into contributions from the 84-technique taxonomy rests on observational contest submissions without reported statistical methods, confidence intervals, or controls for co-occurrence and sequential dependence. Later high-scoring entries are conditioned on earlier ones, so observed deltas cannot be cleanly attributed to individual techniques rather than interactions, survivor bias, or search effort.

    Authors: We agree the analysis is observational and lacks formal statistical controls such as regression models, confidence intervals, or explicit handling of sequential dependence. Attributions derive from associating technique introductions with score deltas across submissions. We will revise the abstract and §4 to qualify claims as observational, add a limitations paragraph on confounders including survivor bias and co-occurrence, and report technique co-occurrence frequencies among top entries. New controlled experiments are not feasible on historical data. revision: partial

  2. Referee: [§3] §3 (taxonomy construction): The assignment of the 84 techniques to the 1,430 submissions is described as post-hoc labeling; without pre-specified criteria, inter-annotator reliability metrics, or sensitivity checks to alternative taxonomies, the isolation of 'few methods that improve performance across stacks' risks circularity with the leaderboard ordering itself.

    Authors: The taxonomy was built by iterative review of PR descriptions and code diffs, with categories defined independently before scoring associations. No formal inter-annotator metrics were computed. We will add explicit discussion of the post-hoc process and a sensitivity analysis re-grouping a sample of techniques to test robustness of the cross-stack results. Circularity is mitigated because taxonomy labels precede score-based filtering and are applied uniformly across all submissions. revision: partial

  3. Referee: [§4.2] §4.2 (per-technique measurement): No matched-pair ablations, randomized controls, or regression models with interaction terms are mentioned to separate main effects from confounders; the reported shrinkage of gains across phases therefore cannot be distinguished from selection effects in the competitive data.

    Authors: We concur that no ablations or randomized controls exist, as the work analyzes existing contest data rather than new experiments. The phase-wise shrinkage is reported descriptively. We will revise §4.2 to state explicitly that selection effects cannot be ruled out and to frame the shrinkage finding as correlational. Core per-technique measurements remain unchanged without new data. revision: yes
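
The co-occurrence report promised in the first response could be computed along the following lines. This is a sketch under stated assumptions: each submission is represented as a list of technique labels, the function name `cooccurrence_rates` is hypothetical, and the example entries are invented (only the technique names are taken from the figure captions).

```python
from collections import Counter
from itertools import combinations


def cooccurrence_rates(submissions):
    """Fraction of submissions in which each unordered technique pair appears together."""
    if not submissions:
        return {}
    pair_counts = Counter()
    for techniques in submissions:
        # Deduplicate and sort so each pair has one canonical key.
        for pair in combinations(sorted(set(techniques)), 2):
            pair_counts[pair] += 1
    n = len(submissions)
    return {pair: count / n for pair, count in pair_counts.items()}


# Hypothetical top entries, for illustration only.
top_entries = [
    ["GPTQ", "Brotli", "depth recurrence"],
    ["GPTQ", "Brotli"],
    ["GPTQ", "SDClip"],
]
rates = cooccurrence_rates(top_entries)
for pair, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(pair, round(rate, 2))
```

Pairs that appear together in nearly every strong submission are exactly the ones whose individual effects an observational analysis cannot separate, which is why such a table is a useful companion to per-technique deltas.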

Circularity Check

0 steps flagged

No circularity: analysis rests on external contest data without self-referential reductions

The paper reports an observational analysis of 2,037 pull requests and 1,430 external contest submissions, constructing a taxonomy of 84 techniques and measuring their BPB contributions from public leaderboard scores. No equations, definitions, or self-citations are presented that reduce the reported 13.6% BPB drop or per-technique attributions to fitted inputs, self-definitions, or prior author work by construction. The derivation chain is self-contained against the external contest data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The reported improvement and technique taxonomy rest on the contest's evaluation protocol and the assumption that submitted models were scored consistently; no free parameters or invented entities are described in the abstract.

axioms (1)
  • Domain assumption: Bits-per-byte on unseen text is an appropriate scalar measure of language-model quality under the contest constraints.
    BPB is used as the sole quality metric for ranking submissions.

pith-pipeline@v0.9.1-grok · 5697 in / 1333 out tokens · 45803 ms · 2026-07-03T20:47:12.683627+00:00 · methodology

