Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

Felipe Chavarro Polania

arXiv: 2606.11387 · v1 · pith:K4HSFDAPnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-06-27 13:18 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords micro-pretraining · staged promotion · cost allocation · auditable screening · GPU hours · validation performance · host sensitivity

The pith

Staged promotion protocol selects top configuration using 169 GPU-hours of micro-pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether a sequence of small pretraining runs with preset promotion rules can produce reliable choices about which model setups to train at larger scale. Starting from twelve candidate configurations, it screens at 2, 5, 10, and 60 minutes before a 12-hour confirmation, using the same two host blocks (Windows A100 and Linux L40S) and promotion rules frozen before each expensive continuation. The protocol keeps a bridge reference in the promoted set, and that reference ranks first in all four host-seed cells at the final stage, while a greedy comparator and a cheaper sentinel fail their respective frozen rules (near-equivalence and mean-gap). Total training compute is 169.2 GPU-hours, well below the 432 GPU-hours that a 12-hour continuation of all nine replicated 10-minute candidates would have cost. The study frames this as a bounded cost-allocation result rather than a guarantee of global optimality.

Core claim

The paper establishes that a staged promotion protocol with fixed budgets and frozen rules can keep a reference configuration in the promoted set through all gates and rank it first at 12 hours in all four host-seed cells (two hosts, two seeds). The greedy comparator fails the 0.010 val_bpb near-equivalence rule and the d8/ar48 sentinel fails the 0.020 mean-gap rule. The full protocol records 169.2 training GPU-hours, with the 12-hour branch using 144 GPU-hours, compared to 432 GPU-hours for continuing all nine replicated 10-minute candidates.

What carries the argument

Staged factorial screening with five fixed budgets (2 min, 5 min, 10 min, 60 min, 12 h) and frozen promotion thresholds applied across heterogeneous hosts.
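One way to make "frozen" concrete is an immutable configuration object that is created before any run starts. The sketch below is illustrative only: the class name, field names, and helper are hypothetical and do not come from the paper. Only the five budgets and the two threshold values are taken from the text.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class PromotionProtocol:
    """Hypothetical container for the protocol constants (names are assumptions)."""

    # Stage budgets in minutes: 2, 5, 10, 60 minutes and 12 hours (720 minutes).
    stage_minutes: tuple = (2, 5, 10, 60, 720)
    # Thresholds in val_bpb units, as quoted in the abstract.
    near_equivalence_bpb: float = 0.010
    mean_gap_bpb: float = 0.020


PROTOCOL = PromotionProtocol()


def can_retune(protocol: PromotionProtocol) -> bool:
    """Return True if a threshold can be changed after creation (it should not be)."""
    try:
        protocol.near_equivalence_bpb = 0.050  # attempted post-hoc adjustment
    except FrozenInstanceError:
        return False
    return True


print(PROTOCOL.stage_minutes, can_retune(PROTOCOL))
```

An immutable object does not prove that thresholds were chosen before the data were seen; it only prevents silent edits inside one codebase. Provenance still has to be documented separately.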

If this is right

The bridge condition ranks first in all four 60-minute host-seed cells and again in the 12-hour package.
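The protocol spends 169.2 GPU-hours in total, of which 144 go to the executed 12-hour branch; continuing all nine replicated 10-minute candidates to 12 hours would have cost 432 GPU-hours (an unrun accounting counterfactual).
Rankings at the 5- and 10-minute stages vary by host and are treated as operational promotion evidence, not as within-seed learning curves.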
The 60-minute replicated gate retains the reference while discarding some candidates.
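The GPU-hour figures above can be reproduced with simple arithmetic. The sketch below rests on two assumptions that the summary does not state outright: each run occupies one GPU, and the executed 12-hour branch contains three conditions (bridge, greedy comparator, d8/ar48 sentinel) in each of the four host-seed cells. Under those assumptions the reported 144, 192, and 432 GPU-hour figures all come out exactly; the function and variable names are hypothetical.

```python
CONFIRMATION_HOURS = 12   # length of the final stage, from the paper
HOST_SEED_CELLS = 4       # two hosts x two seeds, from the paper
TOTAL_RECORDED = 169.2    # total training GPU-hours reported by the paper


def continuation_gpu_hours(n_candidates: int,
                           cells: int = HOST_SEED_CELLS,
                           hours: int = CONFIRMATION_HOURS) -> int:
    """GPU-hours to run n_candidates in every cell, assuming one GPU per run."""
    return n_candidates * cells * hours


executed = continuation_gpu_hours(3)    # assumed: bridge, greedy, sentinel
all_60min = continuation_gpu_hours(4)   # counterfactual: all four 60-minute candidates
all_10min = continuation_gpu_hours(9)   # counterfactual: all nine 10-minute candidates

# Whatever is left of the recorded total is attributed to the screening stages.
screening = round(TOTAL_RECORDED - executed, 1)

print(executed, all_60min, all_10min, screening)
```

The counterfactual totals cover the 12-hour continuation only, so the like-for-like comparison is 144 versus 192 or 432 GPU-hours; the 169.2 figure additionally includes the screening stages.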

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar staging could be applied to other pretraining scales to test whether the same thresholds remain effective.
The approach might generalize to screening in other machine learning experiment pipelines where full runs are expensive.
Future work could vary the stage budgets to measure sensitivity of the final selection to the chosen gates.

Load-bearing premise

The set of twelve prior-screened configurations contains every setup that could have performed better at the 12-hour scale, so early elimination does not discard superior options.

What would settle it

Taking one of the replicated 10-minute candidates that was not promoted, training it for the full 12 hours, and observing a lower validation bits-per-byte than the promoted bridge condition would show that the protocol missed a better configuration.

Figures

Figures reproduced from arXiv: 2606.11387 by Felipe Chavarro Polania.

**Figure 2.** Early promise separates from 12-hour survival. Lines show mean […] [PITH_FULL_IMAGE:figures/full_fig_p008_2.png]

**Figure 3.** 12-hour seed-stability check across two hosts. The bridge condition ranks first in all four host-seed cells. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]

**Figure 4.** Cost-quality frontier at 12 hours. Smaller models process more tokens but remain worse in final […] [PITH_FULL_IMAGE:figures/full_fig_p010_4.png]

**Figure 5.** Promotion gates reduced long-horizon spend under frozen stopping rules. The executed branch uses 144 […] [PITH_FULL_IMAGE:figures/full_fig_p011_5.png]

Original abstract

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
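The abstract names two frozen rules but this page does not give their exact definitions. The sketch below shows one plausible reading, stated as an assumption: the near-equivalence rule requires a comparator to stay within 0.010 val_bpb of the reference in every host-seed cell, and the mean-gap rule requires the gap averaged over cells to stay within 0.020. All val_bpb values, seed labels, and function names are invented for illustration and are not results from the paper.

```python
from statistics import mean

NEAR_EQUIVALENCE_BPB = 0.010  # threshold quoted in the abstract
MEAN_GAP_BPB = 0.020          # threshold quoted in the abstract


def gaps(candidate: dict, reference: dict) -> list:
    """Per-cell val_bpb gap (candidate minus reference); positive means worse."""
    return [candidate[cell] - reference[cell] for cell in reference]


def meets_near_equivalence(candidate, reference, tol=NEAR_EQUIVALENCE_BPB) -> bool:
    # Assumed reading: the gap must stay within tol in every host-seed cell.
    return all(g <= tol for g in gaps(candidate, reference))


def meets_mean_gap(candidate, reference, tol=MEAN_GAP_BPB) -> bool:
    # Assumed reading: the gap averaged over cells must stay within tol.
    return mean(gaps(candidate, reference)) <= tol


# Hypothetical val_bpb values (lower is better); seed labels 0 and 1 are placeholders.
cells = [("A100", 0), ("A100", 1), ("L40S", 0), ("L40S", 1)]
bridge = dict(zip(cells, [0.900, 0.902, 0.905, 0.903]))
greedy = dict(zip(cells, [0.915, 0.913, 0.921, 0.917]))
sentinel = dict(zip(cells, [0.930, 0.928, 0.935, 0.931]))

print(meets_near_equivalence(greedy, bridge), meets_mean_gap(sentinel, bridge))
```

With these made-up numbers both checks fail, mirroring the qualitative outcome the abstract reports. The point of the sketch is that the thresholds enter as constants fixed before the comparison, not that these are the paper's definitions.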

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A narrow case study that shows one way to screen micro-pretraining configs with fixed short-run rules and ends up using 169 GPU-hours total.

The paper walks through a staged protocol on twelve already-screened configs using budgets of 2/5/10/60 minutes plus a 12-hour confirmation on two host types. Early stages show host-sensitive rankings, the 12-hour winner was not the mean leader at the 10-minute gate, and the final run keeps the bridge reference on top while two comparators fail the preset rules. They report the exact GPU-hour spend and contrast it with the higher cost of running all candidates forward.

What stands out is the explicit cost accounting and the decision to treat ranking shifts as operational evidence rather than noise. The numbers are specific: 169.2 training GPU-hours executed versus 432 for all nine 10-minute candidates. The paper is clear that this is a bounded allocation result on one runner, not a general method.

The main limitation is the starting set of twelve configs. The auditable claim rests on that set being sufficient; if a better candidate was dropped before the staged protocol, the frozen rules cannot catch it. Threshold choice (0.010 val_bpb, 0.020 mean-gap) is presented as fixed but the justification for those exact values is thin. This is a single case study, so variability across different models or data is unknown.

The work is aimed at researchers who run many small pretraining trials on limited hardware and want a documented way to drop candidates without post-hoc changes. A reader in that position can extract the stage budgets and rule structure to adapt.

It is worth sending for peer review. The empirical details and cost tracking are concrete enough that referees can evaluate the protocol on its own terms.

Referee Report

2 major / 1 minor

Summary. The paper claims that a staged promotion protocol with budgets of 2/5/10/60 minutes and 12 hours, using frozen promotion rules (0.010 val_bpb near-equivalence and 0.020 mean-gap), allows for auditable decisions from twelve prior-screened configurations in micro-pretraining. It demonstrates that the bridge condition ranks first at the 12-hour stage across all host-seed cells, with total compute of 169.2 GPU-hours, providing cost savings compared to counterfactual full continuations, while qualifying the result as a bounded cost-allocation finding rather than general optimality.

Significance. If the thresholds are indeed frozen and the screening set sufficient, this case study illustrates a practical method for reducing experimental costs in pretraining configuration selection with explicit accounting of compute usage and clear limitations on the scope of claims. The use of multiple hosts and seeds, along with acknowledgment of ranking instability at early stages, adds to the operational evidence presented.

major comments (2)

[Abstract] The central claim that the protocol yields auditable promotion decisions from the twelve prior-screened configurations without post-hoc adjustment requires that the initial set is exhaustive and that the promotion thresholds were fixed independently of outcomes; the manuscript provides no details on the screening process or when the 0.010/0.020 rules were determined relative to data collection, which is load-bearing for the auditable property.
[Abstract] The 12-hour result states that the bridge condition ranks first in all four host-seed cells across two seeds and that the greedy comparator fails the frozen rule, but without reported per-cell values, variance estimates, or explicit confirmation that the rules were applied identically to all candidates, the stability claim is difficult to verify from the given numbers alone.

minor comments (1)

[Abstract] The counterfactual costs (192 GPU-hours for four 60-minute candidates, 432 for nine 10-minute candidates) are correctly labeled as accounting exercises rather than evidence of superiority, but a short clarification on how the sets of four and nine were determined from the twelve would improve transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. The two major comments correctly identify areas where additional transparency would strengthen the verifiability of the auditable-protocol claim. We address each point below and will incorporate the requested details in a revised manuscript.

Referee: [Abstract] The central claim that the protocol yields auditable promotion decisions from the twelve prior-screened configurations without post-hoc adjustment requires that the initial set is exhaustive and that the promotion thresholds were fixed independently of outcomes; the manuscript provides no details on the screening process or when the 0.010/0.020 rules were determined relative to data collection, which is load-bearing for the auditable property.

Authors: We agree that the current text does not explicitly document the provenance of the twelve configurations or the timing of threshold selection. The 0.010 val_bpb near-equivalence and 0.020 mean-gap rules were fixed in advance of the staged-budget experiments on the basis of an earlier, separate pilot study whose data are not part of the reported twelve-configuration set. The screening phase itself was an independent exploratory run performed before any of the staged budgets were executed. We will revise the abstract and add a short methods paragraph stating these facts, thereby making the “frozen before expensive continuations” claim directly verifiable from the manuscript.

revision: yes

Referee: [Abstract] The 12-hour result states that the bridge condition ranks first in all four host-seed cells across two seeds and that the greedy comparator fails the frozen rule, but without reported per-cell values, variance estimates, or explicit confirmation that the rules were applied identically to all candidates, the stability claim is difficult to verify from the given numbers alone.

Authors: The manuscript asserts uniform ranking but does not tabulate the per-cell val_bpb and mean-gap numbers that underlie the claim. We will add an appendix table listing the four host-seed cells at the 12-hour stage, together with the exact numerical outcomes for the bridge condition, the greedy comparator, and the d8/ar48 sentinel. This will allow readers to confirm that the frozen 0.010/0.020 thresholds were applied identically. Because each stage used distinct seed ranges, within-cell variance is not defined in the design; we will note this limitation explicitly rather than report a variance estimate.

revision: yes
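The per-cell table promised in the rebuttal could be generated along the following lines. This is a minimal sketch, not the authors' tooling: the data layout, the function names, the seed labels, and every val_bpb value are hypothetical placeholders, and lower val_bpb is assumed to be better.

```python
# Hypothetical 12-hour val_bpb per host-seed cell (placeholder values only).
table = {
    ("A100", 0): {"bridge": 0.900, "greedy": 0.915, "d8/ar48": 0.930},
    ("A100", 1): {"bridge": 0.902, "greedy": 0.913, "d8/ar48": 0.928},
    ("L40S", 0): {"bridge": 0.905, "greedy": 0.921, "d8/ar48": 0.935},
    ("L40S", 1): {"bridge": 0.903, "greedy": 0.917, "d8/ar48": 0.931},
}


def per_cell_rows(table: dict, reference: str = "bridge") -> list:
    """Return (cell, condition, rank, gap to reference) rows, best first per cell."""
    rows = []
    for cell, scores in table.items():
        ordered = sorted(scores, key=scores.get)  # ascending val_bpb
        for rank, name in enumerate(ordered, start=1):
            gap = round(scores[name] - scores[reference], 3)
            rows.append((cell, name, rank, gap))
    return rows


def ranks_first_everywhere(table: dict, condition: str) -> bool:
    """True if the condition has the lowest val_bpb in every cell."""
    return all(min(scores, key=scores.get) == condition for scores in table.values())


for row in per_cell_rows(table):
    print(row)
print(ranks_first_everywhere(table, "bridge"))
```

Publishing the rows rather than only the boolean lets a reader recheck both frozen thresholds against the same numbers. Ties are broken by dictionary order here, which a real report would need to handle explicitly.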

Circularity Check

0 steps flagged

No circularity: direct empirical measurements from frozen staged protocol

The paper reports results from an executed staged-promotion protocol with explicitly frozen rules (0.010 val_bpb near-equivalence, 0.020 mean-gap) applied to twelve prior-screened configurations across fixed budgets. All central claims are direct measurements of GPU-hours, rankings, and rule outcomes from the runs themselves; no equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation. The text explicitly frames the outcome as a bounded cost-allocation finding rather than optimality or general derivation, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the specific experimental parameters chosen for this case study and the domain assumption that short-run performance provides usable (if unstable) signal for promotion decisions.

free parameters (2)

promotion thresholds = 0.010 and 0.020
The 0.010 val_bpb near-equivalence and 0.020 mean-gap rules are chosen and frozen in advance.
stage budgets = 2min/5min/10min/60min/12h
Specific durations of 2, 5, 10, 60 minutes and 12 hours are selected for the protocol.

axioms (1)

domain assumption: Short pretraining runs provide usable ranking signal despite host sensitivity and instability
Invoked to justify the staged screening and promotion from early unstable stages.

Reference graph

Works this paper leans on

7 extracted references · 1 linked inside Pith

[1] Felipe Chavarro Polania. Staged Factorial Screening for Budget-Constrained Micro-Pretraining. arXiv:2606.05186, 2026. https://arxiv.org/abs/2606.05186

[2] Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. Journal of Machine Learning Research, 18(185):1-52, 2018. https://jmlr.org/papers/v18/16-558.html

[3] Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Jonathan Ben-tzur, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. A System for Massively Parallel Hyperparameter Tuning. Proceedings of Machine Learning and Systems, 2020. https://proceedings.mlsys.org/paper_files/paper/2020/hash/a06f20b349c6cf09a6b171c71b88bbfc-Abstract.html

[4] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. Proceedings of the 35th International Conference on Machine Learning, PMLR 80:1437-1446, 2018. https://proceedings.mlr.press/v80/falkner18a.html

[5] Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. Show Your Work: Improved Reporting of Experimental Results. Proceedings of EMNLP-IJCNLP, pages 2185-2194, 2019. https://aclanthology.org/D19-1224/

[6] Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, and Jesse Dodge. DataDecide: How to Predict Best Pretraining Data with Small Experiments. arXiv:2504.11393, 2025. https://arxiv.org/abs/2504.11393

[7] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic Pretraining Optimizers and Where to Find Them. arXiv:2509.02046, 2025. https://arxiv.org/abs/2509.02046
