pith. machine review for the scientific record.

arxiv: 2604.22753 · v1 · submitted 2026-04-24 · 💻 cs.LG

Recognition: unknown

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 12:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords scaling laws · active experimental design · sequential allocation · budget-efficient fitting · extrapolation accuracy · uncertainty-aware selection · pilot experiments · machine learning

The pith

An uncertainty-aware method for choosing low-cost pilot experiments can fit scaling laws nearly as accurately as using the full budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames scaling-law fitting as a sequential decision problem where each possible experiment has a known cost and the goal is to maximize prediction accuracy in an expensive target region rather than on average. It introduces an active-selection procedure that repeatedly fits a provisional scaling law, estimates uncertainty at candidate points, and allocates the next slice of budget to the runs expected to reduce error most in the high-cost zone. On a range of scaling-law benchmarks the procedure matches or exceeds classical experimental-design baselines while spending roughly one-tenth the total compute, and it often comes close to the accuracy obtained when every candidate run is performed.

Core claim

Formulating scaling-law fitting as budget-aware sequential experimental design and solving it with uncertainty-directed selection yields extrapolation accuracy that consistently surpasses classical baselines and frequently approaches the accuracy of the complete experimental set while consuming only about 10 percent of the total training budget.

What carries the argument

Uncertainty-aware sequential allocation that, at each step, fits the current scaling law to already-run points and selects the next lowest-cost candidate whose execution is predicted to reduce error most in the designated high-cost extrapolation region.
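The paper's exact acquisition rule and uncertainty estimator are not specified in the material above, but the loop it describes can be sketched. The following is a minimal, hypothetical illustration: it assumes a power-law family, bootstrap refitting as the uncertainty proxy, and a plug-in prediction for the outcome of each untried run; all function names are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_power_law(x, y):
    """Least-squares fit of y ~ a * x**b in log-log space."""
    b, log_a = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(log_a), b

def target_uncertainty(x_run, y_run, x_target, n_boot=50):
    """Bootstrap spread of the provisional fit's predictions over the
    high-cost target region (one plausible stand-in for the paper's
    unspecified uncertainty estimate)."""
    preds = []
    while len(preds) < n_boot:
        idx = rng.integers(0, len(x_run), len(x_run))
        if np.unique(x_run[idx]).size < 2:
            continue  # a log-log line needs at least two distinct scales
        a, b = fit_power_law(x_run[idx], y_run[idx])
        preds.append(a * x_target**b)
    return float(np.std(preds, axis=0).mean())

def select_next(x_run, y_run, candidates, costs, x_target):
    """Score each untried candidate by its predicted reduction of
    target-region uncertainty per unit cost, using the current fit's
    prediction as a plug-in value for the unobserved outcome."""
    a, b = fit_power_law(x_run, y_run)
    base = target_uncertainty(x_run, y_run, x_target)
    gains = []
    for xc, c in zip(candidates, costs):
        y_plug = a * xc**b
        u = target_uncertainty(np.append(x_run, xc),
                               np.append(y_run, y_plug), x_target)
        gains.append((base - u) / c)
    return int(np.argmax(gains))
```

In a full loop, the selected run would be executed, its true outcome appended to the observed set, and the procedure repeated until the budget slice is exhausted.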

If this is right

  • Pilot studies for large language-model training can be budgeted and scheduled more efficiently without sacrificing the reliability of the resulting scaling predictions.
  • The same selection logic can be applied to any family of parametric curves that must be extrapolated from cheap observations to expensive ones.
  • Laboratories that adopt the procedure will obtain usable scaling laws after fewer GPU-hours, freeing resources for the main training runs those laws are meant to inform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended to cases where experiment costs are not known in advance but must be estimated from partial runs or hardware models.
  • If the uncertainty estimates degrade under strong model misspecification, hybrid strategies that occasionally inject random or space-filling points might restore robustness.
  • The method’s success on scaling laws suggests it may transfer to other expensive-to-evaluate domains such as hyperparameter optimization or neural-architecture search where target regions are defined by compute or latency constraints.

Load-bearing premise

The uncertainty estimates produced by each provisional fit remain reliable enough to identify which untried experiments will actually improve accuracy in the true high-cost target region, even when the assumed functional form or noise model is imperfect.

What would settle it

Run the method on a new scaling-law benchmark until it has spent 10 percent of the budget, then compare the final extrapolation error on held-out high-cost points against the error obtained by simply choosing the same number of experiments at random; if the active method does not show lower error, the claim is falsified.
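The random baseline half of that test is easy to pin down concretely. This is an illustrative sketch on synthetic data, not the paper's benchmark: it fits a power law to random budget-matched subsets of a cheap-run pool and measures held-out error at far-extrapolation points, producing the reference number an active method would need to beat.

```python
import numpy as np

rng = np.random.default_rng(1)

def extrapolation_error(x_chosen, y_chosen, x_test, y_test):
    """Held-out MSE of a log-space power-law fit on the chosen runs."""
    b, log_a = np.polyfit(np.log(x_chosen), np.log(y_chosen), 1)
    pred = np.exp(log_a) * x_test**b
    return float(np.mean((pred - y_test) ** 2))

# Synthetic stand-in: cheap runs at small scales, held-out targets far away.
x_pool = np.geomspace(1.0, 32.0, 8)
y_pool = 3.0 * x_pool**-0.4 * np.exp(rng.normal(0.0, 0.02, x_pool.size))
x_test = np.array([256.0, 1024.0])
y_test = 3.0 * x_test**-0.4

# Random baseline at a matched budget: mean error over many random subsets
# of the same size an active method would have purchased.
k = 4
errs = []
for _ in range(200):
    idx = rng.choice(x_pool.size, k, replace=False)
    errs.append(extrapolation_error(x_pool[idx], y_pool[idx], x_test, y_test))
random_baseline = float(np.mean(errs))
```

An active method's error at the same subset size `k` would then be compared directly against `random_baseline`.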

Figures

Figures reproduced from arXiv: 2604.22753 by Ameet Talwalkar, Haowei Lin, Shanda Li, Sijie Li, Weiwei Sun, Yiming Yang.

Figure 1
Figure 1. Our method identifies the extrapolation optimum using only a small fraction of … view at source ↗
Figure 2
Figure 2. Mean target-region R² as a function of consumed budget on the benchmark. Our method reaches the strongest overall budget–accuracy trade-off and approaches the full-data reference using only a small fraction of the total experimental cost. [Accompanying per-setting results table (settings: lr&bsz, domain, vocab, parallel, moe, data con, sparsity, farseer; baselines including 1% Random) truncated in extraction.] view at source ↗
Figure 3
Figure 3. Parameter-space visualization for one lr&bsz scaling law (sl 5) after fitting on the cheapest 12% of training points from 2048 initializations. We embed the fitted parameters with t-SNE and color each solution by its MSE on the selected points (left) or on the held-out test region (right). Multiple separated clusters indicate many local optima, while the mismatch between the two colorings shows that lower … view at source ↗
read the original abstract

Scaling laws are used to plan multi-million-dollar training runs, but fitting those laws can itself cost millions. In modern large-scale workflows, assembling a sufficiently informative set of pilot experiments is already a major budget-allocation problem rather than a routine preprocessing step. We formulate scaling-law fitting as budget-aware sequential experimental design: given a finite pool of runnable experiments with heterogeneous costs, choose which runs to execute so as to maximize extrapolation accuracy in a high-cost target region. We then propose an uncertainty-aware method for sequentially allocating experimental budget toward the runs most useful for target-region extrapolation. Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total training budget. Our code is available at https://github.com/PlanarG/active-sl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript formulates scaling-law fitting as a budget-aware sequential experimental design task and proposes an uncertainty-aware active selection procedure that prioritizes experiments expected to reduce extrapolation error in a high-cost target region. On a diverse benchmark of scaling-law tasks the method is reported to outperform classical design baselines while achieving performance close to that of the full experimental budget using only about 10% of the total training cost; code is released.

Significance. If the empirical results are robust, the work directly addresses a high-cost practical bottleneck in large-scale model development. The public code release is a clear strength that supports reproducibility and follow-on research.

major comments (1)
  1. [Method and Experimental Evaluation] The central claim that uncertainty-driven selection reliably improves target-region extrapolation rests on the assumption that posterior uncertainty under the fitted parametric model is well-calibrated with respect to true error. The manuscript provides no explicit stress tests on deliberately misspecified families (e.g., power-law fits to data containing logarithmic or saturation terms), which is load-bearing for the reported 10% budget savings under realistic scaling-curve deviations.
minor comments (2)
  1. [Abstract] The abstract refers to 'a diverse benchmark of scaling-law tasks' without enumerating the tasks, model families, cost distributions, or exact metrics; a summary table or expanded description in the experimental section would improve clarity.
  2. [Method] Clarify whether the uncertainty estimates are obtained from a Bayesian posterior, bootstrap, or another procedure, and state the precise acquisition function used for sequential selection.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of robustness under model misspecification. We address the major comment point by point below and describe the revisions we will make.

read point-by-point responses
  1. Referee: The central claim that uncertainty-driven selection reliably improves target-region extrapolation rests on the assumption that posterior uncertainty under the fitted parametric model is well-calibrated with respect to true error. The manuscript provides no explicit stress tests on deliberately misspecified families (e.g., power-law fits to data containing logarithmic or saturation terms), which is load-bearing for the reported 10% budget savings under realistic scaling-curve deviations.

    Authors: We agree that this is a substantive point. Our experiments are conducted on real scaling-law tasks using the standard parametric families (primarily power laws) that are conventional in the literature and that provide reasonable fits to the observed data. The reported gains therefore reflect performance under these commonly assumed models. However, the manuscript does not include controlled stress tests that deliberately introduce misspecification, such as generating data from logarithmic or saturating functions and then fitting power-law models. In the revised version we will add a dedicated synthetic-data experiment that systematically varies the degree of misspecification and reports the resulting extrapolation error and budget efficiency of our active-selection procedure relative to the baselines. This will directly test the load-bearing assumption and clarify the conditions under which the observed 10% budget savings remain reliable. revision: yes
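The promised stress test can be prototyped in a few lines. The sketch below is not the authors' planned experiment, only one convenient instance of it: data are generated from a shifted curve (x + shift)^-0.5, which is an exact power law at shift = 0 and bends increasingly away from a log-log line as the shift grows, and a pure power law is fit and extrapolated far beyond the fitting range.

```python
import numpy as np

def misfit_extrapolation_error(shift, x_fit, x_far):
    """Fit a pure power law to data from y = (x + shift)**-0.5 and return
    the mean relative error at far extrapolation points x_far.
    shift = 0 is well-specified; larger shifts increase misspecification."""
    y_fit = (x_fit + shift) ** -0.5
    b, log_a = np.polyfit(np.log(x_fit), np.log(y_fit), 1)
    y_true = (x_far + shift) ** -0.5
    y_pred = np.exp(log_a) * x_far**b
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

x_fit = np.geomspace(1.0, 32.0, 8)   # cheap fitting range
x_far = np.array([1024.0, 4096.0])   # expensive target region
errors = [misfit_extrapolation_error(s, x_fit, x_far) for s in (0.0, 1.0, 4.0)]
```

Sweeping the shift (or swapping in logarithmic or saturating generators) and overlaying the active method's budget curve would directly probe how far the uncertainty estimates can be trusted under misspecification.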

Circularity Check

0 steps flagged

Algorithmic active-design procedure with external benchmark validation; no derivation reduces to inputs by construction

full rationale

The paper formulates scaling-law fitting as a sequential experimental-design problem and proposes an uncertainty-aware allocation rule. Performance is assessed via direct comparison against classical baselines on a held-out benchmark of tasks, using only a fraction of the total budget. No equations or claims reduce a target quantity to a fitted parameter by definition, no load-bearing self-citations justify uniqueness, and the central result is an empirical improvement rather than an algebraic identity. The method therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract alone supplies no explicit free parameters, axioms, or invented entities; the method implicitly assumes that a probabilistic model of scaling-law residuals can be maintained and that its uncertainty is a reliable proxy for extrapolation value.

pith-pipeline@v0.9.0 · 5464 in / 1011 out tokens · 18973 ms · 2026-05-08T12:11:19.802955+00:00 · methodology

