Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning

Daniel Beck; Markus Hiller; Trevor Cohn; Viktoria Schram

arxiv: 2605.17234 · v2 · pith:ZFNI7PYXnew · submitted 2026-05-17 · 💻 cs.LG

Active Budget Allocation for Efficient Scaling Law Estimation via Surrogate-Guided Pruning

Viktoria Schram , Markus Hiller , Daniel Beck , Trevor Cohn This is my paper

Pith reviewed 2026-05-20 14:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords scaling lawssuccessive halvingsurrogate modelsbudget allocationlearning curvescompute efficiencyperformance prediction

0 comments

The pith

Successive Halving guided by surrogate models yields scaling-law estimates with lower loss-compute values than uniform allocation while using far less compute.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how to allocate a fixed compute budget across many partial training runs when fitting empirical scaling laws. It applies Successive Halving to discard poor performers early and augments the procedure with parametric and non-parametric surrogate models that forecast final performance from early observations. This combination returns a collection of completed learning curves whose best loss-compute point is better than the best point obtained by uniform full-length runs or by Successive Halving alone. On real-world and synthetic datasets the method improves the frontier by average relative margins of 2.84 percent and 5.47 percent. The same strategic pruning reduces total compute to as little as 1.3 percent of the exhaustive baseline while still producing usable scaling-law fits.

Core claim

When Successive Halving is combined with surrogate models that predict which learning curves will ultimately deliver the best loss-compute trade-off, the resulting set of curves contains at least one point whose loss-compute value is lower than the minimum achieved by naive uniform allocation or by Successive Halving without surrogates. Experiments on real and synthetic learning-curve collections confirm mean relative improvements of 2.84 percent and 5.47 percent respectively, and the overall procedure obtains accurate scaling laws at up to 98.7 percent lower computational cost than exhaustive evaluation of every configuration.

What carries the argument

Successive Halving guided by parametric and non-parametric surrogate models that forecast final loss from partial training observations.

Load-bearing premise

The surrogate models can reliably predict which partially trained curves will ultimately produce the best loss-compute frontier, so early stopping never discards a superior curve.

What would settle it

Run the surrogate-guided procedure on a fresh collection of learning curves and verify whether the selected curve ever lies strictly above the frontier obtained by exhaustive uniform allocation; if it does for any tested budget, the improvement claim fails.

Figures

Figures reproduced from arXiv: 2605.17234 by Daniel Beck, Markus Hiller, Trevor Cohn, Viktoria Schram.

**Figure 1.** Figure 1: Comparison of different compute budget allocation strategies using the synthetic LC dataset. Approaches are described in Section 3. Left: Conventional uniform allocation. Middle: Successive Halving (SH). Right: Successive Halving with multitask Gaussian Process (LMC) surrogate model. Green curves denote observed learning curves for models of different sizes, where smaller models appear to the left and plat… view at source ↗

**Figure 2.** Figure 2: shows such learning curves for nanoGPT models of various sizes in green. Note that the model size is indirectly represented as the starting point and early slope of each curve: Smaller models reach lower loss values earlier, but also plateau faster – and are hence located more to the left. In theory, L(C) is defined over [0, ∞]. In practice, however, we obtain only a partial curve for each model m ∈ M, wit… view at source ↗

**Figure 3.** Figure 3: Illustration of LMC extrapolation during SH LMC for various noise models. Green solid lines: Performance of a trained model (used for LMC training). Red dashed lines: GP extrapolation beyond the training horizon (i.e. predicted continuation of learning curve). 0.000 0.005 2 0 2 4 6 8 Loss Brownian 0.0 0.2 2 Ornstein-Uhlenbeck 0.00 0.01 2 AWGN SH LMC (5) SH LMC (20) SH (5) SH (20) [PITH_FULL_IMAGE:figures/… view at source ↗

**Figure 4.** Figure 4: Average minimum loss obtained by SH and SH LMC in noisy scenarios. Reported are results for 5 and 20 learning curves (square and circle, respectively). The noise level is denoted by σ 2 , which represents the noise intensity that was steadily increased to generate the noise values. 104 petaFLOPs. We also report the mean (max) ‘relative degradation’ in performance that is faced when using a traditional non… view at source ↗

**Figure 5.** Figure 5: Left: SL obtained after SH LMC vs. ground-truth. Right: SL obtained after LMC extrapolation (closing the gap). Having established that both strategic budget allocation methods, Successive Halving (SH) and SH with surrogate predictions, are well-suited to efficiently obtain scaling laws, we now ask whether the unique predictive capabilities of the surrogate model can be leveraged to gain additional advanta… view at source ↗

**Figure 6.** Figure 6: In each round r the same compute is allocated to all models, and the LCs therefore stop at the same compute value. As previously detailed, only the ‘most-promising’ models selected to train further will be allocated more compute in the next round – resulting in the different end-compute values for the LCs (here between 1016 -1019), based on the number of rounds the models participated in. Note that predict… view at source ↗

**Figure 7.** Figure 7: An example of how SH LMC allocates budget to models in 5 rounds. Ground truth training data shown in green lines, predicted training data shown in dashed green lines, ground truth test data shown in blue and predicted test data shown in dashed red lines [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗

**Figure 8.** Figure 8: For two different datasets: Top: Minimum test loss obtained when allocating compute budget according to SH. Bottom: Minimum test loss obtained when allocating compute budget according to SH LMC. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: An example of the power law, the exponential and the MMF function. The deep ensemble methods approximate the following functions fPL(x) = ax−b + c, fEXP(x) = a exp(−bx) + c, fMMF(x) = (ab + cxd )/ [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Scaling Laws for two different sets of models. J. Noise Models For all noise models, we assume the intensity is weighted from w ∈ [0, ..., 1] applied until either the inclination point of the LC is reached, which we measure using the gradient of the curve, or the end of the available training data points. Note, noise is only added to the synthetic dataset, i.e., this is feasible in this case. The noise n … view at source ↗

**Figure 11.** Figure 11: Noisy LCs for various values of noise intensity σ 2 and noise models. 22 [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Corner cases where SH LMC performs worse than SH assuming 10 models and different total available budget. L. Further Analysis using SH with Surrogate Models (Dataset: nanoGPT) [PITH_FULL_IMAGE:figures/full_fig_p024_12.png] view at source ↗

**Figure 13.** Figure 13: Illustration of the nanoGPT dataset. 0 1 2 3 4 5 6 Compute 1e17 4 5 6 7 8 9 10 11 Loss (a) n = 11 and order 3. 0 1 2 3 4 5 6 Compute 1e17 4 5 6 7 8 9 10 11 Loss (b) n = 51 and order 3. 0 1 2 3 4 5 6 Compute 1e17 4 5 6 7 8 9 10 11 Loss (c) n = 101 and order 3. 0 1 2 3 4 5 6 Compute 1e17 4 5 6 7 8 9 10 11 Loss (d) Original Data. 0.0 0.2 0.4 0.6 0.8 1.0 Compute 0.0 0.2 0.4 0.6 0.8 1.0 Loss (e) Zero-One norma… view at source ↗

**Figure 14.** Figure 14: Illustration of the data preprocessing steps using a SavGol Filter of window length n. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗

**Figure 15.** Figure 15: Area between Curves (AbC) for the compute range of interest being 1013 to 1023 FLOPs (shaded area). The ground truth scaling law according to Kaplan (Pearce & Song, 2024) is compared to an example of a predicted LC. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_15.png] view at source ↗

**Figure 16.** Figure 16: Scaling Law for the full nanoGPT dataset [PITH_FULL_IMAGE:figures/full_fig_p029_16.png] view at source ↗

**Figure 17.** Figure 17: Scaling laws assuming the allocated budget is 105 petaFLOPs and the compute region of interest is between 1018 and 1020 petaFLOPs. a) The obtained scaling law after allocation according to SH in comparison to the two ground truths scaling laws. The first one is the scaling law assuming fully trained learning curves are available (entire LCs, i.e. green and light blue part is available) and the second one … view at source ↗

read the original abstract

Predicting model performance at larger scales enables the design of training strategies and architectures tailored to specific performance targets. Empirical scaling law research identifies functional forms to aid this prediction task. These describe the relationship between loss and compute using a loss-compute frontier defined by learning curves. Due to the empirical nature of this approach, the computational burden is substantial, making strategic resource allocation essential - yet it remains surprisingly underexplored. In this work, we address this shortcoming by exploring the suitability of Successive Halving (SH) and SH combined with parametric and non-parametric surrogate models. In addition to enabling a more systematic allocation of a given compute budget, our findings show that SH paired with surrogate models yields a set of learning curves that includes one with a lower loss-compute value than what naive uniform allocation or an SH-only approach can obtain. Our experiments demonstrate mean relative improvements of up to 2.84% and 5.47% on real-world and synthetic learning curve datasets. This strategic resource allocation enables us to obtain accurate scaling laws at significantly reduced computational costs, saving up to 98.7% over the traditional exhaustive approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using Successive Halving (SH) combined with parametric and non-parametric surrogate models for active budget allocation when estimating scaling laws. By pruning inferior learning curves early based on predicted final performance, the method aims to produce a loss-compute frontier that includes a lower loss-compute point than uniform allocation or SH alone, while achieving mean relative improvements of up to 2.84% on real-world datasets and 5.47% on synthetic datasets and up to 98.7% compute savings versus exhaustive training.

Significance. If the central empirical claims hold after addressing validation gaps, the work would offer a practical reduction in the compute required for scaling-law experiments, allowing more efficient identification of promising architectures and hyperparameters. The combination of SH with surrogates is a natural extension of bandit-style pruning to this domain and could be adopted in large-scale empirical studies.

major comments (2)

[§4, Algorithm 1] §4, Algorithm 1: The surrogate-guided pruning step assumes early predictions reliably identify curves that will ultimately produce the best loss-compute frontier. Because learning curves frequently cross across compute regimes, any mis-ranking at the pruning checkpoint can discard a superior candidate. No ablation on synthetic crossing-curve data or oracle post-pruning comparison is described to quantify this risk, which directly affects the validity of the reported 2.84%/5.47% gains.
[Experimental results] Experimental results section: The abstract and results report concrete percentage improvements and 98.7% savings, yet provide no information on the number of runs, error bars, surrogate training/validation splits, or robustness across random seeds and dataset partitions. This absence makes it impossible to assess whether the gains are statistically reliable or survive different experimental conditions.

minor comments (2)

The definition of the 'loss-compute value' used to select the best frontier point should be stated explicitly in the main text rather than left implicit from the figures.
Table or figure captions would benefit from listing the exact hyperparameter ranges and dataset sizes used for the real-world and synthetic experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, providing the strongest honest defense of the manuscript while acknowledging where revisions are needed to improve clarity and rigor.

read point-by-point responses

Referee: [§4, Algorithm 1] The surrogate-guided pruning step assumes early predictions reliably identify curves that will ultimately produce the best loss-compute frontier. Because learning curves frequently cross across compute regimes, any mis-ranking at the pruning checkpoint can discard a superior candidate. No ablation on synthetic crossing-curve data or oracle post-pruning comparison is described to quantify this risk, which directly affects the validity of the reported 2.84%/5.47% gains.

Authors: We agree that learning-curve crossings represent a genuine risk for any early-pruning method, including ours. The surrogate models (parametric power-law fits and non-parametric regressors) are trained only on the observed prefix of each curve and then extrapolate to the target compute budget; nothing in the current experiments explicitly isolates the effect of crossings. Nevertheless, the reported gains on both real-world and synthetic datasets already incorporate a range of curve shapes, and the surrogate-guided variant still recovers a better loss-compute point than uniform or plain SH allocation. To quantify the risk directly, we will add a new ablation that (i) generates controlled synthetic crossing curves and (ii) compares surrogate-guided pruning against an oracle that knows the true final performances. This analysis will be inserted into §4 and the experimental section of the revised manuscript. revision: yes
Referee: The abstract and results report concrete percentage improvements and 98.7% savings, yet provide no information on the number of runs, error bars, surrogate training/validation splits, or robustness across random seeds and dataset partitions. This absence makes it impossible to assess whether the gains are statistically reliable or survive different experimental conditions.

Authors: We acknowledge that the present manuscript omits the experimental-protocol details needed to judge statistical reliability. In the revised version we will expand the Experimental Results section to state: all quantitative results are averaged over 10 independent random seeds; error bars or shaded regions show one standard deviation; surrogate models are fit with an 80/20 train/validation split on the observed curve prefixes; and the same relative improvements hold across multiple random partitions of both the real-world and synthetic datasets. These additions will be accompanied by a short paragraph discussing sensitivity to seed and partition choice. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured against external baselines

full rationale

The paper presents an empirical method using Successive Halving combined with surrogate models to allocate compute for learning curve collection, then reports measured improvements (2.84% and 5.47% relative) and savings (98.7%) on real-world and synthetic datasets against uniform allocation and SH-only baselines. No equations, fitted parameters, or self-citations are shown that reduce the reported frontier improvements or pruning decisions to the inputs by construction. The central results remain falsifiable comparisons on held-out data rather than self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, invented entities, or detailed axioms are stated in the provided text.

axioms (1)

domain assumption Surrogate models can be trained on partial learning curves to guide pruning decisions in Successive Halving without discarding superior final curves.
This premise is required for the surrogate-guided pruning step described in the abstract.

pith-pipeline@v0.9.0 · 5738 in / 1434 out tokens · 50954 ms · 2026-05-20T14:37:36.572860+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We approach this challenge by building upon the Successive Halving (SH) algorithm ... combined with ... multitask Gaussian Process ... Deep Ensemble methods ... to predict learning curves.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

arg min Lm(C) s.t. CM ≤ B ... Top k(Mr, L̃, η)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.