Early-stopped aggregation: Adaptive inference with computational efficiency

Ilsang Ohn; Jungbin Jun; Lizhen Lin; Shitao Fan

arxiv: 2604.14404 · v1 · submitted 2026-04-15 · 🧮 math.ST · stat.ME· stat.ML· stat.TH

Early-stopped aggregation: Adaptive inference with computational efficiency

Ilsang Ohn , Shitao Fan , Jungbin Jun , Lizhen Lin This is my paper

Pith reviewed 2026-05-10 11:38 UTC · model grok-4.3

classification 🧮 math.ST stat.MEstat.MLstat.TH

keywords early-stopped aggregationadaptive inferencevariational Bayesmodel selectionpenalized estimationcontraction ratescomputational efficiencyenergy functional

0 comments

The pith

Early-stopped aggregation computes only simpler models to reach optimal adaptive rates in both Bayesian and frequentist inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces early-stopped aggregation to avoid computing estimators for all candidate models, including unnecessarily complex ones, when the true data-generating process is simple. Instead, an early-stopping criterion limits computation to a small number of simpler models whose estimators are then aggregated for final inference. The framework covers variational Bayes, variational empirical Bayes, and frequentist penalized or sample-splitting aggregation. It establishes optimal adaptive contraction rates under mild conditions and shows that both Bayesian and frequentist versions are driven by the same energy functional balancing data fit against complexity. A reader would care because the method delivers statistically optimal adaptive inference while substantially reducing the computational cost of exploring large model classes.

Core claim

Early-stopped aggregation achieves optimal adaptive contraction rates in the variational Bayes setting under mild conditions by computing and aggregating only a small number of simpler estimators selected via an early-stopping criterion rather than all candidates. The same construction extends to variational empirical Bayes with data-dependent hyperparameters and to frequentist aggregation through penalization or sample splitting. Bayesian and frequentist procedures are unified by a common energy functional that includes a data-fitting term and a complexity-control term.

What carries the argument

The early-stopped aggregation (ESA) procedure, which applies an early-stopping criterion to restrict computation and aggregation to simpler models and is governed by an energy functional combining data fit with complexity control.

Load-bearing premise

The early-stopping criterion can reliably identify when further model complexity is unnecessary without overlooking important structure in the data.

What would settle it

A concrete dataset or simulation in which the true process requires high complexity yet the early-stopping rule stops too soon, producing slower-than-optimal contraction rates or degraded inference accuracy.

Figures

Figures reproduced from arXiv: 2604.14404 by Ilsang Ohn, Jungbin Jun, Lizhen Lin, Shitao Fan.

read the original abstract

When considering a model selection or, more generally, an aggregation approach for adaptive statistical inference, it is often necessary to compute estimators over a wide range of model complexities including unnecessarily large models even when the true data-generating process is relatively simple, due to the lack of prior knowledge. This requirement can lead to substantial computational inefficiency. In this work, we propose a novel framework for efficient model aggregation called the early-stopped aggregation (ESA): instead of computing and aggregating estimators for all candidate models, we compute only a small number of simpler ones using an early-stopping criterion and aggregate only these for final inference. Our framework is versatile and applies to both Bayesian model selection, in particular, within the variational Bayes framework, and frequentist estimation, including a general penalized estimation setting. We investigate adaptive optimal property of the ESA approach across three learning paradigms. We first show that ESA achieves optimal adaptive contraction rates in the variational Bayes setting under mild conditions. We extend this result to variational empirical Bayes, where prior hyperparameters are chosen in a data-dependent manner. In addition, we apply the ESA approach to frequentist aggregation including both penalization-based and sample-splitting implementations, and establish corresponding theory. As we demonstrate, there is a clear unification between early-stopped Bayes and frequentist penalized aggregation, with a common "energy" functional comprising a data-fitting term and a complexity-control term that drives both procedures. We further present several applications and numerical studies that highlight the efficiency and strong performance of the proposed approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces early-stopped aggregation to cut compute in model aggregation while claiming optimal adaptive rates, with a unification to penalized estimation via an energy functional.

read the letter

The core idea is straightforward: instead of fitting and aggregating over every model complexity up to some large cutoff, stop early once a simple criterion says further models add nothing useful, then aggregate only what you have computed. This applies to variational Bayes, empirical Bayes, and frequentist penalized or sample-split aggregation, and the authors tie the procedures together by showing they all minimize versions of the same energy functional that balances data fit against complexity. That unification and the early-stopping rule itself look like the genuinely new pieces relative to standard full aggregation work. The theory claims optimal contraction rates in the variational setting and corresponding results for the frequentist cases, plus some applications and numerical checks that presumably show the expected speedups without much loss in accuracy. Those parts are useful if they hold up, especially for people running aggregation on large candidate sets. The soft spot is the repeated appeal to mild conditions for the rate results. The abstract does not spell out what those conditions require on the variational family, the prior, or how the stopping threshold interacts with approximation error, so it is not yet clear whether the early-stopping rule reliably avoids underfitting in the nonparametric or high-dimensional regimes where variational methods are most needed. If the full proofs make the conditions explicit and checkable, that concern shrinks; otherwise the central claim rests on an unverified assumption about when stopping is safe. This is the kind of paper a reader working on adaptive inference or efficient variational methods would want to see, because the computational angle is practical and the unification gives a clean conceptual frame. It is coherent enough on its own terms to deserve referee time rather than a desk reject, though the conditions and the robustness of the stopping rule will need close checking in review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes early-stopped aggregation (ESA) as a framework for computationally efficient adaptive inference. Instead of aggregating over a full range of model complexities, ESA uses an early-stopping criterion to compute and aggregate only simpler models. The paper claims that ESA attains optimal adaptive contraction rates in the variational Bayes setting under mild conditions, extends the result to variational empirical Bayes, and establishes corresponding theory for frequentist penalized estimation and sample-splitting aggregation. It further presents a unification of the Bayesian and frequentist procedures through a shared energy functional consisting of a data-fitting term and a complexity-control term, supported by applications and numerical studies.

Significance. If the theoretical guarantees hold, the work provides a practical route to adaptive inference that avoids unnecessary computation on overly complex models while preserving optimality. The explicit unification via the energy functional supplies a conceptual bridge between variational Bayes and penalized frequentist aggregation, which could facilitate cross-paradigm method transfer in high-dimensional or nonparametric settings where full aggregation is prohibitive.

major comments (2)

[Abstract and theoretical results] Abstract and main theoretical claims: the assertion that ESA achieves optimal adaptive contraction rates in the variational Bayes setting rests on unspecified 'mild conditions' (e.g., requirements on the variational family, prior, or relation between stopping threshold and approximation error). Without an explicit list or verification that the early-stopping rule selects sufficient complexity while keeping the variational approximation accurate, it is impossible to confirm the result is not driven by hidden restrictions that would fail for typical nonparametric targets.
[Unification discussion] Unification section: the claim of a clear unification between early-stopped variational Bayes and frequentist penalized aggregation via a common energy functional is presented conceptually, but it remains unclear whether the frequentist result is derived independently or reduces to the Bayesian case by construction. This distinction is load-bearing for assessing whether the unification adds independent grounding or is tautological.

minor comments (2)

[Numerical studies] Numerical studies are referenced but lack detail in the abstract; ensure that tables or figures explicitly report computational savings (e.g., number of models evaluated) alongside statistical performance metrics to substantiate the efficiency claims.
[Method and theory sections] Notation for the energy functional should be introduced with a single consistent definition early in the manuscript to avoid ambiguity when comparing Bayesian and frequentist versions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments, which have helped us identify areas for improvement in clarity and rigor. We address each major comment point by point below, proposing targeted revisions to strengthen the manuscript while preserving its core contributions. All changes will be incorporated in the revised version.

read point-by-point responses

Referee: [Abstract and theoretical results] Abstract and main theoretical claims: the assertion that ESA achieves optimal adaptive contraction rates in the variational Bayes setting rests on unspecified 'mild conditions' (e.g., requirements on the variational family, prior, or relation between stopping threshold and approximation error). Without an explicit list or verification that the early-stopping rule selects sufficient complexity while keeping the variational approximation accurate, it is impossible to confirm the result is not driven by hidden restrictions that would fail for typical nonparametric targets.

Authors: We appreciate this point and agree that greater explicitness will enhance the accessibility and verifiability of the results. While the manuscript describes the conditions as 'mild' to highlight their broad applicability (e.g., standard posterior contraction assumptions plus a controllable variational approximation error), we acknowledge that an enumerated list would be preferable. In the revision, we will introduce a new subsection (2.3) that explicitly lists the assumptions: (A1) the variational family achieves approximation error o(1) relative to the target contraction rate; (A2) the prior satisfies the usual entropy and prior mass conditions for adaptive rates; (A3) the early-stopping threshold is set to exceed the oracle complexity by a log n factor, ensuring the selected model is at least as rich as needed. We will also add a remark with verification for standard nonparametric settings (e.g., Gaussian process regression) showing that the early-stopping rule preserves the required accuracy. These additions will be cross-referenced in the abstract and main theorems. revision: yes
Referee: [Unification discussion] Unification section: the claim of a clear unification between early-stopped variational Bayes and frequentist penalized aggregation via a common energy functional is presented conceptually, but it remains unclear whether the frequentist result is derived independently or reduces to the Bayesian case by construction. This distinction is load-bearing for assessing whether the unification adds independent grounding or is tautological.

Authors: We thank the referee for raising this important clarification. The frequentist theory in Section 4 is derived independently from first principles in the penalized M-estimation framework, without invoking any variational posterior or Bayesian elements; the energy functional is then observed to match the variational objective as a consequence. To address the concern, we will revise the unification paragraph to explicitly state this independent derivation and add a short proposition (new Proposition 4.3) showing that the frequentist estimator arises as a specific limit of the variational procedure under a Dirac variational family, but not conversely. This establishes that the unification provides a genuine conceptual bridge rather than a tautology, enabling technique transfer in both directions. The revised text will emphasize the independent grounding of each result. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in the derivation chain.

full rationale

The paper derives adaptive contraction rates for early-stopped aggregation in variational Bayes and frequentist penalized settings by establishing theorems under explicitly stated mild conditions, with the common energy functional serving as an independent conceptual unification rather than a definitional reduction of one result to the other. No load-bearing self-citations, self-definitional steps, or fitted parameters renamed as predictions appear in the central claims; the results are presented as holding via separate theoretical arguments for each paradigm, with the unification as an observed structural similarity rather than a forced equivalence by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the approach relies on standard variational Bayes and penalized estimation assumptions not detailed here.

pith-pipeline@v0.9.0 · 5580 in / 1042 out tokens · 18988 ms · 2026-05-10T11:38:30.609801+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

An approach to large-scale quasi-Bayesian inference with spike-and-slab priors.arXiv preprint arXiv:1803.10282,

Yves A Atchad ´e and Anwesha Bhattacharyya. An approach to large-scale quasi-Bayesian inference with spike-and-slab priors.arXiv preprint arXiv:1803.10282,

work page arXiv
[2]

Vladimir Koltchinskii.Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole D’Et´e de Probabilit´es de Saint-Flour XXXVIII-2008, volume

work page 2008
[3]

Bayesian model selection consistency and oracle inequality with intractable marginal likelihood.arXiv preprint arXiv:1701.00311,

Yun Yang and Debdeep Pati. Bayesian model selection consistency and oracle inequality with intractable marginal likelihood.arXiv preprint arXiv:1701.00311,

work page arXiv
[4]

Convergence rates of variational posterior distributions.The Annals of Statistics, 48(4):2180 – 2207, 2020a

Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions.The Annals of Statistics, 48(4):2180 – 2207, 2020a. Fengshuo Zhang and Chao Gao. Convergence rates of empirical Bayes posterior distributions: A variational Ohn, Fan, Jun, and Lin/Early-stopped aggregation28 perspective.arXiv preprint arXiv:2009.03969, 2020b. Tong Zhang....

work page arXiv 2009
[5]

EARLY-STOPPED AGGREGATION: ADAPTIVE INFERENCE WITH COMPUTATIONAL EFFICIENCY

Ohn, Fan, Jun, and Lin/Early-stopped aggregationS-1 SUPPLEMENT TO“EARLY-STOPPED AGGREGATION: ADAPTIVE INFERENCE WITH COMPUTATIONAL EFFICIENCY” Ilsang Ohn, Shitao Fan, Jungbin Jun and Lizhen Lin Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation and background . . . . . . . ....

work page 2007
[6]

Baselines and resultsFor comparison, we consider full aggregation and best model selection as we did in the other experiments. We also consider a fixed LoRA adapter with a larger training budget (LoRA-Large), which uses the same configuration as thek“4candidate but is trained for a substantially larger number of optimization steps. This provides a strong ...

work page 2025
[7]

Lower NLL and time indicate better performance, while higher accuracy is preferred. Dataset Method Test NLLÓTest AccÒTime (min)Ó WT2 ESA 3.2728˘0.0002 0.4024˘0.0001160.8˘68.4 FA 3.2708˘0.0002 0.4028˘0.0002 202.8˘85.4 MS 3.2701˘0.0001 0.4028˘0.0001 202.8˘85.4 LoRA-Large3.2263˘0.0003 0.4087˘0.0002235.3˘74.8 Base 3.7750˘0.0000 0.3375˘0.0000 0.0˘0.0 WT103 ESA...

work page 2021

[1] [1]

An approach to large-scale quasi-Bayesian inference with spike-and-slab priors.arXiv preprint arXiv:1803.10282,

Yves A Atchad ´e and Anwesha Bhattacharyya. An approach to large-scale quasi-Bayesian inference with spike-and-slab priors.arXiv preprint arXiv:1803.10282,

work page arXiv

[2] [2]

Vladimir Koltchinskii.Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole D’Et´e de Probabilit´es de Saint-Flour XXXVIII-2008, volume

work page 2008

[3] [3]

Bayesian model selection consistency and oracle inequality with intractable marginal likelihood.arXiv preprint arXiv:1701.00311,

Yun Yang and Debdeep Pati. Bayesian model selection consistency and oracle inequality with intractable marginal likelihood.arXiv preprint arXiv:1701.00311,

work page arXiv

[4] [4]

Convergence rates of variational posterior distributions.The Annals of Statistics, 48(4):2180 – 2207, 2020a

Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions.The Annals of Statistics, 48(4):2180 – 2207, 2020a. Fengshuo Zhang and Chao Gao. Convergence rates of empirical Bayes posterior distributions: A variational Ohn, Fan, Jun, and Lin/Early-stopped aggregation28 perspective.arXiv preprint arXiv:2009.03969, 2020b. Tong Zhang....

work page arXiv 2009

[5] [5]

EARLY-STOPPED AGGREGATION: ADAPTIVE INFERENCE WITH COMPUTATIONAL EFFICIENCY

Ohn, Fan, Jun, and Lin/Early-stopped aggregationS-1 SUPPLEMENT TO“EARLY-STOPPED AGGREGATION: ADAPTIVE INFERENCE WITH COMPUTATIONAL EFFICIENCY” Ilsang Ohn, Shitao Fan, Jungbin Jun and Lizhen Lin Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation and background . . . . . . . ....

work page 2007

[6] [6]

Baselines and resultsFor comparison, we consider full aggregation and best model selection as we did in the other experiments. We also consider a fixed LoRA adapter with a larger training budget (LoRA-Large), which uses the same configuration as thek“4candidate but is trained for a substantially larger number of optimization steps. This provides a strong ...

work page 2025

[7] [7]

Lower NLL and time indicate better performance, while higher accuracy is preferred. Dataset Method Test NLLÓTest AccÒTime (min)Ó WT2 ESA 3.2728˘0.0002 0.4024˘0.0001160.8˘68.4 FA 3.2708˘0.0002 0.4028˘0.0002 202.8˘85.4 MS 3.2701˘0.0001 0.4028˘0.0001 202.8˘85.4 LoRA-Large3.2263˘0.0003 0.4087˘0.0002235.3˘74.8 Base 3.7750˘0.0000 0.3375˘0.0000 0.0˘0.0 WT103 ESA...

work page 2021