arxiv: 2605.13421 · v1 · submitted 2026-05-13 · 📊 stat.ME

Recognition: unknown

Combining pre-trained models via localized model averaging

Ziwen Gao , Baihua He , Yuhong Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 17:53 UTC · model grok-4.3

classification 📊 stat.ME

keywords localized model averagingpre-trained modelsasymptotic optimalitycovariate-dependent weightsweight consistencygeneral loss frameworkmodel combinationrisk optimality

0 comments

The pith

Modeling averaging weights as functions of covariates yields asymptotically optimal in-sample and out-of-sample risks when combining pre-trained models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a localized model averaging method in which the weights assigned to different pre-trained models are learned as flexible functions of the input covariates. This formulation lets the averaging procedure adapt to the fact that different models perform better in different contexts. The authors work under a general loss that covers many prediction tasks and prove that the resulting risks are asymptotically optimal both inside and outside the training sample while the estimated weights remain consistent. A sympathetic reader cares because fixed-weight averaging cannot capture how relative model strengths shift with the data, and the localized approach directly addresses that limitation.

Core claim

We introduce localized model averaging where the weights are modeled as functions of the covariates, allowing the procedure to capture varying relative advantages of pre-trained models across heterogeneous contexts. Under a general loss framework, we establish asymptotic optimality for both in-sample and out-of-sample risks together with consistency of the estimated weights.

What carries the argument

Localized weights expressed as functions of covariates and learned under a general loss.

If this is right

The averaging procedure adapts automatically to changes in input context.
Both in-sample and out-of-sample risks converge to the best attainable level.
The estimated weights are consistent for the true optimal local weights.
The same framework applies across a wide range of prediction tasks via the general loss.
No fixed set of weights is required when model rankings shift with covariates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same localized-weight idea could be tested on ensembles of fine-tuned models rather than only off-the-shelf pre-trained ones.
Implementation would require only that the weight functions be parameterized flexibly enough to capture the relevant covariate effects.
If the consistency result holds, practitioners could replace manual model selection with a single fitted weight surface.
Extensions to streaming or non-stationary data would need to check whether the same asymptotic arguments still apply.

Load-bearing premise

The data conditions permit consistent estimation of the covariate-dependent local weights under the chosen general loss.

What would settle it

A dataset or simulation in which the estimated weights fail to converge to the optimal local weights or the achieved risk stays a fixed amount above the oracle risk as sample size grows.

Figures

Figures reproduced from arXiv: 2605.13421 by Baihua He, Yuhong Yang, Ziwen Gao.

**Figure 1.** Figure 1: A motivating example. In machine learning, there is a similar idea known as the mixture of experts (MoE) method. The MoE framework proposed by Jacobs et al. (1991) involves a form of model averaging. MoE consists of a set of experts and a gating network, where the gating network dynamically adjusts the weights according to X to combine the predictions from multiple experts. MoE has been widely applied in l… view at source ↗

**Figure 2.** Figure 2: The true weight functions and the estimated weight functions in setting S1. [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

**Figure 3.** Figure 3: Comparison of MSPE for different methods. [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of the classification accuracy for different methods. [PITH_FULL_IMAGE:figures/full_fig_p023_4.png] view at source ↗

read the original abstract

Many pre-trained models (PTMs) are available in modern applications. Because different PTMs are often trained on different datasets, their performances can vary substantially for different new tasks, and the ranking of the candidates may depend heavily on the input. Motivated by this, we propose a localized model averaging method with weights modeled as functions of the covariates, making it substantially more versatile than existing model averaging methods. This formulation allows the model averaging procedure to adaptively capture the varying relative advantages of different PTMs across heterogeneous contexts. Specifically, we learn flexible local weights under a general loss framework that accommodates a broad class of prediction tasks. We further establish the asymptotic optimality of the proposed method for both in-sample and out-of-sample risks, as well as the consistency of the estimated weights. Extensive numerical experiments further demonstrate the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a localized model averaging procedure for combining pre-trained models, in which the averaging weights are modeled as flexible functions of the covariates rather than global constants. Under a general loss framework, the authors claim to establish asymptotic optimality of the resulting estimator for both in-sample and out-of-sample risks together with consistency of the estimated local weights, and they support these claims with numerical experiments on synthetic and real data.

Significance. If the asymptotic results are rigorously established, the work would provide a statistically grounded method for adaptive combination of pre-trained models that respects heterogeneity in covariate space, extending classical model averaging to settings where relative model performance varies locally. The general-loss formulation and out-of-sample optimality claim would be particularly useful for modern prediction pipelines.

major comments (2)

[§3.2, Theorem 3.2] §3.2, Theorem 3.2: the out-of-sample asymptotic optimality result requires uniform convergence of the nonparametric local-weight estimators over the entire covariate support, yet the stated regularity conditions do not explicitly include the Hölder smoothness order of the weight functions or the precise bandwidth rates needed to guarantee the uniform rate; without these, the oracle-risk property may fail in regions of low design density.
[Assumption 2.3] Assumption 2.3 and the proof of consistency: the conditions allowing consistent estimation of the local weights under a general loss are given, but it is not shown that these conditions are sufficient to control the remainder term when the loss is non-smooth or when the covariate density is unbounded, which is load-bearing for the claimed out-of-sample optimality.

minor comments (2)

[§2] The notation for the local weight functions w_k(x) is introduced without an explicit statement of the dimension of x or the support of the covariate distribution, which affects readability of the subsequent convergence arguments.
[§5] In the numerical experiments, the tables reporting risk values do not include standard errors or the number of Monte Carlo replications, making it difficult to assess the statistical significance of the reported improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments, which help strengthen the rigor of our asymptotic results. We address each major comment below and will revise the manuscript to incorporate the necessary clarifications and additions.

read point-by-point responses

Referee: [§3.2, Theorem 3.2] §3.2, Theorem 3.2: the out-of-sample asymptotic optimality result requires uniform convergence of the nonparametric local-weight estimators over the entire covariate support, yet the stated regularity conditions do not explicitly include the Hölder smoothness order of the weight functions or the precise bandwidth rates needed to guarantee the uniform rate; without these, the oracle-risk property may fail in regions of low design density.

Authors: We agree that the regularity conditions in the manuscript are incomplete for guaranteeing uniform convergence over the full covariate support. In the revised version, we will explicitly augment the assumptions to include the Hölder smoothness order α of the weight functions and specify the bandwidth rates (e.g., h_n = O(n^{-1/(2α + d)}) with n h_n^d → ∞) required for the uniform rate. We will add a supporting lemma establishing sup-norm convergence of the local-weight estimators, incorporating standard trimming or boundary corrections to handle low-density regions, thereby ensuring the oracle-risk property holds uniformly. revision: yes
Referee: [Assumption 2.3] Assumption 2.3 and the proof of consistency: the conditions allowing consistent estimation of the local weights under a general loss are given, but it is not shown that these conditions are sufficient to control the remainder term when the loss is non-smooth or when the covariate density is unbounded, which is load-bearing for the claimed out-of-sample optimality.

Authors: The referee correctly notes that the current proof does not explicitly bound the remainder term under non-smooth losses or unbounded densities. We will revise the proof of consistency under Assumption 2.3 to include these controls: we will add the assumption that the loss is uniformly Lipschitz continuous (standard for general losses and sufficient to handle non-smoothness) and restrict attention to compact sets where the covariate density is bounded away from zero and infinity, with a brief discussion of tail truncation for unbounded cases. These additions will make the out-of-sample optimality claim rigorous. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes localized model averaging with covariate-dependent weights under a general loss, then claims to establish asymptotic optimality for in-sample/out-of-sample risks plus weight consistency via theoretical analysis. No steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the optimality follows from standard consistency arguments under stated data conditions rather than renaming or smuggling ansatzes. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review limits visibility; method rests on standard statistical assumptions for asymptotic results and a general loss framework.

axioms (2)

domain assumption General loss framework accommodates broad class of prediction tasks
Explicitly stated as allowing flexible local weights under general loss.
domain assumption Data distribution permits consistent estimation of local weights
Required for the claimed consistency of estimated weights.

pith-pipeline@v0.9.0 · 5436 in / 1052 out tokens · 34947 ms · 2026-05-14T17:53:50.959691+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

202 extracted references · 138 canonical work pages · 5 internal anchors

[1]

The Annals of Statistics , volume=

Functional aggregation for nonparametric regression , author=. The Annals of Statistics , volume=. 2000 , publisher=

2000
[2]

Combining forecasting procedures:

Yang, Yuhong , journal=. Combining forecasting procedures:. 2004 , publisher=

2004
[3]

Journal of Econometrics , volume=

Adaptively combined forecasting for discrete response time series , author=. Journal of Econometrics , volume=. 2013 , publisher=

2013
[4]

Journal of the American Statistical Association , volume=

Adaboost semiparametric model averaging prediction for multiple categories , author=. Journal of the American Statistical Association , volume=. 2022 , publisher=

2022
[5]

Economics Letters , year=2017, volume=

Xie, Tian , title=. Economics Letters , year=2017, volume=

2017
[6]

Model averaging based on

Zhang, Xinyu and Zou, Guohua and Carroll, Raymond J , journal=. Model averaging based on. 2015 , publisher=

2015
[7]

Economics Letters , volume=

Prediction model averaging estimator , author=. Economics Letters , volume=. 2015 , publisher=

2015
[8]

Journal of Econometrics , volume=

Toward optimal model averaging in regression models with time series errors , author=. Journal of Econometrics , volume=. 2015 , publisher=

2015
[9]

Journal of Applied Econometrics , pages=

Feasible cross-validatory model selection for general stationary processes , author=. Journal of Applied Econometrics , pages=. 1997 , publisher=

1997
[10]

1952 , publisher=

Inequalities , author=. 1952 , publisher=

1952
[11]

, title =

Liu, C.-A. , title =. Journal of Econometrics , year =
[12]

Carroll , title =

Hua Liang, and Suojin Wang, and Raymond J. Carroll , title =. Biometrika , year =
[13]

and Linton, O

Li, D. and Linton, O. and Lu, Z. , title =. Journal of Econometrics , year =
[14]

Journal of Nonparametric Statistics , year =

Na Li, and Xingzhong Xu, and Pei Jin , title =. Journal of Nonparametric Statistics , year =
[15]

Hansen, B. E. , title =. Quantitative Economics , year =
[16]

IEEE Transactions on Information Theory , volume=

Information theory and mixing least-squares regressions , author=. IEEE Transactions on Information Theory , volume=. 2006 , publisher=

2006
[17]

Review of Finance , year =

Dieckmann, Stephan and Plank, Thomas , title =. Review of Finance , year =
[18]

Journal of the American Statistical Association , volume=

Semiparametric estimates of the relation between weather and electricity sales , author=. Journal of the American Statistical Association , volume=. 1986 , publisher=

1986
[19]

Journal of Multivariate Analysis , volume=

Local linear estimation in partly linear models , author=. Journal of Multivariate Analysis , volume=. 1997 , publisher=

1997
[20]

and Su, L

Lu, X. and Su, L. , year =. Jackknife model averaging for quantile regressions , journal =
[21]

2000 , publisher=

Partially linear models , author=. 2000 , publisher=

2000
[22]

Magnus, J. R. and Wan, A. T. K. and Zhang, X. , journal=. Weighted average least squares estimation with nonspherical disturbances and an application to the. 2011 , publisher=

2011
[23]

Theory of Probability & Its Applications , volume=

Bounds for the moments of linear and quadratic forms in independent variables , author=. Theory of Probability & Its Applications , volume=. 1960 , publisher=

work page 1960
[24]

Econometrica , volume =

Root-N-Consistent Semiparametric Regression , author =. Econometrica , volume =

work page
[25]

Econometric Theory , year=2005, volume=

Juhl, Ted and Xiao, Zhijie , title=. Econometric Theory , year=2005, volume=

work page 2005
[26]

Journal of Econometrics , year=1996, volume=

Li, Qi and Stengos, Thanasis , title=. Journal of Econometrics , year=1996, volume=

work page 1996
[27]

Journal of Econometrics , year=2010, volume=

Su, Liangjun and Jin, Sainan , title=. Journal of Econometrics , year=2010, volume=

work page 2010
[28]

Econometric Theory , year=2010, volume=

Su, Liangjun and White, Halbert , title=. Econometric Theory , year=2010, volume=

2010
[29]

, title=

Li, Qi and Wooldridge, Jeffrey M. , title=. Econometric Theory , year=2002, volume=

work page 2002
[30]

Annals of Economics and Finance , year=2005, volume=

Yiguo Sun , title=. Annals of Economics and Finance , year=2005, volume=

2005
[31]

Econometric Theory , volume =

Lee,Sokbae , title =. Econometric Theory , volume =. 2003 , pages =

2003
[32]

Computational Statistics & Data Analysis , volume =

Hua Liang , title =. Computational Statistics & Data Analysis , volume =. 2006 , pages =

2006
[33]

Krief , title =

Jerome M. Krief , title =. Econometric Theory , volume =. 2013 , pages =

work page 2013
[34]

Journal of the Royal Statistical Society

Spline Smoothing in a Partly Linear Model , author =. Journal of the Royal Statistical Society. Series B (Methodological) , volume =

work page
[35]

and Magnus, Jan R

Abadir, Karim M. and Magnus, Jan R. , title =

work page
[36]

Gijbels , title =

Jianqing Fan, and I. Gijbels , title =

work page
[37]

Estimation in a semiparametric partially linear errors-in-variables model , author=

work page
[38]

Econometrica , volume =

Household Gasoline Demand in the United States , author =. Econometrica , volume =

work page
[39]

Variable Selection in Nonparametric Varying-Coefficient Models for Analysis of Repeated Measurements , author=

work page
[40]

Statistica Sinica , volume=

Moment-Based Method For Random Effects Selection In Linear Mixed Models , author=. Statistica Sinica , volume=

work page
[41]

New Estimation and Model Selection Procedures for Semiparametric Modeling in Longitudinal Data Analysis , author=

work page
[42]

Variable selection in semiparametric regression modeling , author=
[43]

Akaike , title =

H. Akaike , title =. 1973 , journal =

1973
[44]

Operations Research Quarterly , volume =

The Combination of Forecasts , author =. Operations Research Quarterly , volume =
[45]

Biometrics , volume=

Joint Variable Selection for Fixed and Random Effects in Linear Mixed-Effects Models , author=. Biometrics , volume=. 2010 , publisher=

2010
[46]

and Carroll, R

Claeskens, G. and Carroll, R. J. , year = 2007, title =

2007
[47]

and Croux, C

Claeskens, G. and Croux, C. and. Variable selection for logistic regression using a prediction-focused information criterion , journal = bioc, volume = 62, number=4, pages =
[48]

and Hjort, N

Claeskens, G. and Hjort, N. L. , title =

work page
[49]

P. J. Green, and Bernard. W. Silverman , title =

work page
[50]

Donohue, M. C. and Overholser, R. and Xu, R. and Vaida, F. , title =. Biometrika , volume =. 2011 , pages =

2011
[51]

and Kneib, T

Greven, S. and Kneib, T. , title =. 2010 , pages =

2010
[52]

, title=

Chang, Roberto and Kaltani, Linda and Loayza, Norman V. , title=. Journal of Development Economics , year=2009, volume=

2009
[53]

Hjort, N. L. and Claeskens, G. , year = 2003, title =

work page 2003
[54]

Biometrics , volume=

Fixed and random effects selection in mixed effects models , author=. Biometrics , volume=. 2011 , publisher=

work page 2011
[55]

A forecast comparison of residential housing prices by parametric versus semiparametric conditional mean estimators , journal=

Gen. A forecast comparison of residential housing prices by parametric versus semiparametric conditional mean estimators , journal=

work page
[56]

, journal=

Andrews, D.W.K. , journal=. Asymptotic optimality of generalized. 1991 , publisher=

work page 1991
[57]

Biometrika , volume=

Single-index model selections , author=. Biometrika , volume=. 2001 , publisher=

work page 2001
[58]

and Wu, H

Liang, H. and Wu, H. and Zou, G. , year = 2008, title =

2008
[59]

and Zou, G

Liang, H. and Zou, G. and Wan, A. T. K. and Zhang, X. , year =. Optimal weight choice for frequentist model average estimators , journal = jasa, volume =
[60]

Journal of Applied Econometrics , volume =

Tan, Chih Ming , title =. Journal of Applied Econometrics , volume =
[61]

Magnus, and Wendun Wang , title =

Jan R. Magnus, and Wendun Wang , title =. Oxford Bulletin of Economics and Statistics , year =

work page
[62]

Determinants of Long-Term Growth: A

Xavier. Determinants of Long-Term Growth: A. American Economic Review , volume =

work page
[63]

McMahon and M

D. McMahon and M. Lederman and D. W. Haas and R. Haubrich and J. Stanford and E. Cooney and J. Horton and D. Kelleher and L. Ross and A. Cutrell and D. Lee and W. Spreen and J. W. Mellors , title =. Antiviral Therapy , year =

work page
[64]

Berger , title =

James O. Berger , title =
[65]

Miller, A. J. , title =

work page
[66]

David Ruppert, and M. P. Wand, and R. J. Carroll , title =
[67]

N., and Johnson, P

Durlauf, S. N., and Johnson, P. A., and Temple, J. R. W , title =. Handbook of Economic Growth , address =

work page
[68]

Adonis Yatchew , title =

work page
[69]

Yatchew , title=

A. Yatchew , title=. Journal of Applied Econometrics , year=2000, volume=

2000
[70]

Econometrica , year=2001, volume=

Adonis Yatchew and Joungyeo Angela No , title=. Econometrica , year=2001, volume=

work page 2001
[71]

and Niu, X.-F

Pu, W. and Niu, X.-F. , title =. Journal of Multivariate Analysis , volume =. 2006 , pages =

work page 2006
[72]

Robinson, G. K. , title =. Statistical Science , volume =. 1991 , pages =

work page 1991
[73]

Schwarz , title =

G. Schwarz , title =. 1978 , volume =

work page 1978
[74]

and Blanchard, S

Vaida, F. and Blanchard, S. , title =. Biometrika , volume =. 2005 , pages =

work page 2005
[75]

Wan, A. T. K. and Zhang, X. and Zou, G. , title =. Journal of Econometrics , year =

work page
[76]

, title =

Yang, Y. , title =. 2001 , number =

work page 2001
[77]

, title =

Yang, Y. , title =. Statistica Sinica , volume =. 2003 , pages =

work page 2003
[78]

and Yang, Y

Yuan, Z. and Yang, Y. , title =. 2005 , number=

work page 2005
[79]

and Liang, H

Zhang, X. and Liang, H. , year =. Focused information criterion and model averaging for generalized additive partial linear models , journal = annals, volume =

work page
[80]

and Wan, A

Zhang, X. and Wan, A. T. K. and Zhou, S. Z. , title =. Journal of Business & Economic Statistics , volume =. 2012 , pages =

work page 2012

Showing first 80 references.