A Divergence-Based Method for Weighting and Averaging Model Predictions

Olav Benjamin Vassend

arxiv: 2604.24172 · v1 · submitted 2026-04-27 · 📊 stat.ML · cs.LG· stat.ME

A Divergence-Based Method for Weighting and Averaging Model Predictions

Olav Benjamin Vassend This is my paper

Pith reviewed 2026-05-08 01:15 UTC · model grok-4.3

classification 📊 stat.ML cs.LGstat.ME

keywords model averagingdivergence minimizationensemble methodssmall sample performanceprobabilistic predictionsmodel weightingstatistical learning

0 comments

The pith

Model weights derived by minimizing a divergence yield averaged predictions that match or exceed standard methods especially in small samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to compute weights for averaging probabilistic predictions from multiple models by minimizing a divergence between the weighted combination and a target derived from the data. The technique applies regardless of how the individual models were fitted, whether through frequentist, Bayesian, or other procedures. Tests show the resulting averaged predictions perform at least as well as model stacking and Akaike-style weighting, with a clear edge when the number of observations is limited. A supporting theoretical analysis accounts for the small-sample advantage by linking the divergence criterion to more stable weight selection. Readers would care because more reliable averaging can improve combined forecasts in settings where data collection is costly or sparse.

Core claim

The paper shows that weights obtained through a minimum divergence framework produce averaged model predictions that are competitive with or superior to those from established techniques such as stacking and exponentiated Akaike weighting. The advantage is most pronounced in small-sample regimes, where the divergence-based weights avoid the instability that affects likelihood-driven alternatives. The framework is presented as general, working for any collection of models and any fitting method, with theory explaining the performance gain through properties of the divergence minimization step.

What carries the argument

The minimum divergence framework, which selects model weights so that the averaged prediction is as close as possible to a data-based target in a chosen divergence measure.

If this is right

Averaged predictions from multiple models become more accurate or stable when the number of observations is small.
The weighting procedure applies without modification to models fitted by frequentist, Bayesian, or alternative methods.
Theoretical analysis ties the small-sample gain to the divergence criterion avoiding overfitting of the weights themselves.
Combined probabilistic forecasts can be obtained for any collection of models using the same weighting rule.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach might be applied to deep learning ensembles where training data is often limited relative to model complexity.
Choice of divergence measure could be tuned to specific prediction tasks to further reduce error.
The method may reduce reliance on cross-validation for determining how much to trust each model in an average.
Similar divergence ideas could be explored for weighting in time-series or spatial prediction problems.

Load-bearing premise

That minimizing a divergence between the averaged prediction and a data target produces weights that genuinely improve averaging accuracy across sample sizes and model types.

What would settle it

Apply the method and standard baselines to several small-sample datasets with held-out test data and measure whether the divergence-weighted averages fail to show lower prediction error than the comparators.

Figures

Figures reproduced from arXiv: 2604.24172 by Olav Benjamin Vassend.

**Figure 2.** Figure 2: Standard deviation of model weights across view at source ↗

**Figure 3.** Figure 3: First simulation: each model consists of a single predictor. view at source ↗

**Figure 4.** Figure 4: Second simulation: the models are increasingly larger subsets of the set of all predictors. view at source ↗

**Figure 5.** Figure 5: Top: c = 2 (black circles); c = 1 (blue triangles); c = 0.5 (red squares) Bottom left: Brier divergence view at source ↗

**Figure 6.** Figure 6: Blue dotted line: Divergence-based model weighting Red dotted line: Stacking with the log score view at source ↗

read the original abstract

This paper uses a minimum divergence framework to introduce a new way of calculating model weights that can be used to average probabilistic predictions from statistical and machine learning models. The method is general and can be applied regardless of whether the models under consideration are fit to data using frequentist, Bayesian, or some other fitting method. The proposed method is motivated in two different ways and is shown empirically to perform better than or on a par with standard model averaging methods, including model stacking and model averaging that relies on Akaike-style negative exponentiated model weighting, especially when the sample size is small. Our theoretical analysis explains why the method has a small-sample advantage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a minimum-divergence weighting scheme for averaging model predictions that shows a plausible small-sample edge over stacking and AIC-style methods.

read the letter

The main takeaway is that this paper introduces a minimum divergence approach to setting weights when averaging probabilistic predictions from multiple models. It claims this works at least as well as stacking or negative-exponentiated AIC weighting and pulls ahead when sample sizes are small, with a theoretical account for why that happens. The method is presented as general enough to apply whether the models were fit by frequentist, Bayesian, or other routes.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a minimum-divergence framework for computing weights to average probabilistic predictions from statistical and machine learning models. The method is presented as general across frequentist, Bayesian, and other fitting paradigms. It is motivated in two ways, empirically compared to standard approaches including model stacking and Akaike-style negative-exponentiated weighting (with claimed advantages especially at small sample sizes), and supported by a theoretical analysis explaining the small-sample behavior.

Significance. If the empirical results and theoretical account hold under scrutiny, the work supplies a unified, divergence-based weighting scheme that could improve ensemble performance in small-n settings common to many applications. The cross-paradigm generality and the combination of empirical comparisons with a theoretical explanation for small-sample gains constitute the primary strengths.

minor comments (2)

[Abstract] Abstract: the empirical claims would be strengthened by briefly indicating the number of datasets, models, and sample-size regimes examined, as well as whether error bars or statistical tests accompany the reported performance differences.
The notation for the divergence functional and the resulting weight formula should be introduced with an explicit example in the main text to aid readers unfamiliar with minimum-divergence methods.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our manuscript and for recommending minor revision. The assessment correctly identifies the core contributions: a general minimum-divergence framework for weighting probabilistic predictions, its applicability across fitting paradigms, empirical advantages over stacking and Akaike-style weighting at small sample sizes, and the accompanying theoretical explanation.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper motivates a minimum-divergence weighting scheme independently of the target performance claims, presents it as applicable across fitting paradigms, and supports the small-sample advantage via separate empirical comparisons to stacking and Akaike-style weighting plus a distinct theoretical analysis. No load-bearing step reduces by construction to a fitted parameter, self-definition, or self-citation chain; the central result is externally benchmarked rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5400 in / 1051 out tokens · 60810 ms · 2026-05-08T01:15:35.810631+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages

[1]

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle

work page 1973
[2]

Akaike, H. (1974). A new look at the statistical model identification . IEEE Transactions on Automatic Control\/ 19\/ (6), 716--723

work page 1974
[3]

Akaike, H. (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting . Biometrika\/ 66\/ (2), 237--242

work page 1979
[4]

Bissiri, P. G., C. Holmes, and S. Walker (2016). A general framework for updating belief distributions . Journal of the Royal Statistical Society. Series B (Methodological)\/ 78\/ (5), 1103--1130

work page 2016
[5]

Breiman, L. (1996). Stacked regressions . Machine Learning\/ 24\/ (1), 49--64

work page 1996
[6]

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability . Monthly Weather Review\/ 7\/ (1), 1--3

work page 1950
[7]

Burnham, K. P. and D. R. Anderson (1998). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach \/ (Second ed.). New York, NY: Springer-Verlag

work page 1998
[8]

Cesa-Bianchi, N. and G. Lugosi (2006). Prediction, Learning, and Games . Cambridge University Press, New York

work page 2006
[9]

Chicco, D. and G. Jurman (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone . BMC Medical Informatics and Decision Making\/ 20\/ (1), 16

work page 2020
[10]

Clarke, B. (2003). Comparing Bayes model averaging and stacking when model approximation error cannot be ignored . J. Mach. Learn. Res.\/ 4\/ (null), 683--712

work page 2003
[11]

Cerdeira, F

Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009). Modeling wine preferences by data mining from physicochemical properties . Decision Support Systems\/ 47 , 547--553

work page 2009
[12]

Deane-Mayer, Z. A. and J. E. Knowles (2016). caretEnsemble: ensembles of caret models . R package version\/ 2\/ (0)

work page 2016
[13]

Diaconis, P. and S. L. Zabell (1982). Updating subjective probability . Journal of the American Statistical Association\/ 77\/ (380), 822--830

work page 1982
[14]

Dua, Dheeru and Graff , C. (2019). UCI Machine Learning Repository

work page 2019
[15]

Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models . Statistics and Computing , 24(6):997--1016

work page 2014
[16]

Gil, D., J. L. Girela, J. De Juan , M. J. Gomez-Torres, and M. Johnsson (2012). Predicting seminal quality with artificial intelligence methods . Expert Systems with Applications\/ 39\/ (16), 12564--12573

work page 2012
[17]

Futami, F., Iwata, T., Ueda, N., Sato, I., and Sugiyama, M. (2021). Loss function based second-order Jensen inequality and its application to particle variational inference . Advances in Neural Information Processing Systems \/ 34\/ , 6803 -- 6815

work page 2021
[18]

Futami, F., Iwata, T., Ueda, N., Sato, I., and Sugiyama, M. (2022). Predictive variational Bayesian inference as risk-seeking optimization . International Conference on Artificial Intelligence and Statistics \/ 151\/ , 5051 -- 5083

work page 2022
[19]

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. (2016). PAC-Bayesian theory meets Bayesian inference . Advances in Neural Information Processing Systems \/ 29\/ , 1--9

work page 2016
[20]

and Theussl, S

Ghalanos, A. and Theussl, S. (2015). Rsolnp: General Non-Linear Optimization

work page 2015
[21]

Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation . Journal of the American Statistical Association\/ 102\/ (477), 359--378

work page 2007
[22]

Hurvich, C. M. and C.-L. Tsai (1989). Regression and time series model selection in small samples . Biometrika\/ 76\/ (2), 297--307

work page 1989
[23]

Islam, M. M. F., R. Ferdousi, S. Rahman, and H. Y. Bushra (2019). Likelihood prediction of diabetes at early stage using data mining techniques . Computer Vision and Machine Intelligence in Medical Image Analysis\/

work page 2019
[24]

Jewson, and T

Knoblauch, J., J. Jewson, and T. Damoulas (2022). An optimization-centric view on Bayes’ rule: reviewing and generalizing variational inference . Journal of Machine Learning Research\/ 23\/ (132), 1--109

work page 2022
[25]

Kuhn, M. (2012). The caret package . Journal of Statistical Software\/ 28

work page 2012
[26]

Kullback, S. and R. Leibler (1951). On information and sufficiency . Annals of Mathematical Statistics\/ 22\/ (1), 79--86

work page 1951
[27]

Le, T. and B. Clarke (2017). A Bayes interpretation of stacking for M -Complete and M -Open settings . Bayesian Analysis\/ 12\/ (3), 807--829

work page 2017
[28]

Leblanc, M. and R. Tibshirani (1996). Combining estimates in regression and classification . Journal of the American Statistical Association\/ 91\/ (436), 1641--1650

work page 1996
[29]

Masegosa, A. (2020). Learning under model misspecification: applications to variational and ensemble methods . Advances in Neural Information Processing Systems\/ 33\/ , 5479 -- 5491

work page 2020
[30]

McAllester, D. (1999). PAC-Bayesian model averaging . Proceedings of the 12th Annual Conference on Computational Learning Theory\/ , 164 -- 170

work page 1999
[31]

McLatchie, B.-E

McLatchie, Y., B.-E. Cherief-Abdellatif, D. T. Frazier, and J. Knoblauch (2025). Predictively oriented posteriors . arXiv preprint arXiv:2510.01915\/

work page arXiv 2025
[32]

Minka, A. (2002). Bayesian model averaging is not model combination . https://tminka.github.io/papers/bma.html\/

work page 2002
[33]

Morningstar, W. R., A. A. Alemi, and J. V. Dillon (2022). Pacm-Bayes: narrowing the empirical risk gap in the misspecified Bayesian regime . Proceedings of The 25th International Conference on Artificial Intelligence and Statistics\/ 151\/ , 8270 -- 8298

work page 2022
[34]

R: A Language and Environment for Statistical Computing

R Core Team (2020). R: A Language and Environment for Statistical Computing

work page 2020
[35]

Rigollet, P. and A. B. Tsybakov (2012). Sparse estimation by exponential weighting . Statistical Science\/ 27\/ (4), 18,558--575

work page 2012
[36]

Takacs, L

Sill, J., G. Takacs, L. Mackey, and D. Lin (2009). Feature-weighted linear stacking . arXiv:0911.0460v2\/

work page arXiv 2009
[37]

Machmud, and A

Sobar, R. Machmud, and A. Wijaya (2016). Behavior determinant based cervical cancer early detection with machine learning algorithm . Advanced Science Letters\/ 22\/ (10), 3120--3123

work page 2016
[38]

J., Best, N

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde , A. (2002). Bayesian measures of model complexity and fit . Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 64(4):583--639

work page 2002
[39]

Stan Modeling Language

Stan Development Team (2019). Stan Modeling Language

work page 2019
[40]

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions . Journal of the Royal Statistical Society. Series B (Methodological)\/ 36\/ (2), 111--147

work page 1974
[41]

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections . Communications in Statistics - Theory and Methods\/ 7\/ (1), 13--26

work page 1978
[42]

van Erven, and W

van der Hoeven, D., T. van Erven, and W. Kotlowski (2018). The many faces of exponential weights in online learning . Proceedings of Machine Learning Research\/ 75 , 1--26

work page 2018
[43]

van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007). Super learner . Statistical Applications in Genetics and Molecular Biology\/ 6\/ (1)

work page 2007
[44]

Vassend, O. B. (2022). Justifying the norms of inductive inference . British Journal for the Philosophy of Science\/ 73\/ (1), 135--160

work page 2022
[45]

Vehtari, A., Gabry, J., Magnusson, M., Yao, Y., B \" u rkner, P.-C., Paananen, T., and Gelman, A. (2020). LOO: Efficient leave-one-out cross-validation and WAIC for Bayesian models

work page 2020
[46]

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC . Statistics and Computing , 27(5):1413--1432

work page 2017
[47]

Vershynin, R. (2025). High-Dimensional Probability: An Introduction with Applications in Data Science . Cambridge University Press, New York

work page 2025
[48]

Vovk, V. (1990). Aggregating strategies . In M. Fulk and J. Case (Eds.), Proceedings of the Third Annual Workshop on Computational Learning Theory , pp.\ 371--383. San Mateo, CA: Morgan Kaufmann

work page 1990
[49]

Wagenmakers, E.-J. and S. Farrell (2004). AIC model selection using Akaike weights . Psychonomic Bulletin & Review\/ 11\/ (1), 192--196

work page 2004
[50]

Watanabe, S. (2013). A widely applicable Bayesian information criterion . J. Mach. Learn. Res. , 14(1):867--897

work page 2013
[51]

Williams, P. M. (1980). Bayesian conditionalisation and the principle of minimum information . The British Journal for the Philosophy of Science\/ 31\/ (2), 131--144

work page 1980
[52]

Wolpert, D. H. (1992). Stacked generalization . Neural Networks\/ 5\/ (2), 241--259

work page 1992
[53]

Yao, Y., V. Aki, S. Daniel, and G. Andrew (2018). Using stacking to average Bayesian predictive distributions (with discussion) . Bayesian Analysis\/ 13\/ (3), 917--1007

work page 2018
[54]

Yao, Y., G. Pirs, A. Vehtari, and A. Gelman (2021). Bayesian hierarchical stacking: some models are (somewhere) useful . arXiv:2101.08954v2\/

work page arXiv 2021

[1] [1]

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle

work page 1973

[2] [2]

Akaike, H. (1974). A new look at the statistical model identification . IEEE Transactions on Automatic Control\/ 19\/ (6), 716--723

work page 1974

[3] [3]

Akaike, H. (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting . Biometrika\/ 66\/ (2), 237--242

work page 1979

[4] [4]

Bissiri, P. G., C. Holmes, and S. Walker (2016). A general framework for updating belief distributions . Journal of the Royal Statistical Society. Series B (Methodological)\/ 78\/ (5), 1103--1130

work page 2016

[5] [5]

Breiman, L. (1996). Stacked regressions . Machine Learning\/ 24\/ (1), 49--64

work page 1996

[6] [6]

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability . Monthly Weather Review\/ 7\/ (1), 1--3

work page 1950

[7] [7]

Burnham, K. P. and D. R. Anderson (1998). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach \/ (Second ed.). New York, NY: Springer-Verlag

work page 1998

[8] [8]

Cesa-Bianchi, N. and G. Lugosi (2006). Prediction, Learning, and Games . Cambridge University Press, New York

work page 2006

[9] [9]

Chicco, D. and G. Jurman (2020). Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone . BMC Medical Informatics and Decision Making\/ 20\/ (1), 16

work page 2020

[10] [10]

Clarke, B. (2003). Comparing Bayes model averaging and stacking when model approximation error cannot be ignored . J. Mach. Learn. Res.\/ 4\/ (null), 683--712

work page 2003

[11] [11]

Cerdeira, F

Cortez, P., A. Cerdeira, F. Almeida, T. Matos, and J. Reis (2009). Modeling wine preferences by data mining from physicochemical properties . Decision Support Systems\/ 47 , 547--553

work page 2009

[12] [12]

Deane-Mayer, Z. A. and J. E. Knowles (2016). caretEnsemble: ensembles of caret models . R package version\/ 2\/ (0)

work page 2016

[13] [13]

Diaconis, P. and S. L. Zabell (1982). Updating subjective probability . Journal of the American Statistical Association\/ 77\/ (380), 822--830

work page 1982

[14] [14]

Dua, Dheeru and Graff , C. (2019). UCI Machine Learning Repository

work page 2019

[15] [15]

Gelman, A., Hwang, J., and Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models . Statistics and Computing , 24(6):997--1016

work page 2014

[16] [16]

Gil, D., J. L. Girela, J. De Juan , M. J. Gomez-Torres, and M. Johnsson (2012). Predicting seminal quality with artificial intelligence methods . Expert Systems with Applications\/ 39\/ (16), 12564--12573

work page 2012

[17] [17]

Futami, F., Iwata, T., Ueda, N., Sato, I., and Sugiyama, M. (2021). Loss function based second-order Jensen inequality and its application to particle variational inference . Advances in Neural Information Processing Systems \/ 34\/ , 6803 -- 6815

work page 2021

[18] [18]

Futami, F., Iwata, T., Ueda, N., Sato, I., and Sugiyama, M. (2022). Predictive variational Bayesian inference as risk-seeking optimization . International Conference on Artificial Intelligence and Statistics \/ 151\/ , 5051 -- 5083

work page 2022

[19] [19]

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. (2016). PAC-Bayesian theory meets Bayesian inference . Advances in Neural Information Processing Systems \/ 29\/ , 1--9

work page 2016

[20] [20]

and Theussl, S

Ghalanos, A. and Theussl, S. (2015). Rsolnp: General Non-Linear Optimization

work page 2015

[21] [21]

Gneiting, T. and A. E. Raftery (2007). Strictly proper scoring rules, prediction, and estimation . Journal of the American Statistical Association\/ 102\/ (477), 359--378

work page 2007

[22] [22]

Hurvich, C. M. and C.-L. Tsai (1989). Regression and time series model selection in small samples . Biometrika\/ 76\/ (2), 297--307

work page 1989

[23] [23]

Islam, M. M. F., R. Ferdousi, S. Rahman, and H. Y. Bushra (2019). Likelihood prediction of diabetes at early stage using data mining techniques . Computer Vision and Machine Intelligence in Medical Image Analysis\/

work page 2019

[24] [24]

Jewson, and T

Knoblauch, J., J. Jewson, and T. Damoulas (2022). An optimization-centric view on Bayes’ rule: reviewing and generalizing variational inference . Journal of Machine Learning Research\/ 23\/ (132), 1--109

work page 2022

[25] [25]

Kuhn, M. (2012). The caret package . Journal of Statistical Software\/ 28

work page 2012

[26] [26]

Kullback, S. and R. Leibler (1951). On information and sufficiency . Annals of Mathematical Statistics\/ 22\/ (1), 79--86

work page 1951

[27] [27]

Le, T. and B. Clarke (2017). A Bayes interpretation of stacking for M -Complete and M -Open settings . Bayesian Analysis\/ 12\/ (3), 807--829

work page 2017

[28] [28]

Leblanc, M. and R. Tibshirani (1996). Combining estimates in regression and classification . Journal of the American Statistical Association\/ 91\/ (436), 1641--1650

work page 1996

[29] [29]

Masegosa, A. (2020). Learning under model misspecification: applications to variational and ensemble methods . Advances in Neural Information Processing Systems\/ 33\/ , 5479 -- 5491

work page 2020

[30] [30]

McAllester, D. (1999). PAC-Bayesian model averaging . Proceedings of the 12th Annual Conference on Computational Learning Theory\/ , 164 -- 170

work page 1999

[31] [31]

McLatchie, B.-E

McLatchie, Y., B.-E. Cherief-Abdellatif, D. T. Frazier, and J. Knoblauch (2025). Predictively oriented posteriors . arXiv preprint arXiv:2510.01915\/

work page arXiv 2025

[32] [32]

Minka, A. (2002). Bayesian model averaging is not model combination . https://tminka.github.io/papers/bma.html\/

work page 2002

[33] [33]

Morningstar, W. R., A. A. Alemi, and J. V. Dillon (2022). Pacm-Bayes: narrowing the empirical risk gap in the misspecified Bayesian regime . Proceedings of The 25th International Conference on Artificial Intelligence and Statistics\/ 151\/ , 8270 -- 8298

work page 2022

[34] [34]

R: A Language and Environment for Statistical Computing

R Core Team (2020). R: A Language and Environment for Statistical Computing

work page 2020

[35] [35]

Rigollet, P. and A. B. Tsybakov (2012). Sparse estimation by exponential weighting . Statistical Science\/ 27\/ (4), 18,558--575

work page 2012

[36] [36]

Takacs, L

Sill, J., G. Takacs, L. Mackey, and D. Lin (2009). Feature-weighted linear stacking . arXiv:0911.0460v2\/

work page arXiv 2009

[37] [37]

Machmud, and A

Sobar, R. Machmud, and A. Wijaya (2016). Behavior determinant based cervical cancer early detection with machine learning algorithm . Advanced Science Letters\/ 22\/ (10), 3120--3123

work page 2016

[38] [38]

J., Best, N

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde , A. (2002). Bayesian measures of model complexity and fit . Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 64(4):583--639

work page 2002

[39] [39]

Stan Modeling Language

Stan Development Team (2019). Stan Modeling Language

work page 2019

[40] [40]

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions . Journal of the Royal Statistical Society. Series B (Methodological)\/ 36\/ (2), 111--147

work page 1974

[41] [41]

Sugiura, N. (1978). Further analysis of the data by Akaike's information criterion and the finite corrections . Communications in Statistics - Theory and Methods\/ 7\/ (1), 13--26

work page 1978

[42] [42]

van Erven, and W

van der Hoeven, D., T. van Erven, and W. Kotlowski (2018). The many faces of exponential weights in online learning . Proceedings of Machine Learning Research\/ 75 , 1--26

work page 2018

[43] [43]

van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007). Super learner . Statistical Applications in Genetics and Molecular Biology\/ 6\/ (1)

work page 2007

[44] [44]

Vassend, O. B. (2022). Justifying the norms of inductive inference . British Journal for the Philosophy of Science\/ 73\/ (1), 135--160

work page 2022

[45] [45]

Vehtari, A., Gabry, J., Magnusson, M., Yao, Y., B \" u rkner, P.-C., Paananen, T., and Gelman, A. (2020). LOO: Efficient leave-one-out cross-validation and WAIC for Bayesian models

work page 2020

[46] [46]

Vehtari, A., Gelman, A., and Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC . Statistics and Computing , 27(5):1413--1432

work page 2017

[47] [47]

Vershynin, R. (2025). High-Dimensional Probability: An Introduction with Applications in Data Science . Cambridge University Press, New York

work page 2025

[48] [48]

Vovk, V. (1990). Aggregating strategies . In M. Fulk and J. Case (Eds.), Proceedings of the Third Annual Workshop on Computational Learning Theory , pp.\ 371--383. San Mateo, CA: Morgan Kaufmann

work page 1990

[49] [49]

Wagenmakers, E.-J. and S. Farrell (2004). AIC model selection using Akaike weights . Psychonomic Bulletin & Review\/ 11\/ (1), 192--196

work page 2004

[50] [50]

Watanabe, S. (2013). A widely applicable Bayesian information criterion . J. Mach. Learn. Res. , 14(1):867--897

work page 2013

[51] [51]

Williams, P. M. (1980). Bayesian conditionalisation and the principle of minimum information . The British Journal for the Philosophy of Science\/ 31\/ (2), 131--144

work page 1980

[52] [52]

Wolpert, D. H. (1992). Stacked generalization . Neural Networks\/ 5\/ (2), 241--259

work page 1992

[53] [53]

Yao, Y., V. Aki, S. Daniel, and G. Andrew (2018). Using stacking to average Bayesian predictive distributions (with discussion) . Bayesian Analysis\/ 13\/ (3), 917--1007

work page 2018

[54] [54]

Yao, Y., G. Pirs, A. Vehtari, and A. Gelman (2021). Bayesian hierarchical stacking: some models are (somewhere) useful . arXiv:2101.08954v2\/

work page arXiv 2021