
arXiv: 2605.27747 · v1 · pith:SCBEVL5Vnew · submitted 2026-05-26 · stat.ML · cs.LG · stat.CO

Soft Specialists: α-Rényi Ensembles for Uncertainty-Aware LLM Post-Training

Pith reviewed 2026-06-29 15:11 UTC · model grok-4.3

classification: stat.ML · cs.LG · stat.CO
keywords: α-Rényi variational framework · LLM post-training · LoRA ensembles · epistemic uncertainty · soft routing · preference optimisation · model specialisation

The pith

An α-Rényi variational framework learns distributions over LLM post-training parameters to represent uncertainty from conflicting data as epistemic spread.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an α-Rényi variational framework for learning distributions over post-training parameters instead of single point estimates. This offers an uncertainty-aware alternative to deep ensembles by interpolating between variational Bayes and predictively oriented objectives. The approach is applied to LLMs by attaching an ensemble of LoRA adapters to a shared frozen base model for both supervised fine-tuning and preference optimisation. Local stability criteria are identified to show that model misspecification favors non-degenerate posterior spread, turning contradictory data into epistemic uncertainty. Training examples are softly routed across ensemble members to promote specialization and yield actionable uncertainty estimates.

Core claim

We propose an α-Rényi variational framework for learning distributions over post-training parameters, offering an uncertainty-aware alternative to deep ensemble approaches. The resulting variational objective interpolates between classical variational Bayes and predictively oriented posterior learning, balancing globally plausible individual models against systems of complementary specialists. We identify local stability criteria, demonstrating how model misspecification can make non-degenerate posterior spread locally favourable, manifesting contradictory or conflicting data as epistemic uncertainty. We apply our framework to LLM post-training, learning an ensemble of LoRA adapters…

What carries the argument

The α-Rényi variational objective, which interpolates between classical variational Bayes and predictively oriented posterior learning to balance individual model plausibility against complementary specialists.
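
To make the interpolation concrete, one family of per-example terms with exactly this behaviour is the tilted log-average below. This is an illustrative sketch, not the paper's stated objective: the abstract gives no equations, so the symbols Q (a distribution over post-training parameters θ), p(y | x, θ) (the model likelihood) and the placement of α are all assumptions.

```latex
\ell_\alpha(Q; x, y) = \frac{1}{\alpha} \log \mathbb{E}_{\theta \sim Q}\!\left[ p(y \mid x, \theta)^{\alpha} \right],
\qquad \alpha \in (0, 1],

\lim_{\alpha \to 0} \ell_\alpha(Q; x, y) = \mathbb{E}_{\theta \sim Q}\!\left[ \log p(y \mid x, \theta) \right],
\qquad
\ell_1(Q; x, y) = \log \mathbb{E}_{\theta \sim Q}\!\left[ p(y \mid x, \theta) \right].
```

Under these assumptions, the α → 0 limit is the expected log-likelihood term of a classical variational Bayes bound, which asks every sampled model to fit every example. At α = 1 only the mixture predictive is scored, so individual models are free to specialise. By Jensen's inequality the term is non-decreasing in α.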

If this is right

  • Enables soft routing of training examples across ensemble members.
  • Promotes model specialisation among the adapters.
  • Provides actionable uncertainty estimates across different tasks.
  • Offers a scalable procedure for both supervised fine-tuning and preference optimisation.
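
The soft routing named in the first bullet can be sketched for a single training example as follows. This is a minimal sketch under stated assumptions: the function name `soft_route`, the softmax form of the responsibilities, and the log-mean-exp objective are hypothetical choices consistent with the summary above, not details confirmed by the paper.

```python
import math


def soft_route(s, alpha):
    """Route one example across M ensemble members (assumed form).

    s:     list of M sequence log-likelihoods, one per member.
    alpha: tilt in [0, 1]; 0 treats all members equally.
    Returns (weights, objective), where weights sum to 1.
    """
    m = len(s)
    if alpha == 0.0:
        # Limit case: uniform responsibilities, plain average of log-likelihoods.
        return [1.0 / m] * m, sum(s) / m
    z = [alpha * v for v in s]
    z_max = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - z_max) for v in z]
    total = sum(e)
    weights = [v / total for v in e]  # softmax of alpha * s
    # (1/alpha) * log( (1/M) * sum_i exp(alpha * s_i) )
    objective = (z_max + math.log(total / m)) / alpha
    return weights, objective


weights, objective = soft_route([-1.0, -3.0], alpha=1.0)
print(weights, objective)
```

In this form the weights are the gradient of the objective with respect to each member's log-likelihood, so a member that explains an example well receives a larger share of that example's training signal. Larger α sharpens the routing, and α = 0 removes it.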

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The soft routing mechanism could be used to diagnose which types of data trigger high uncertainty in practice.
  • The framework might extend naturally to other parameter-efficient methods beyond LoRA for handling heterogeneous data.
  • Uncertainty estimates from the ensemble could serve as a signal for active data collection on conflicting examples.
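
For the active-data-collection idea in the last bullet, one widely used way to turn ensemble disagreement into a score is the entropy decomposition sketched below. It is offered as an assumption about how such a signal might be computed; the abstract does not say which uncertainty measure the paper uses, and the function names are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def uncertainty_split(member_probs):
    """Split predictive uncertainty for one input.

    member_probs: list of M categorical distributions over the same K
                  outcomes, one per ensemble member.
    Returns (total, aleatoric, epistemic), where epistemic is the mutual
    information between the prediction and the choice of member.
    """
    m = len(member_probs)
    k = len(member_probs[0])
    mean_p = [sum(p[j] for p in member_probs) / m for j in range(k)]
    total = entropy(mean_p)                               # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in member_probs) / m  # average member entropy
    return total, aleatoric, total - aleatoric


# Members that disagree confidently yield high epistemic uncertainty.
print(uncertainty_split([[1.0, 0.0], [0.0, 1.0]]))
```

Inputs with a large epistemic term are those on which members disagree, which makes them natural candidates for relabelling or further data collection.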

Load-bearing premise

That local stability criteria demonstrate how model misspecification makes non-degenerate posterior spread locally favourable and manifests contradictory data as epistemic uncertainty in LLM post-training.

What would settle it

An experiment on a dataset with known conflicting labels where the α-Rényi ensemble fails to produce better uncertainty calibration or task performance than a single fine-tuned model or a standard deep ensemble.
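
The calibration half of such a comparison could be scored with expected calibration error (ECE). The sketch below is a standard equal-width-bin ECE, chosen here for illustration; the paper's own evaluation metrics are not stated in the abstract, and the function name and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: list of predicted probabilities in [0, 1] for the chosen answer.
    correct:     list of booleans, True where the chosen answer was right.
    Returns the bin-size-weighted mean gap between confidence and accuracy.
    """
    n = len(confidences)
    # Per bin: [count, sum of confidences, number correct]
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx][0] += 1
        bins[idx][1] += c
        bins[idx][2] += 1.0 if ok else 0.0
    # sum_b (n_b / n) * |acc_b - conf_b| simplifies to |sum_correct - sum_conf| / n per bin
    return sum(abs(sc - sa) for cnt, sc, sa in bins if cnt) / n


# Fully confident but right only half the time: ECE is 0.5.
print(expected_calibration_error([1.0, 1.0], [True, False]))
```

Running the same function on the α-Rényi ensemble, a single fine-tuned model, and a standard deep ensemble would give directly comparable numbers, with lower values indicating better calibration.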

Figures

Figures reproduced from arXiv: 2605.27747 by Andrew B. Duncan, Georgy Tyukin, Paula Cordero-Encinar.

Figure 1: α-Rényi flow training for LLM ensembles. A single frozen base model W_0 is shared across M trainable LoRA particles. For each minibatch example, the particles produce sequence log-likelihoods s_{i,b}, which are coupled through the objective. The resulting responsibilities w^{(α)}_{i,b} softly route examples towards particles that explain them well, inducing specialisation for α > 0. The final predictor is the induc…
Figure 2: Behaviour of the model in Example 2 for α above and below the critical threshold. (Left) Clean and contaminated samples. (Right) Posterior predictive distributions when x ≥ 0 and x < 0. The posterior predictive mean and variance under Q = N(m⋆, Σ) are E_Q[Y | X = x] = ϕ(x)ᵀ m⋆ = βx + εa x₊ and Var_Q(Y | X = x) = σ² + ϕ(x)ᵀ Σ ϕ(x), where σ² is the aleatoric term and ϕ(x)ᵀ Σ ϕ(x) is the epistemic term. Restricting the posterior covariance to t…
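
The mean and variance formulas in the Figure 2 caption can be evaluated numerically as in the sketch below. The feature map ϕ(x) = (x, max(x, 0)) is an assumption read off from the stated mean βx + εa x₊, with m⋆ = (β, εa); the function name and all numerical values are hypothetical.

```python
def predictive_moments(x, m_star, Sigma, sigma2):
    """Posterior predictive mean and variance split under Q = N(m_star, Sigma).

    Assumes phi(x) = (x, max(x, 0)); this feature map is inferred, not stated.
    Returns (mean, aleatoric, epistemic).
    """
    phi = [x, max(x, 0.0)]
    mean = sum(p * m for p, m in zip(phi, m_star))
    # Quadratic form phi(x)^T Sigma phi(x)
    epistemic = sum(phi[i] * Sigma[i][j] * phi[j]
                    for i in range(2) for j in range(2))
    return mean, sigma2, epistemic


# Hypothetical values: covariance only along the max(x, 0) feature.
m_star = [2.0, 0.5]
Sigma = [[0.0, 0.0], [0.0, 0.04]]
print(predictive_moments(-1.0, m_star, Sigma, sigma2=1.0))
print(predictive_moments(2.0, m_star, Sigma, sigma2=1.0))
```

With covariance confined to the second feature, the epistemic term vanishes for x < 0 and grows as x² for x ≥ 0, so posterior spread appears only on the side of the input space where that feature is active.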
Figure 3: Emergent specialisation on the MMLU benchmark. Model performances (rows) are…
Original abstract

Existing training approaches for large language models learn a single set of parameters from large volumes of data, which is typically heterogeneous, conflicting and often outright contradictory. As a result, the model is forced to compress conflicting goals and inherent uncertainties into a single, averaged pattern of behaviour. We propose an α-Rényi variational framework for learning distributions over post-training parameters, offering an uncertainty-aware alternative to deep ensemble approaches. The resulting variational objective interpolates between classical variational Bayes and predictively oriented posterior learning, balancing globally plausible individual models against systems of complementary specialists. We identify local stability criteria, demonstrating how model misspecification can make non-degenerate posterior spread locally favourable, manifesting contradictory or conflicting data as epistemic uncertainty. We apply our framework to LLM post-training, learning an ensemble of LoRA adapters attached to a shared, frozen base model, providing a scalable training procedure for both supervised fine-tuning and preference optimisation. Our approach enables training examples to be softly routed across ensemble members, promoting model specialisation and providing actionable uncertainty estimates across different tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes an α-Rényi variational framework for learning distributions over post-training parameters of LLMs. It applies this to ensembles of LoRA adapters on a frozen base model for both supervised fine-tuning and preference optimization, claiming the approach enables soft routing of examples, model specialization, and actionable uncertainty estimates. The framework is said to interpolate between variational Bayes and predictive posterior learning, with local stability criteria demonstrating that model misspecification favors non-degenerate posterior spread, turning conflicting data into epistemic uncertainty.

Significance. If the stability analysis and empirical validation hold, the work could supply a scalable, uncertainty-aware alternative to deep ensembles for LLM post-training, with explicit handling of data conflicts via specialization.

major comments (2)
  1. [Abstract] The local stability criteria are asserted to show that misspecification makes non-degenerate posteriors locally favourable, yet the abstract supplies no definition of the criteria, no local expansion, and no explicit conditions under which the non-degenerate solution is preferred. This leaves the claimed interpolation mechanism and its application to the LoRA-ensemble setting without a demonstrated derivation.
  2. [Abstract] The manuscript states that the framework is applied to LLM post-training with results for SFT and preference optimisation, but supplies no training procedure details, no objective function, and no empirical results or validation. Without these, the central claims on scalability and uncertainty estimates cannot be assessed.
minor comments (1)
  1. Notation for the α-Rényi objective and its relation to the variational parameters should be introduced with explicit equations rather than descriptive prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback on the abstract. We address each major comment below and will revise the manuscript to improve clarity and completeness.

  1. Referee: [Abstract] The local stability criteria are asserted to show that misspecification makes non-degenerate posteriors locally favourable, yet the abstract supplies no definition of the criteria, no local expansion, and no explicit conditions under which the non-degenerate solution is preferred. This leaves the claimed interpolation mechanism and its application to the LoRA-ensemble setting without a demonstrated derivation.

    Authors: We agree that the abstract does not contain the full derivation. The local stability criteria, including the local expansion around the degenerate solution and the explicit conditions favoring non-degenerate spread under misspecification, are derived in Section 3.2, where the interpolation between variational Bayes and predictive posterior learning is also shown. We will revise the abstract to include a concise reference to these criteria and their implications for the LoRA ensemble, and add a short summary paragraph in the introduction linking the stability analysis directly to the claimed specialization mechanism.

    Revision: yes

  2. Referee: [Abstract] The manuscript states that the framework is applied to LLM post-training with results for SFT and preference optimisation, but supplies no training procedure details, no objective function, and no empirical results or validation. Without these, the central claims on scalability and uncertainty estimates cannot be assessed.

    Authors: We acknowledge that the current version emphasizes the theoretical framework and describes the application at a high level without sufficient procedural or empirical detail. We will add a dedicated methods subsection specifying the α-Rényi variational objective for both SFT and preference optimization, the exact training procedure for the ensemble of LoRA adapters on the frozen base model, and a new experimental section with results validating scalability and uncertainty estimates on the relevant tasks.

    Revision: yes

Circularity Check

0 steps flagged

No circularity detectable from provided text


The abstract asserts identification of local stability criteria and an interpolation between variational Bayes and predictive posterior learning, but supplies no equations, derivations, or self-citations. No load-bearing step reduces by construction to fitted inputs or prior self-citations, as no such material is visible. The derivation chain cannot be walked for reductions; the paper is treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly relies on standard variational inference assumptions and the existence of local stability criteria, but these cannot be audited without the full text.


