
arXiv: 2506.10060 · v2 · submitted 2025-06-11 · 💻 cs.LG · cs.AI · stat.ML

Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

Pith reviewed 2026-05-19 09:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords Bayesian inference · LLM prompts · uncertainty quantification · MCMC · Metropolis-Hastings · prompt engineering · textual parameters

The pith

Treating prompts as textual parameters enables Bayesian inference over LLM prompts and predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes viewing prompts in LLM-based systems as parameters within a statistical model. This framing permits Bayesian inference over the prompts themselves using a modest training dataset. It further allows free-form textual priors to inform the process and supports uncertainty quantification for both the prompts and the resulting predictions. A sympathetic reader would care because many LLM applications depend heavily on prompt choice yet lack reliable ways to measure and control uncertainty, especially when models are closed-source.

Core claim

Interpreting prompts as textual parameters in a statistical model enables principled Bayesian inference over these prompts and downstream predictions while incorporating free-form textual priors. To carry out the inference, the authors introduce Metropolis-Hastings through LLM Proposals (MHLP), a Markov chain Monte Carlo algorithm that pairs prompt-optimization techniques with standard MCMC sampling. The method is presented as a turnkey addition to existing pipelines, including those using only black-box models, and is reported to improve predictive accuracy and uncertainty calibration on LLM benchmarks and dedicated UQ tasks.
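Written out, this framing is ordinary Bayes' rule with a prompt in the role of the parameter. The notation below is an assumption made for illustration, since the material reviewed here quotes no equations: $\theta$ is a prompt, $\mathcal{D}$ is the training set, $(x, y)$ is a test input and output, and $S$ is the number of posterior samples.

```latex
\[
p(\theta \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid \theta)\, p(\theta),
\qquad
p(y \mid x, \mathcal{D})
  \;=\; \sum_{\theta} p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})
  \;\approx\; \frac{1}{S} \sum_{s=1}^{S} p(y \mid x, \theta_s),
\quad \theta_s \sim p(\theta \mid \mathcal{D}).
\]
```

The prior $p(\theta)$ is where a free-form textual prior would enter, and the Monte Carlo average on the right is what posterior samples from a sampler such as MHLP would be used for.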

What carries the argument

Metropolis-Hastings through LLM Proposals (MHLP), a Markov chain Monte Carlo sampler that generates candidate textual prompts via LLM-driven optimization to approximate the posterior distribution over prompts.
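The accept/reject step shared by any Metropolis-Hastings sampler can be sketched on a toy problem. This is not the paper's MHLP implementation: the three prompts, their weights in `TOY_WEIGHTS`, and the uniform `propose` function are hypothetical stand-ins. MHLP would instead use an LLM to propose prompts and would score them by querying the model on training data.

```python
import math
import random
from collections import Counter

# Hypothetical toy posterior over three candidate prompts (unnormalized weights).
TOY_WEIGHTS = {
    "Think step by step.": 6.0,
    "Answer concisely.": 3.0,
    "Double-check the arithmetic.": 1.0,
}
PROMPTS = list(TOY_WEIGHTS)


def log_target(prompt):
    # Stand-in for log p(data | prompt) + log p(prompt).
    return math.log(TOY_WEIGHTS[prompt])


def propose(prompt, rng):
    # Stand-in for an LLM proposal: pick a different prompt uniformly at random.
    return rng.choice([p for p in PROMPTS if p != prompt])


def log_q(to_prompt, from_prompt):
    # log q(to | from) for the uniform proposal above (symmetric here).
    return -math.log(len(PROMPTS) - 1)


def mh_chain(init, n_steps, seed=0):
    """Run Metropolis-Hastings and return the visited prompts."""
    rng = random.Random(seed)
    current = init
    chain = []
    for _ in range(n_steps):
        cand = propose(current, rng)
        # Log acceptance ratio, including the Hastings proposal correction.
        log_alpha = (log_target(cand) - log_target(current)
                     + log_q(current, cand) - log_q(cand, current))
        if log_alpha >= 0 or rng.random() < math.exp(log_alpha):
            current = cand
        chain.append(current)  # a rejected move repeats the current state
    return chain


def empirical_freqs(chain):
    """Fraction of steps the chain spent at each prompt."""
    counts = Counter(chain)
    return {p: counts[p] / len(chain) for p in PROMPTS}
```

With a long enough chain, `empirical_freqs(mh_chain(PROMPTS[0], 50_000))` should land close to the normalized toy weights (0.6, 0.3, 0.1). The proposal correction cancels in this toy because the proposal is symmetric; it is kept in the code because an LLM-driven proposal generally would not be.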

If this is right

  • Uncertainty can be quantified jointly over the choice of prompt and the downstream model output.
  • Prior knowledge about good prompts can be expressed directly in natural language and folded into the inference.
  • Existing LLM pipelines gain improved calibration without requiring changes to model weights or access to internals.
  • Predictive accuracy rises on standard benchmarks when posterior sampling replaces hand-tuned prompts.
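The first and last bullets can be made concrete with a small sketch of how answers from several sampled prompts might be pooled into one prediction with a confidence score. The aggregation rule here (a plain vote over one answer per sampled prompt) is an assumption for illustration; the paper's exact procedure is not quoted on this page, and the function names are hypothetical.

```python
from collections import Counter


def predictive_distribution(answers):
    """Empirical predictive distribution from one answer per sampled prompt."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    total = len(answers)
    return {answer: count / total for answer, count in counts.items()}


def predict_with_confidence(answers):
    """Return the most frequent answer and its empirical probability."""
    dist = predictive_distribution(answers)
    best = max(dist, key=dist.get)
    return best, dist[best]


# Hypothetical answers produced by five prompts drawn from the posterior.
sampled_answers = ["42", "42", "41", "42", "17"]
answer, confidence = predict_with_confidence(sampled_answers)
```

Here `answer` is `"42"` with a confidence of 0.6, since three of the five sampled prompts agree. A single fixed prompt would give no comparable signal about prompt sensitivity.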

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same posterior-sampling idea could be applied to other discrete choices such as tool selection or chain-of-thought templates.
  • Hybrid systems might combine MHLP with gradient-based methods when partial access to model internals becomes available.
  • Empirical studies on prompt-length scaling and mixing time would clarify practical limits of the approach.

Load-bearing premise

The MHLP algorithm can perform effective Bayesian inference over the discrete high-dimensional space of textual prompts even when the underlying LLM is a black box.

What would settle it

On a controlled benchmark where prompt sensitivity is known, if MHLP chains fail to mix or if the resulting uncertainty estimates show no improvement in calibration or accuracy over standard single-prompt baselines, the central claim would be falsified.
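One common way to check whether calibration improved is expected calibration error (ECE), which compares average confidence with accuracy inside confidence bins. Whether the paper reports this exact metric is not stated on this page, so the sketch below is only one plausible way to run the test described above; the function name and the equal-width binning are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over [0, 1].

    confidences: predicted probabilities for the chosen answers
    correct: booleans, True where the chosen answer was right
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

For example, four predictions at confidence 0.9 of which three are correct give an ECE of 0.15. A lower value for posterior-sampled prompts than for a single-prompt baseline, on the same benchmark, would count as evidence for the claim.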

Figures

Figures reproduced from arXiv: 2506.10060 by Atiyeh Ashari Ghomi, Brendan Leigh Ross, Gabriel Loaiza-Ganem, Jesse C. Cresswell, Ji Xin, Kin Kwan Leung, Noël Vouitsis, Rasa Hosseinzadeh, Shiyi Hou, Yi Sui, Zhaoyan Liu.

Figure 1: In chain-of-thought (CoT) prompting (left), answers are generated by an LLM using a single fixed prompt; this frequentist approach does not account for uncertainty about how the model should be prompted, causing potential issues such as overconfidence on incorrect answers. In our Textual Bayes approach (right), we sample prompts from our Bayesian posterior and then use each prompt to generate answers from …
Figure 2: Comparison of conformal factuality for frequency scoring with a fixed prompt [40], and with prompts sampled through MHLP. (a) The empirical factuality achieved in practice is consistently within the bounds guaranteed by Equation 8. (b) MHLP achieves the same level of empirical factuality as frequency scoring but removes fewer claims, indicating better calibrated confidence.
Original abstract

Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem--one that limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference--a difficult problem even for well-studied data modalities--we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that interpreting prompts as textual parameters in a statistical model enables Bayesian inference over prompts and predictions via a new MCMC algorithm, Metropolis-Hastings through LLM Proposals (MHLP). This allows incorporation of free-form textual priors and yields improvements in predictive accuracy and uncertainty quantification on LLM benchmarks and UQ tasks, even for closed-source models.

Significance. If the central claim holds, the work provides a practical bridge between Bayesian methods and LLM pipelines, enabling principled UQ over prompt sensitivity without requiring white-box access. It could support more calibrated systems in high-stakes settings by leveraging existing prompt optimization techniques within an MCMC framework.

major comments (2)
  1. [§3] §3 (MHLP algorithm description): The central claim requires that MHLP produces samples from the posterior p(prompt | data). However, the paper provides no proof or diagnostic that the LLM-based proposal kernel is irreducible and aperiodic over the combinatorial space of textual prompts, nor that the Metropolis-Hastings acceptance ratio is correctly evaluated when the likelihood involves a black-box LLM. This undermines the assertion that reported accuracy and calibration gains are Bayesian rather than artifacts of the proposal mechanism.
  2. [§4] §4 (Experimental evaluation): The reported improvements in predictive accuracy and UQ are presented without convergence diagnostics (e.g., trace plots, effective sample size, or Gelman-Rubin statistics) or details on chain length, burn-in, or proposal tuning. In a discrete space whose size grows exponentially with prompt length, this leaves open the possibility that the sampler has not mixed, rendering the empirical gains non-Bayesian.
minor comments (2)
  1. [Abstract] The abstract states that MHLP is a 'turnkey modification' but does not clarify how the likelihood is approximated when using closed-source models; this should be expanded with a concrete example in the methods.
  2. [Notation] Notation for the textual prior and the target density should be introduced earlier and used consistently to improve readability for readers unfamiliar with prompt engineering.
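Major comment 2 asks for convergence diagnostics. Because prompts are text, such diagnostics would have to be computed on a scalar trace per chain, for example the log-posterior at each step; that choice is an assumption here, not something the page attributes to the paper. A minimal sketch of the classic Gelman-Rubin statistic (potential scale reduction, R-hat) over several equal-length chains:

```python
import math
import statistics


def gelman_rubin(chains):
    """Classic Gelman-Rubin R-hat for m chains of equal length n.

    chains: list of lists of floats, e.g. per-step log-posterior values.
    Values near 1 are consistent with mixing; values well above 1 are not.
    """
    m = len(chains)
    if m < 2:
        raise ValueError("need at least two chains")
    n = len(chains[0])
    if n < 2 or any(len(c) != n for c in chains):
        raise ValueError("chains must share a common length of at least 2")
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain sample variance.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    if within == 0:
        raise ValueError("within-chain variance is zero")
    # B: n times the sample variance of the chain means.
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)
```

Two chains stuck in different regions, such as `[0, 1, 0, 1]` and `[10, 11, 10, 11]`, produce an R-hat far above 1, which is the failure mode the referee is worried about.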

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

point-by-point responses
  1. Referee: [§3] §3 (MHLP algorithm description): The central claim requires that MHLP produces samples from the posterior p(prompt | data). However, the paper provides no proof or diagnostic that the LLM-based proposal kernel is irreducible and aperiodic over the combinatorial space of textual prompts, nor that the Metropolis-Hastings acceptance ratio is correctly evaluated when the likelihood involves a black-box LLM. This undermines the assertion that reported accuracy and calibration gains are Bayesian rather than artifacts of the proposal mechanism.

    Authors: We acknowledge that the manuscript does not include a formal proof of irreducibility or aperiodicity for the LLM proposal kernel, nor explicit diagnostics for the acceptance ratio in the black-box setting. The combinatorial and effectively unbounded nature of the prompt space makes such proofs challenging and non-standard compared to finite-state MCMC. However, MHLP follows the standard Metropolis-Hastings framework: proposals are generated by the LLM (leveraging prompt optimization techniques), and the acceptance ratio is computed using the exact ratio of the unnormalized posterior densities, where the likelihood p(data | prompt) is obtained by querying the (black-box) LLM on the training data. The method is thus correct whenever the proposal kernel allows exploration, which our empirical results across multiple benchmarks support through consistent gains over non-Bayesian baselines. We will add a dedicated discussion subsection on these theoretical considerations and practical limitations in the revision.
    revision: partial

  2. Referee: [§4] §4 (Experimental evaluation): The reported improvements in predictive accuracy and UQ are presented without convergence diagnostics (e.g., trace plots, effective sample size, or Gelman-Rubin statistics) or details on chain length, burn-in, or proposal tuning. In a discrete space whose size grows exponentially with prompt length, this leaves open the possibility that the sampler has not mixed, rendering the empirical gains non-Bayesian.

    Authors: We agree that the absence of convergence diagnostics is a limitation, particularly given the discrete prompt space. The current experiments report results from multiple independent chains but do not include trace plots, effective sample sizes, Gelman-Rubin statistics, or explicit details on burn-in and tuning. In the revised manuscript we will incorporate these diagnostics (including trace plots for log-posterior and accuracy metrics, ESS values, and chain length/burn-in specifications) to demonstrate adequate mixing and support that the reported gains arise from posterior sampling rather than proposal artifacts.
    revision: yes
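The effective sample size (ESS) promised in the second response can be estimated from the autocorrelation of a scalar trace. The sketch below uses a simple rule, truncating the autocorrelation sum at the first non-positive lag, which is one of several conventions; the function name and the choice of trace are assumptions rather than anything specified by the authors.

```python
def effective_sample_size(trace):
    """Estimate ESS as n / (1 + 2 * sum of positive-lag autocorrelations).

    trace: list of floats from one chain, e.g. log-posterior per step.
    The sum is truncated at the first lag whose autocorrelation is <= 0.
    """
    n = len(trace)
    if n < 2:
        raise ValueError("trace must have at least two values")
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace) / n
    if var == 0:
        raise ValueError("trace is constant")
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        acov = sum((trace[i] - mean) * (trace[i + lag] - mean)
                   for i in range(n - lag)) / n
        rho = acov / var
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau
```

A chain that sits in one state for half its length and in another for the rest, such as `[0.0] * 50 + [1.0] * 50`, yields an ESS of only a few samples despite its 100 steps. That gap between nominal and effective sample count is what the requested diagnostics would expose.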

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper frames prompts as textual parameters and introduces MHLP as a novel MCMC adaptation that combines prompt optimization with standard Metropolis-Hastings. The abstract and provided description present this as an independent algorithmic contribution enabling Bayesian inference over discrete textual spaces, without any quoted equations or steps that reduce predictions to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The claimed improvements in accuracy and UQ are positioned as empirical outcomes of the method rather than definitional equivalences. The derivation chain is therefore self-contained against external MCMC theory and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is limited to the abstract, so the ledger captures only the core modeling assumption stated there; no specific numerical free parameters or new entities are described.

axioms (1)
  • domain assumption: Prompts can be interpreted as textual parameters in a statistical model suitable for Bayesian inference.
    This premise is required for the entire framework and is introduced in the abstract as the novel perspective.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

    cs.AI 2025-10 unverdicted novelty 6.0

    A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 1 Pith paper · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam et al. GPT-4 Technical Report. arXiv:2303.08774, 2023

  2. [2]

    A statistical theory of cold posteriors in deep neural networks

    Laurence Aitchison. A statistical theory of cold posteriors in deep neural networks. In International Conference on Learning Representations, 2021

  3. [3]

    Llama-nemotron: Efficient reasoning models

    Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025

  4. [4]

    Bayesian Theory

    José M Bernardo and Adrian FM Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009

  5. [5]

    Weight uncertainty in neural network

    Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1613–1622, 2015

  6. [6]

    Emergent autonomous scientific research capabilities of large language models

    Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332, 2023

  7. [7]

    Opportunities and Challenges of AI-Driven Customer Service

    Rijul Chaturvedi and Sanjeev Verma. Opportunities and Challenges of AI-Driven Customer Service, pages 33–71. Springer International Publishing, 2023. ISBN 978-3-031-33898-4. doi: 10.1007/978-3-031-33898-4_3

  8. [8]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  9. [9]

    Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

    Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs. In Advances in Neural Information Processing Systems, volume 37, pages 71596–71642, 2024

  10. [10]

    Aime problems and solutions

    MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions

  11. [11]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. CoRR, abs/2105.03011, 2021. URL https://arxiv.org/abs/2105.03011

  12. [12]

    Laplace redux-effortless Bayesian deep learning

    Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux-effortless Bayesian deep learning. In Advances in Neural Information Processing Systems, 2021

  13. [13]

    Simon Duane, A. D. Kennedy, B. J. Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics Letters B, 195(2):216–222, 1987

  14. [14]

    Sample, don’t search: Rethinking test-time alignment for language models

    Gonçalo Faria and Noah A Smith. Sample, don’t search: Rethinking test-time alignment for language models. arXiv preprint arXiv:2504.03790, 2025

  15. [15]

    QUEST: Quality-aware metropolis-hastings sampling for machine translation

    Gonçalo Faria, Sweta Agrawal, António Farinhas, Ricardo Rei, José de Souza, and André Martins. QUEST: Quality-aware metropolis-hastings sampling for machine translation. In Advances in Neural Information Processing Systems, 2024

  16. [16]

    Bayesian neural network priors revisited

    Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Ratsch, Richard E Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. In International Conference on Learning Representations, 2022

  17. [17]

    SPUQ: Perturbation-based uncertainty quantification for large language models

    Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. arXiv preprint arXiv:2403.02509, 2024

  18. [18]

    A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing

    Carlos Gómez-Rodríguez and Paul Williams. A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14504–14528, 2023

  19. [19]

    Improving uncertainty quantification in large language models via semantic embeddings

    Yashvir S Grewal, Edwin V Bonilla, and Thang D Bui. Improving uncertainty quantification in large language models via semantic embeddings. arXiv:2410.22685, 2024

  20. [20]

    On calibration of modern neural networks

    Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330, 2017

  21. [21]

    Decomposing uncertainty for large language models through input clarification ensembling

    Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. Decomposing uncertainty for large language models through input clarification ensembling. In International Conference on Machine Learning, 2024

  22. [22]

    Automated Design of Agentic Systems

    Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv:2408.08435, 2024

  23. [23]

    What Are Bayesian Neural Network Posteriors Really Like?

    Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. What Are Bayesian Neural Network Posteriors Really Like? In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4629–4640, 2021

  24. [24]

    Estimating the hallucination rate of generative AI

    Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P Cunningham, and David Blei. Estimating the hallucination rate of generative AI. In Advances in Neural Information Processing Systems, 2024

  25. [25]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

  26. [26]

    On uncertainty, tempering, and data augmentation in bayesian classification

    Sanyam Kapoor, Wesley J Maddox, Pavel Izmailov, and Andrew G Wilson. On uncertainty, tempering, and data augmentation in bayesian classification. In Advances in Neural Information Processing Systems, volume 35, 2022

  27. [27]

    DSPy: Compiling declarative language model calls into state-of-the-art pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Represen...

  28. [28]

    Auto-encoding variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014

  29. [29]

    Being Bayesian, even just a bit, fixes overconfidence in relu networks

    Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being Bayesian, even just a bit, fixes overconfidence in relu networks. In International Conference on Machine Learning, 2020

  30. [30]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023

  31. [31]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017

  32. [32]

    Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

  33. [33]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. In Transactions on Machine Learning Research, 2024

  34. [34]

    Uncertainty quantification for in-context learning of large language models

    Chen Ling, Xujiang Zhao, Xuchao Zhang, Wei Cheng, Yanchi Liu, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Jie Ji, et al. Uncertainty quantification for in-context learning of large language models. arXiv:2402.10189, 2024

  35. [35]

    Information Theory, Inference and Learning Algorithms

    David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003

  36. [36]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Sy...

  37. [37]

    SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models

    Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.557

  38. [38]

    On faithfulness and factuality in abstractive summarization

    Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020

  39. [39]

    FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.741

  41. [41]

    Language models with conformal factuality guarantees

    Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, 2024

  42. [42]

    Data augmentation in Bayesian neural networks and the cold posterior effect

    Seth Nabarro, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, and Laurence Aitchison. Data augmentation in Bayesian neural networks and the cold posterior effect. In Uncertainty in Artificial Intelligence, pages 1434–1444. PMLR, 2022

  43. [43]

    Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, 1996. doi: 10.1007/978-1-4612-0745-0

  44. [44]

    Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities

    Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, 2024

  45. [45]

    Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect

    Lorenzo Noci, Kevin Roth, Gregor Bachmann, Sebastian Nowozin, and Thomas Hofmann. Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect. In Advances in Neural Information Processing Systems, volume 34, 2021

  46. [46]

    Obtaining well calibrated probabilities using bayesian binning

    Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), 2015. doi: 10.1609/aaai.v29i1.9602

  47. [47]

    Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space

    Xin Qiu and Risto Miikkulainen. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems, 2024

  48. [48]

    A Scalable Laplace Approximation for Neural Networks

    Hippolyt Ritter, Aleksandar Botev, and David Barber. A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations, 2018

  49. [49]

    A scalable Laplace approximation for neural networks

    Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018

  50. [50]

    The Metropolis-Hastings algorithm

    Christian P Robert. The Metropolis-Hastings algorithm. arXiv:1504.01896, 2015

  51. [51]

    Optimal proposal distributions and adaptive MCMC

    Jeffrey S Rosenthal. Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo, 4(10.1201):93–111, 2011

  52. [52]

    Mean field theory for sigmoid belief networks

    Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996

  53. [53]

    Agent Laboratory: Using LLM Agents as Research Assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants. arXiv:2501.04227, 2025

  54. [54]

    An efficient minibatch acceptance test for metropolis-hastings

    Daniel Seita, Xinlei Pan, Haoyu Chen, and John Canny. An efficient minibatch acceptance test for metropolis-hastings. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5359–5363, 2018

  55. [55]

    A tutorial on conformal prediction

    Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008

  56. [56]

    Algorithmic Learning in a Random World

    Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005

  57. [57]

    Lora ensembles for large language model fine-tuning

    Xi Wang, Laurence Aitchison, and Maja Rudolph. Lora ensembles for large language model fine-tuning. arXiv:2310.00035, 2023

  58. [58]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

  59. [59]

    Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708, 2021

  60. [60]

    Helpsteer2-preference: Complementing ratings with preferences

    Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. Helpsteer2-preference: Complementing ratings with preferences, 2024. URL https://arxiv.org/abs/2410.01257

  61. [61]

    On subjective uncertainty quantification and calibration in natural language generation

    Ziyu Wang and Chris Holmes. On subjective uncertainty quantification and calibration in natural language generation. arXiv:2406.05213, 2024

  62. [62]

    Jailbroken: How does LLM safety training fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

  63. [63]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  64. [64]

    Measuring short-form factuality in large language models

    Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

  65. [65]

    Bayesian Learning via Stochastic Gradient Langevin Dynamics

    Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688,

  66. [66]

    ISBN 978-1-4503-0619-5

  67. [67]

    Characterizing LLM abstention behavior in science QA with context perturbations

    Bingbing Wen, Bill Howe, and Lucy Lu Wang. Characterizing LLM abstention behavior in science QA with context perturbations, 2024. URL https://arxiv.org/abs/2404.12452

  68. [68]

    How good is the Bayes posterior in deep neural networks really?

    Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 10248–10259, 2020

  69. [69]

    Intelligent agents: Theory and practice

    Michael Wooldridge and Nicholas R Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995

  70. [70]

    The rise and potential of large language model based agents: A survey

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, a...

  71. [71]

    Hallucination is Inevitable: An Innate Limitation of Large Language Models

    Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv:2401.11817, 2024

  72. [72]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066, 2025

  73. [73]

    Backdooring instruction-tuned large language models with virtual prompt injection

    Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

  74. [74]

    Bayesian low-rank adaptation for large language models

    Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. In International Conference on Learning Representations, 2024

  75. [75]

    On Verbalized Confidence Scores for LLMs

    Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Yamada. On verbalized confidence scores for LLMs. arXiv:2412.14737, 2024

  76. [76]

    Optimizing generative AI by backpropagating language model feedback

    Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative AI by backpropagating language model feedback. Nature, 639:609–616, 2025

  77. [77]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

  78. [78]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022

  79. [79]

    GPTSwarm: Language Agents as Optimizable Graphs

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language Agents as Optimizable Graphs. In Forty-first International Conference on Machine Learning, 2024

  80. [80]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043, 2023

A Method Details

In general, MCMC can only be applied to Bayesian inference when g(θ) is calculable, where g(θ) is defined by g(θ) = p(θ) p(D | θ) = p(θ) ∏_{i=1}^{n} p(y...