Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

arXiv: 2506.10060 · v2 · submitted 2025-06-11 · 💻 cs.LG · cs.AI · stat.ML

Brendan Leigh Ross, Noël Vouitsis, Atiyeh Ashari Ghomi, Rasa Hosseinzadeh, Ji Xin, Zhaoyan Liu, Yi Sui, Shiyi Hou, Kin Kwan Leung, Gabriel Loaiza-Ganem, Jesse C. Cresswell

Pith reviewed 2026-05-19 09:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords: Bayesian inference · LLM prompts · uncertainty quantification · MCMC · Metropolis-Hastings · prompt engineering · textual parameters

The pith

Treating prompts as textual parameters enables Bayesian inference over LLM prompts and predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes viewing prompts in LLM-based systems as parameters within a statistical model. This framing permits Bayesian inference over the prompts themselves using a modest training dataset. It further allows free-form textual priors to inform the process and supports uncertainty quantification for both the prompts and the resulting predictions. A sympathetic reader would care because many LLM applications depend heavily on prompt choice yet lack reliable ways to measure and control uncertainty, especially when models are closed-source.

Core claim

Interpreting prompts as textual parameters in a statistical model enables principled Bayesian inference over these prompts and downstream predictions while incorporating free-form textual priors. To carry out the inference the authors introduce Metropolis-Hastings through LLM Proposals (MHLP), a Markov chain Monte Carlo algorithm that pairs prompt-optimization techniques with standard MCMC sampling. The method functions as a turnkey addition to existing pipelines, including those using only black-box models, and produces measurable gains in predictive accuracy and uncertainty calibration on LLM benchmarks and dedicated UQ tasks.

What carries the argument

Metropolis-Hastings through LLM Proposals (MHLP), a Markov chain Monte Carlo sampler that generates candidate textual prompts via LLM-driven optimization to approximate the posterior distribution over prompts.
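
To make the sampling idea concrete, here is a minimal sketch of a generic Metropolis-Hastings loop over string-valued states. It is not the paper's MHLP implementation: the three prompts, their posterior weights, and the functions `log_target`, `propose`, and `log_q` are hypothetical stand-ins, and the uniform random choice replaces what would be an LLM-generated proposal.

```python
import math
import random
from collections import Counter


def metropolis_hastings(init, log_target, propose, log_q, n_steps, rng):
    """Generic Metropolis-Hastings over arbitrary states (here: prompt strings).

    log_target(x)    -- unnormalized log posterior of state x
    propose(x, rng)  -- draws a candidate given the current state
    log_q(new, old)  -- log-probability of proposing `new` from `old`
    """
    x = init
    samples = []
    for _ in range(n_steps):
        cand = propose(x, rng)
        # Log Hastings ratio: target ratio times reverse/forward proposal ratio.
        log_alpha = (log_target(cand) - log_target(x)
                     + log_q(x, cand) - log_q(cand, x))
        # Accept with probability min(1, exp(log_alpha)).
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = cand
        samples.append(x)
    return samples


# Toy stand-ins (hypothetical): three candidate prompts with made-up
# unnormalized posterior weights 1 : 2 : 3.
WEIGHTS = {
    "Answer briefly.": 1.0,
    "Think step by step.": 2.0,
    "Answer in one word.": 3.0,
}
PROMPTS = list(WEIGHTS)


def log_target(prompt):
    return math.log(WEIGHTS[prompt])


def propose(current, rng):
    # Stand-in for an LLM rewrite: pick any prompt uniformly at random.
    return rng.choice(PROMPTS)


def log_q(new, old):
    # Uniform proposal, so the proposal terms cancel in the ratio.
    return -math.log(len(PROMPTS))


samples = metropolis_hastings(PROMPTS[0], log_target, propose, log_q,
                              n_steps=20000, rng=random.Random(0))
freqs = {p: n / len(samples) for p, n in Counter(samples).items()}
print(freqs)  # frequencies should approach 1/6, 2/6, 3/6
```

In a real pipeline the hard parts are exactly the ones this toy hides: evaluating the target for a black-box LLM and accounting for an asymmetric, LLM-driven proposal.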

If this is right

Uncertainty can be quantified jointly over the choice of prompt and the downstream model output.
Prior knowledge about good prompts can be expressed directly in natural language and folded into the inference.
Existing LLM pipelines gain improved calibration without requiring changes to model weights or access to internals.
Predictive accuracy rises on standard benchmarks when posterior sampling replaces hand-tuned prompts.
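
The first consequence above, joint uncertainty over prompt and output, can be sketched as a Monte Carlo average over sampled prompts. This is an illustrative sketch only: `posterior_predictive` and `toy_answer_fn` are hypothetical names, and the lookup table stands in for an actual LLM call.

```python
from collections import Counter


def posterior_predictive(prompt_samples, answer_fn, question):
    """Estimate p(answer | question, data) by averaging over sampled prompts.

    Each posterior prompt sample casts one vote for the answer it produces.
    """
    votes = Counter(answer_fn(p, question) for p in prompt_samples)
    total = sum(votes.values())
    return {answer: n / total for answer, n in votes.items()}


# Hypothetical stand-in for an LLM call: a fixed lookup table per prompt.
def toy_answer_fn(prompt, question):
    return {"p1": "4", "p2": "4", "p3": "5"}[prompt]


# Pretend these four prompts were drawn from the posterior (p1 drawn twice).
pred = posterior_predictive(["p1", "p1", "p2", "p3"], toy_answer_fn, "2 + 2 = ?")
print(pred)  # {'4': 0.75, '5': 0.25}
```

Disagreement among the sampled prompts shows up directly as spread in the predictive distribution, which is what a single fixed prompt cannot provide.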

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same posterior-sampling idea could be applied to other discrete choices such as tool selection or chain-of-thought templates.
Hybrid systems might combine MHLP with gradient-based methods when partial access to model internals becomes available.
Empirical studies on prompt-length scaling and mixing time would clarify practical limits of the approach.

Load-bearing premise

The MHLP algorithm can perform effective Bayesian inference over the discrete high-dimensional space of textual prompts even when the underlying LLM is a black box.

What would settle it

On a controlled benchmark where prompt sensitivity is known, if MHLP chains fail to mix or if the resulting uncertainty estimates show no improvement in calibration or accuracy over standard single-prompt baselines, the central claim would be falsified.
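
One way to run the calibration half of this test is expected calibration error (ECE), a common binned calibration metric. The summary above does not say which metric the paper reports, so treat this as a hedged sketch with an assumed equal-width binning scheme rather than the authors' evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; confidence 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


# Overconfident: claims 90% confidence but is right 3 times out of 4.
ece_over = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
# Well calibrated: claims 50% confidence and is right half the time.
ece_ok = expected_calibration_error([0.5, 0.5], [1, 0])
print(ece_over, ece_ok)
```

Comparing this number for a single-prompt baseline against the prompt-averaged predictions would show whether sampling prompts improves calibration on a given benchmark.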

Figures

Figures reproduced from arXiv: 2506.10060 by Atiyeh Ashari Ghomi, Brendan Leigh Ross, Gabriel Loaiza-Ganem, Jesse C. Cresswell, Ji Xin, Kin Kwan Leung, Noël Vouitsis, Rasa Hosseinzadeh, Shiyi Hou, Yi Sui, Zhaoyan Liu.

**Figure 1.** In chain-of-thought (CoT) prompting (left), answers are generated by an LLM using a single fixed prompt; this frequentist approach does not account for uncertainty about how the model should be prompted, causing potential issues such as overconfidence on incorrect answers. In our Textual Bayes approach (right), we sample prompts from our Bayesian posterior and then use each prompt to generate answers from …

**Figure 2.** Comparison of conformal factuality for frequency scoring with a fixed prompt [40], and with prompts sampled through MHLP. (a) The empirical factuality achieved in practice is consistently within the bounds guaranteed by Equation 8. (b) MHLP achieves the same level of empirical factuality as frequency scoring but removes fewer claims, indicating better calibrated confidence.

Original abstract

Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem--one that limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference--a difficult problem even for well-studied data modalities--we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames prompts as Bayesian textual parameters and introduces MHLP for MCMC sampling, but the method's ability to deliver actual posterior samples in a huge discrete space is not clearly shown.

Colleague, the main thing here is that they treat prompts as random variables in a statistical model and run a Metropolis-Hastings chain where the LLM itself proposes new prompts. This lets them do inference over both the prompt and the downstream predictions while allowing free-form text priors. The approach is designed to slot into existing pipelines, including closed-source ones, which is a practical move. They report gains in accuracy and calibration on benchmarks, which would matter if the gains are real and reproducible.

What they do well is take the black-box constraint seriously and avoid needing gradients or model internals. The integration of prompt optimization ideas with standard MCMC is a distinct step that is not just rehashing existing prompt tuning work. The textual prior angle also fits the LLM setting naturally.

The soft spot is the sampling step. The space of prompts grows exponentially with length, and the proposal distribution comes from another LLM call. Without reported diagnostics for mixing, multiple-chain convergence, or even basic trace plots, it is hard to tell whether the samples represent the posterior or just a biased search that happens to find better prompts. The acceptance ratio also depends on being able to evaluate the target density, which is tricky when the LLM is closed. If those pieces are missing or only sketched, the uncertainty quantification claims rest on an unverified assumption rather than demonstrated Bayesian behavior.

This is aimed at applied people who need better UQ for LLM systems in sensitive settings. A reader who already works on calibration or prompt robustness would get concrete ideas to try. It deserves a serious referee because the problem is real and the method is concrete enough to be tested and improved.

Referee Report

2 major / 2 minor

Summary. The paper claims that interpreting prompts as textual parameters in a statistical model enables Bayesian inference over prompts and predictions via a new MCMC algorithm, Metropolis-Hastings through LLM Proposals (MHLP). This allows incorporation of free-form textual priors and yields improvements in predictive accuracy and uncertainty quantification on LLM benchmarks and UQ tasks, even for closed-source models.

Significance. If the central claim holds, the work provides a practical bridge between Bayesian methods and LLM pipelines, enabling principled UQ over prompt sensitivity without requiring white-box access. It could support more calibrated systems in high-stakes settings by leveraging existing prompt optimization techniques within an MCMC framework.

major comments (2)

[§3] §3 (MHLP algorithm description): The central claim requires that MHLP produces samples from the posterior p(prompt | data). However, the paper provides no proof or diagnostic that the LLM-based proposal kernel is irreducible and aperiodic over the combinatorial space of textual prompts, nor that the Metropolis-Hastings acceptance ratio is correctly evaluated when the likelihood involves a black-box LLM. This undermines the assertion that reported accuracy and calibration gains are Bayesian rather than artifacts of the proposal mechanism.
[§4] §4 (Experimental evaluation): The reported improvements in predictive accuracy and UQ are presented without convergence diagnostics (e.g., trace plots, effective sample size, or Gelman-Rubin statistics) or details on chain length, burn-in, or proposal tuning. In a discrete space whose size grows exponentially with prompt length, this leaves open the possibility that the sampler has not mixed, rendering the empirical gains non-Bayesian.
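
The Gelman-Rubin statistic named in the second major comment can be computed on any scalar trace, for example the log-posterior of each chain's current prompt. Below is a minimal sketch of the classic (non-split) formula; the function name and the Gaussian toy chains are assumptions for illustration, not output from MHLP.

```python
import random
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length scalar chains."""
    m = len(chains)       # number of chains
    n = len(chains[0])    # draws per chain
    means = [statistics.fmean(c) for c in chains]
    grand_mean = statistics.fmean(means)
    # Between-chain variance divided by n.
    b_over_n = sum((mu - grand_mean) ** 2 for mu in means) / (m - 1)
    # Mean within-chain sample variance.
    w = statistics.fmean([statistics.variance(c) for c in chains])
    var_hat = (n - 1) / n * w + b_over_n
    return (var_hat / w) ** 0.5


rng = random.Random(0)
chain_a = [rng.gauss(0.0, 1.0) for _ in range(500)]
chain_b = [rng.gauss(0.0, 1.0) for _ in range(500)]
chain_stuck = [x + 5.0 for x in chain_a]  # same shape, different mode

r_mixed = gelman_rubin([chain_a, chain_b])      # close to 1: chains agree
r_stuck = gelman_rubin([chain_a, chain_stuck])  # far above 1: chains disagree
print(r_mixed, r_stuck)
```

Values near 1 are necessary but not sufficient evidence of mixing; chains that all miss the same region of prompt space would still pass this check.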

minor comments (2)

[Abstract] The abstract states that MHLP is a 'turnkey modification' but does not clarify how the likelihood is approximated when using closed-source models; this should be expanded with a concrete example in the methods.
[Notation] Notation for the textual prior and the target density should be introduced earlier and used consistently to improve readability for readers unfamiliar with prompt engineering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [§3] §3 (MHLP algorithm description): The central claim requires that MHLP produces samples from the posterior p(prompt | data). However, the paper provides no proof or diagnostic that the LLM-based proposal kernel is irreducible and aperiodic over the combinatorial space of textual prompts, nor that the Metropolis-Hastings acceptance ratio is correctly evaluated when the likelihood involves a black-box LLM. This undermines the assertion that reported accuracy and calibration gains are Bayesian rather than artifacts of the proposal mechanism.

Authors: We acknowledge that the manuscript does not include a formal proof of irreducibility or aperiodicity for the LLM proposal kernel, nor explicit diagnostics for the acceptance ratio in the black-box setting. The combinatorial and effectively unbounded nature of the prompt space makes such proofs challenging and non-standard compared to finite-state MCMC. However, MHLP follows the standard Metropolis-Hastings framework: proposals are generated by the LLM (leveraging prompt optimization techniques), and the acceptance ratio is computed using the exact ratio of the unnormalized posterior densities, where the likelihood p(data | prompt) is obtained by querying the (black-box) LLM on the training data. The method is thus correct whenever the proposal kernel allows exploration, which our empirical results across multiple benchmarks support through consistent gains over non-Bayesian baselines. We will add a dedicated discussion subsection on these theoretical considerations and practical limitations in the revision. revision: partial
Referee: [§4] §4 (Experimental evaluation): The reported improvements in predictive accuracy and UQ are presented without convergence diagnostics (e.g., trace plots, effective sample size, or Gelman-Rubin statistics) or details on chain length, burn-in, or proposal tuning. In a discrete space whose size grows exponentially with prompt length, this leaves open the possibility that the sampler has not mixed, rendering the empirical gains non-Bayesian.

Authors: We agree that the absence of convergence diagnostics is a limitation, particularly given the discrete prompt space. The current experiments report results from multiple independent chains but do not include trace plots, effective sample sizes, Gelman-Rubin statistics, or explicit details on burn-in and tuning. In the revised manuscript we will incorporate these diagnostics (including trace plots for log-posterior and accuracy metrics, ESS values, and chain length/burn-in specifications) to demonstrate adequate mixing and support that the reported gains arise from posterior sampling rather than proposal artifacts. revision: yes
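
For reference, the textbook Metropolis-Hastings acceptance probability that the first response appeals to can be written as below. The symbols are assumptions introduced here, not the paper's notation: θ is the current prompt, θ' the proposed prompt, D the training data, p(θ) the textual prior, and q the LLM proposal kernel.

```latex
\alpha(\theta' \mid \theta)
  = \min\!\left(1,\;
      \frac{p(\mathcal{D} \mid \theta')\, p(\theta')\, q(\theta \mid \theta')}
           {p(\mathcal{D} \mid \theta)\, p(\theta)\, q(\theta' \mid \theta)}
    \right)
```

When q is symmetric, the proposal terms cancel and only the ratio of unnormalized posteriors remains, which is the quantity the response describes. The summary above does not say how the paper evaluates or approximates q for LLM-generated proposals.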

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper frames prompts as textual parameters and introduces MHLP as a novel MCMC adaptation that combines prompt optimization with standard Metropolis-Hastings. The abstract and provided description present this as an independent algorithmic contribution enabling Bayesian inference over discrete textual spaces, without any quoted equations or steps that reduce predictions to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The claimed improvements in accuracy and UQ are positioned as empirical outcomes of the method rather than definitional equivalences. The derivation chain is therefore self-contained against external MCMC theory and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is limited to the abstract, so the ledger captures only the core modeling assumption stated there; no specific numerical free parameters or new entities are described.

axioms (1)

domain assumption: Prompts can be interpreted as textual parameters in a statistical model suitable for Bayesian inference.
This premise is required for the entire framework and is introduced in the abstract as the novel perspective.

pith-pipeline@v0.9.0 · 5863 in / 1318 out tokens · 57390 ms · 2026-05-19T09:16:59.298419+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
cs.AI 2025-10 unverdicted novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 1 Pith paper · 12 internal anchors

[1] Josh Achiam et al. GPT-4 Technical Report. arXiv:2303.08774, 2023.
[2] Laurence Aitchison. A statistical theory of cold posteriors in deep neural networks. In International Conference on Learning Representations, 2021.
[3] Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025.
[4] José M Bernardo and Adrian FM Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009.
[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1613–1622, 2015.
[6] Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332, 2023.
[7] Rijul Chaturvedi and Sanjeev Verma. Opportunities and Challenges of AI-Driven Customer Service, pages 33–71. Springer International Publishing, 2023. ISBN 978-3-031-33898-4. doi: 10.1007/978-3-031-33898-4_3
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
[9] Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs. In Advances in Neural Information Processing Systems, volume 37, pages 71596–71642, 2024.
[10] MAA Committees. AIME problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions
[11] Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. CoRR, abs/2105.03011, 2021. URL https://arxiv.org/abs/2105.03011.
[12] Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux-effortless Bayesian deep learning. In Advances in Neural Information Processing Systems, 2021.
[13] Simon Duane, A. D. Kennedy, B. J. Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[14] Gonçalo Faria and Noah A Smith. Sample, don’t search: Rethinking test-time alignment for language models. arXiv preprint arXiv:2504.03790, 2025.
[15] Gonçalo Faria, Sweta Agrawal, António Farinhas, Ricardo Rei, José de Souza, and André Martins. QUEST: Quality-aware Metropolis-Hastings sampling for machine translation. In Advances in Neural Information Processing Systems, 2024.
[16] Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Ratsch, Richard E Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. In International Conference on Learning Representations, 2022.
[17] Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. arXiv preprint arXiv:2403.02509, 2024.
[18] Carlos Gómez-Rodríguez and Paul Williams. A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14504–14528, 2023.
[19] Yashvir S Grewal, Edwin V Bonilla, and Thang D Bui. Improving uncertainty quantification in large language models via semantic embeddings. arXiv:2410.22685, 2024.
[20] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330, 2017.
[21] Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. Decomposing uncertainty for large language models through input clarification ensembling. In International Conference on Machine Learning, 2024.
[22] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv:2408.08435, 2024.
[23] Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. What Are Bayesian Neural Network Posteriors Really Like? In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4629–4640, 2021.
[24] Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P Cunningham, and David Blei. Estimating the hallucination rate of generative AI. In Advances in Neural Information Processing Systems, 2024.
[25] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
[26] Sanyam Kapoor, Wesley J Maddox, Pavel Izmailov, and Andrew G Wilson. On uncertainty, tempering, and data augmentation in Bayesian classification. In Advances in Neural Information Processing Systems, volume 35, 2022.
[27] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Represen...
[28] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
[29] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being Bayesian, even just a bit, fixes overconfidence in ReLU networks. In International Conference on Machine Learning, 2020.
[30] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023.
[31] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017.
[32] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020.
[33] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. In Transactions on Machine Learning Research, 2024.
[34] Chen Ling, Xujiang Zhao, Xuchao Zhang, Wei Cheng, Yanchi Liu, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Jie Ji, et al. Uncertainty quantification for in-context learning of large language models. arXiv:2402.10189, 2024.
[35] David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[36] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processing Sy...
[37] Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.557
[38] Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020.
[39]

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics,

work page 2023
[40]

doi: 10.18653/v1/2023.emnlp-main.741

work page doi:10.18653/v1/2023.emnlp-main.741 2023
[41]

Language models with conformal factuality guarantees

Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[42]

Data augmentation in Bayesian neural networks and the cold posterior effect

Seth Nabarro, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, and Laurence Aitchison. Data augmentation in Bayesian neural networks and the cold posterior effect. In Uncertainty in Artificial Intelligence, pages 1434–1444. PMLR, 2022

work page 2022
[43]

Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, 1996. doi: 10.1007/978-1-4612-0745-0

work page doi:10.1007/978-1-4612-0745-0 1996
[44]

Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, 2024. 12

work page 2024
[45]

Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect

Lorenzo Noci, Kevin Roth, Gregor Bachmann, Sebastian Nowozin, and Thomas Hofmann. Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect. In Advances in Neural Information Processing Systems, volume 34, 2021

work page 2021
[46]

Obtaining well calibrated probabilities using bayesian binning

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. Proceedings of the AAAI Conference on Artificial Intelli- gence, 29(1), 2015. doi: 10.1609/aaai.v29i1.9602

work page doi:10.1609/aaai.v29i1.9602 2015
[47]

Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space

Xin Qiu and Risto Miikkulainen. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems, 2024

[48] Hippolyt Ritter, Aleksandar Botev, and David Barber. A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations, 2018

[49] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018

[50] Christian P Robert. The Metropolis-Hastings algorithm. arXiv:1504.01896, 2015

[51] Jeffrey S Rosenthal. Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo, 4(10.1201):93–111, 2011

[52] Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996

[53] Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants. arXiv:2501.04227, 2025

[54] Daniel Seita, Xinlei Pan, Haoyu Chen, and John Canny. An efficient minibatch acceptance test for metropolis-hastings. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5359–5363, 2018

[55] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008

[56] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005

[57] Xi Wang, Laurence Aitchison, and Maja Rudolph. Lora ensembles for large language model fine-tuning. arXiv:2310.00035, 2023

[58] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

[59] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708, 2021

[60] Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. Helpsteer2-preference: Complementing ratings with preferences, 2024. URL https://arxiv.org/abs/2410.01257

[61] Ziyu Wang and Chris Holmes. On subjective uncertainty quantification and calibration in natural language generation. arXiv:2406.05213, 2024

[62] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

[63] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

[64] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

[65] Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688. ISBN 978-1-4503-0619-5

[67] Bingbing Wen, Bill Howe, and Lucy Lu Wang. Characterizing llm abstention behavior in science qa with context perturbations, 2024. URL https://arxiv.org/abs/2404.12452

[68] Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 10248–10259, 2020

[69] Michael Wooldridge and Nicholas R Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995

[70] Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, a...

[71] Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv:2401.11817, 2024

[72] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066, 2025

[73] Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

[74] Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. In International Conference on Learning Representations, 2024

[75] Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Yamada. On verbalized confidence scores for LLMs. arXiv:2412.14737, 2024

[76] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative AI by backpropagating language model feedback. Nature, 639:609–616, 2025

[77] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[78] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022

[79] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language Agents as Optimizable Graphs. In Forty-first International Conference on Machine Learning, 2024

[80] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043, 2023

A Method Details

In general, MCMC can only be applied to Bayesian inference when g(θ) is calculable, where g(θ) is defined by g(θ) = p(θ) p(D | θ) = p(θ) ∏_{i=1}^{n} p(y...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

GPT-4 Technical Report

Josh Achiam et al. GPT-4 Technical Report. arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

A statistical theory of cold posteriors in deep neural networks

Laurence Aitchison. A statistical theory of cold posteriors in deep neural networks. In International Conference on Learning Representations, 2021

work page 2021

[3] [3]

Llama-nemotron: Efficient reasoning models

Akhiad Bercovich, Itay Levy, Izik Golan, Mohammad Dabbah, Ran El-Yaniv, Omri Puny, Ido Galil, Zach Moshe, Tomer Ronen, Najeeb Nabwani, et al. Llama-nemotron: Efficient reasoning models. arXiv preprint arXiv:2505.00949, 2025

work page arXiv 2025

[4] [4]

Bayesian Theory, volume 405

José M Bernardo and Adrian FM Smith. Bayesian Theory, volume 405. John Wiley & Sons, 2009

work page 2009

[5] [5]

Weight uncertainty in neural network

Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 1613–1622, 2015

work page 2015

[6] [6]

Emergent autonomous scientific research capabilities of large language models

Daniil A Boiko, Robert MacKnight, and Gabe Gomes. Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332, 2023

work page internal anchor Pith review arXiv 2023

[7] [7]

Opportunities and Challenges of AI-Driven Customer Service, pages 33–71

Rijul Chaturvedi and Sanjeev Verma. Opportunities and Challenges of AI-Driven Customer Service, pages 33–71. Springer International Publishing, 2023. ISBN 978-3-031-33898-4. doi: 10.1007/978-3-031-33898-4_3

work page doi:10.1007/978-3-031-33898-4_3 2023

[8] [8]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [9]

Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs

Ching-An Cheng, Allen Nie, and Adith Swaminathan. Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs. In Advances in Neural Information Processing Systems, volume 37, pages 71596–71642, 2024

work page 2024

[10] [10]

Aime problems and solutions

MAA Committees. Aime problems and solutions. https://artofproblemsolving.com/ wiki/index.php/AIME_Problems_and_Solutions

work page

[11] [11]

Smith, and Matt Gardner

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. CoRR, abs/2105.03011, 2021. URL https://arxiv.org/abs/2105.03011. 10

work page arXiv 2021

[12] [12]

Laplace redux-effortless Bayesian deep learning

Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux-effortless Bayesian deep learning. In Advances in Neural Information Processing Systems, 2021

work page 2021

[13] [13]

Simon Duane, A. D. Kennedy, B. J. Pendleton, and Duncan Roweth. Hybrid monte carlo. Physics Letters B, 195(2):216–222, 1987

work page 1987

[14] [14]

Sample, don’t search: Rethinking test-time alignment for language models

Gonçalo Faria and Noah A Smith. Sample, don’t search: Rethinking test-time alignment for language models. arXiv preprint arXiv:2504.03790, 2025

work page arXiv 2025

[15] [15]

QUEST: Quality-aware metropolis-hastings sampling for machine translation

Gonçalo Faria, Sweta Agrawal, António Farinhas, Ricardo Rei, José de Souza, and André Martins. QUEST: Quality-aware metropolis-hastings sampling for machine translation. In Advances in Neural Information Processing Systems, 2024

work page 2024

[16] [16]

Ober, Florian Wenzel, Gunnar Ratsch, Richard E Turner, Mark van der Wilk, and Laurence Aitchison

Vincent Fortuin, Adrià Garriga-Alonso, Sebastian W. Ober, Florian Wenzel, Gunnar Ratsch, Richard E Turner, Mark van der Wilk, and Laurence Aitchison. Bayesian neural network priors revisited. In International Conference on Learning Representations, 2022

work page 2022

[17] [17]

SPUQ: Perturbation-based uncertainty quantification for large language models

Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. SPUQ: Perturbation-based uncertainty quantification for large language models. arXiv preprint arXiv:2403.02509, 2024

work page arXiv 2024

[18] [18]

A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing

Carlos Gómez-Rodríguez and Paul Williams. A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14504–14528, 2023

work page 2023

[19] [19]

Improving uncertainty quantification in large language models via semantic embeddings

Yashvir S Grewal, Edwin V Bonilla, and Thang D Bui. Improving uncertainty quantification in large language models via semantic embeddings. arXiv:2410.22685, 2024

work page arXiv 2024

[20] [20]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1321–1330, 2017

work page 2017

[21] [21]

De- composing uncertainty for large language models through input clarification ensembling

Bairu Hou, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang, and Yang Zhang. De- composing uncertainty for large language models through input clarification ensembling. In International Conference on Machine Learning, 2024

work page 2024

[22] [22]

Automated Design of Agentic Systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. arXiv:2408.08435, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [23]

What Are Bayesian Neural Network Posteriors Really Like? In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4629–4640, 2021

Pavel Izmailov, Sharad Vikram, Matthew D Hoffman, and Andrew Gordon Wilson. What Are Bayesian Neural Network Posteriors Really Like? In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4629–4640, 2021

work page 2021

[24] [24]

Estimating the hallucination rate of generative AI

Andrew Jesson, Nicolas Beltran Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P Cunningham, and David Blei. Estimating the hallucination rate of generative AI. In Advances in Neural Information Processing Systems, 2024

work page 2024

[25] [25]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [26]

On uncertainty, tempering, and data augmentation in bayesian classification

Sanyam Kapoor, Wesley J Maddox, Pavel Izmailov, and Andrew G Wilson. On uncertainty, tempering, and data augmentation in bayesian classification. In Advances in Neural Information Processing Systems, volume 35, 2022

work page 2022

[27] [27]

Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan A, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In The Twelfth International Conference on Learning Represen...

work page 2024

[28] [28]

Auto-encoding variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014. 11

work page 2014

[29] [29]

Being Bayesian, even just a bit, fixes overconfidence in relu networks

Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being Bayesian, even just a bit, fixes overconfidence in relu networks. In International Conference on Machine Learning, 2020

work page 2020

[30] [30]

Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023

work page 2023

[31] [31]

Simple and scalable predictive uncertainty estimation using deep ensembles

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, 2017

work page 2017

[32] [32]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474, 2020

work page 2020

[33] [33]

Generating with confidence: Uncertainty quantification for black-box large language models

Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models. In Transactions on Machine Learning Research, 2024

work page 2024

[34] [34]

Uncertainty quantification for in-context learning of large language models

Chen Ling, Xujiang Zhao, Xuchao Zhang, Wei Cheng, Yanchi Liu, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Jie Ji, et al. Uncertainty quantification for in-context learning of large language models. arXiv:2402.10189, 2024

work page arXiv 2024

[35] [35]

Information Theory, Inference and Learning Algorithms

David JC MacKay. Information Theory, Inference and Learning Algorithms . Cambridge University Press, 2003

work page 2003

[36] [36]

Self- Refine: Iterative Refinement with Self-Feedback

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self- Refine: Iterative Refinement with Self-Feedback. InAdvances in Neural Information Processing Sy...

work page 2023

[37] [37]

SelfCheckGPT: Zero-resource black- box hallucination detection for generative large language models

Potsawee Manakul, Adian Liusie, and Mark Gales. SelfCheckGPT: Zero-Resource Black- Box Hallucination Detection for Generative Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.557

work page doi:10.18653/v1/2023.emnlp-main.557 2023

[38] [38]

On faithfulness and factuality in abstractive summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, 2020

work page 1906

[39] [39]

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics,

work page 2023

[40] [40]

doi: 10.18653/v1/2023.emnlp-main.741

work page doi:10.18653/v1/2023.emnlp-main.741 2023

[41] [41]

Language models with conformal factuality guarantees

Christopher Mohri and Tatsunori Hashimoto. Language models with conformal factuality guarantees. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024

[42] [42]

Data augmentation in Bayesian neural networks and the cold posterior effect

Seth Nabarro, Stoil Ganev, Adrià Garriga-Alonso, Vincent Fortuin, Mark van der Wilk, and Laurence Aitchison. Data augmentation in Bayesian neural networks and the cold posterior effect. In Uncertainty in Artificial Intelligence, pages 1434–1444. PMLR, 2022

work page 2022

[43] [43]

Radford M. Neal. Bayesian Learning for Neural Networks, volume 118 of Lecture Notes in Statistics. Springer, 1996. doi: 10.1007/978-1-4612-0745-0

work page doi:10.1007/978-1-4612-0745-0 1996

[44] [44]

Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities

Alexander Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, 2024. 12

work page 2024

[45] [45]

Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect

Lorenzo Noci, Kevin Roth, Gregor Bachmann, Sebastian Nowozin, and Thomas Hofmann. Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect. In Advances in Neural Information Processing Systems, volume 34, 2021

work page 2021

[46] [46]

Obtaining well calibrated probabilities using bayesian binning

Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. Proceedings of the AAAI Conference on Artificial Intelli- gence, 29(1), 2015. doi: 10.1609/aaai.v29i1.9602

work page doi:10.1609/aaai.v29i1.9602 2015

[47] [47]

Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space

Xin Qiu and Risto Miikkulainen. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems, 2024

work page 2024

[48] [48]

A Scalable Laplace Approximation for Neural Networks

Hippolyt Ritter, Aleksandar Botev, and David Barber. A Scalable Laplace Approximation for Neural Networks. In International Conference on Learning Representations, 2018

work page 2018

[49] [49]

A scalable Laplace approximation for neural networks

Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable Laplace approximation for neural networks. In International Conference on Learning Representations, 2018

work page 2018

[50] [50]

The Metropolis-Hastings algorithm

Christian P Robert. The Metropolis-Hastings algorithm. arXiv:1504.01896, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[51] [51]

Optimal proposal distributions and adaptive MCMC

Jeffrey S Rosenthal. Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo, 4(10.1201):93–111, 2011

work page 2011

[52] [52]

Mean field theory for sigmoid belief networks

Lawrence K Saul, Tommi Jaakkola, and Michael I Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61–76, 1996

work page 1996

[53] [53]

Agent Laboratory: Using LLM Agents as Research Assistants

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Zicheng Liu, and Emad Barsoum. Agent Laboratory: Using LLM Agents as Research Assistants. arXiv:2501.04227, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [54]

An efficient minibatch acceptance test for metropolis-hastings

Daniel Seita, Xinlei Pan, Haoyu Chen, and John Canny. An efficient minibatch acceptance test for metropolis-hastings. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 5359–5363, 2018

work page 2018

[55] [55]

A tutorial on conformal prediction

Glenn Shafer and Vladimir V ovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008

work page 2008

[56] [56]

Springer, 2005

Vladimir V ovk, Alexander Gammerman, and Glenn Shafer.Algorithmic Learning in a Random World. Springer, 2005

work page 2005

[57] [57]

Lora ensembles for large language model fine-tuning

Xi Wang, Laurence Aitchison, and Maja Rudolph. Lora ensembles for large language model fine-tuning. arXiv:2310.00035, 2023

work page arXiv 2023

[58] [58]

Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

work page 2023

[59] [59]

Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation

Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8696–8708, 2021

work page 2021

[60] [60]

Helpsteer2-preference: Complementing ratings with prefer- ences

Zhilin Wang, Alexander Bukharin, Olivier Delalleau, Daniel Egert, Gerald Shen, Jiaqi Zeng, Oleksii Kuchaiev, and Yi Dong. Helpsteer2-preference: Complementing ratings with prefer- ences, 2024. URL https://arxiv.org/abs/2410.01257

work page arXiv 2024

[61] [61]

On subjective uncertainty quantification and calibration in natural language generation

Ziyu Wang and Chris Holmes. On subjective uncertainty quantification and calibration in natural language generation. arXiv:2406.05213, 2024

work page arXiv 2024

[62] [62]

Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36:80079–80110, 2023. 13

work page 2023

[63] [63]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

work page 2022

[64] [64]

Measuring short-form factuality in large language models

Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[65] [65]

Bayesian Learning via Stochastic Gradient Langevin Dynam- ics

Max Welling and Yee Whye Teh. Bayesian Learning via Stochastic Gradient Langevin Dynam- ics. In Proceedings of the 28th International Conference on Machine Learning, pages 681–688,

work page

[66] [66]

ISBN 978-1-4503-0619-5

work page

[67] [67]

Characterizing llm abstention behavior in science qa with context perturbations, 2024

Bingbing Wen, Bill Howe, and Lucy Lu Wang. Characterizing llm abstention behavior in science qa with context perturbations, 2024. URL https://arxiv.org/abs/2404.12452

work page arXiv 2024

[68] [68]

How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 10248–10259, 2020

Florian Wenzel, Kevin Roth, Bastiaan Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 10248–10259, 2020

work page 2020

[69] [69]

Intelligent agents: Theory and practice

Michael Wooldridge and Nicholas R Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995

work page 1995

[70] [70]

The rise and potential of large language model based agents: A survey

Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, Rui Zheng, Xiaoran Fan, Xiao Wang, Limao Xiong, Yuhao Zhou, Weiran Wang, Changhao Jiang, Yicheng Zou, Xiangyang Liu, Zhangyue Yin, Shihan Dou, Rongxiang Weng, Wensen Cheng, Qi Zhang, Wenjuan Qin, Yongyan Zheng, Xipeng Qiu, Xuanjing Huang, a...

work page 2025

[71] [71]

Hallucination is Inevitable: An Innate Limitation of Large Language Models

Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. Hallucination is inevitable: An innate limitation of large language models. arXiv:2401.11817, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [72]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[73] [73]

Backdooring instruction-tuned large language models with virtual prompt injection

Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: L...

[74]

Bayesian low-rank adaptation for large language models

Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. In International Conference on Learning Representations, 2024

[75]

On Verbalized Confidence Scores for LLMs

Daniel Yang, Yao-Hung Hubert Tsai, and Makoto Yamada. On verbalized confidence scores for LLMs. arXiv:2412.14737, 2024

[76]

Optimizing generative AI by backpropagating language model feedback

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Pan Lu, Zhi Huang, Carlos Guestrin, and James Zou. Optimizing generative AI by backpropagating language model feedback. Nature, 639:609–616, 2025

[77]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023

[78]

Large language models are human-level prompt engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2022

[79]

GPTSwarm: Language Agents as Optimizable Graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language Agents as Optimizable Graphs. In Forty-first International Conference on Machine Learning, 2024

[80]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043, 2023

A Method Details

In general, MCMC can only be applied to Bayesian inference when g(θ) is calculable, where g(θ) is defined by g(θ) = p(θ) p(D | θ) = p(θ) ∏_{i=1}^{n} p(y...
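
To illustrate why MCMC needs only the unnormalized quantity g(θ) = p(θ) p(D | θ), here is a minimal Python sketch of a Metropolis-Hastings step. It is not the paper's MHLP implementation. It assumes a numeric θ, a symmetric random-walk proposal, and toy Gaussian prior and likelihood functions, all of which are hypothetical stand-ins chosen for illustration. In the paper's setting θ is a textual prompt and proposals come from an LLM.

```python
import math
import random


def log_g(theta, data, log_prior, log_lik):
    """Unnormalized log posterior: log p(theta) + sum_i log p(y_i | theta)."""
    return log_prior(theta) + sum(log_lik(y, theta) for y in data)


def mh_step(theta, data, log_prior, log_lik, propose, rng):
    """One Metropolis-Hastings step, assuming a symmetric proposal."""
    candidate = propose(theta, rng)
    # Only the ratio g(candidate) / g(theta) is needed, so the normalizing
    # constant p(D) never has to be computed.
    log_ratio = (log_g(candidate, data, log_prior, log_lik)
                 - log_g(theta, data, log_prior, log_lik))
    accept = log_ratio >= 0.0 or rng.random() < math.exp(log_ratio)
    return candidate if accept else theta


# Hypothetical toy model: standard normal prior, unit-variance Gaussian
# likelihood (both written up to additive constants).
def toy_log_prior(theta):
    return -0.5 * theta ** 2


def toy_log_lik(y, theta):
    return -0.5 * (y - theta) ** 2


def toy_propose(theta, rng):
    # Symmetric random-walk proposal; the step size 0.5 is an arbitrary choice.
    return theta + rng.gauss(0.0, 0.5)


def run_chain(data, n_steps=20000, seed=0):
    """Run the chain from theta = 0 and return every visited state."""
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for _ in range(n_steps):
        theta = mh_step(theta, data, toy_log_prior, toy_log_lik, toy_propose, rng)
        samples.append(theta)
    return samples
```

For this conjugate toy model with data `[1.0, 2.0, 3.0]`, the exact posterior is Gaussian with mean 1.5, so the average of the later samples from `run_chain` should land near that value. An asymmetric proposal would additionally require the proposal-density ratio in `log_ratio`.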
