Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems
Pith reviewed 2026-05-19 09:16 UTC · model grok-4.3
The pith
Treating prompts as textual parameters enables Bayesian inference over LLM prompts and predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Interpreting prompts as textual parameters in a statistical model enables principled Bayesian inference over these prompts and downstream predictions while incorporating free-form textual priors. To carry out the inference the authors introduce Metropolis-Hastings through LLM Proposals (MHLP), a Markov chain Monte Carlo algorithm that pairs prompt-optimization techniques with standard MCMC sampling. The method functions as a turnkey addition to existing pipelines, including those using only black-box models, and produces measurable gains in predictive accuracy and uncertainty calibration on LLM benchmarks and dedicated UQ tasks.
What carries the argument
Metropolis-Hastings through LLM Proposals (MHLP), a Markov chain Monte Carlo sampler that generates candidate textual prompts via LLM-driven optimization to approximate the posterior distribution over prompts.
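To make the sampler description concrete, here is a minimal sketch of a generic Metropolis-Hastings loop over a small discrete set of prompts. It is not the authors' MHLP implementation: the three prompts, their log-scores, and the uniform `toy_propose` function are hypothetical stand-ins for the LLM-driven proposal and the black-box likelihood queries that the paper describes.

```python
import math
import random
from collections import Counter


def metropolis_hastings(init, log_target, propose, log_q, n_steps, rng):
    """Generic Metropolis-Hastings over an arbitrary discrete state space.

    log_target(x): log of the unnormalized posterior, log p(x) + log p(D | x).
    propose(x, rng): draws a candidate given the current state x.
    log_q(x_to, x_from): log-probability of proposing x_to from x_from.
    """
    current = init
    current_lp = log_target(current)
    samples = []
    for _ in range(n_steps):
        candidate = propose(current, rng)
        candidate_lp = log_target(candidate)
        # Hastings correction: accounts for asymmetric proposal kernels.
        log_alpha = (candidate_lp - current_lp
                     + log_q(current, candidate) - log_q(candidate, current))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            current, current_lp = candidate, candidate_lp
        samples.append(current)
    return samples


# Hypothetical toy target: made-up log-scores standing in for
# log p(prompt) + log p(data | prompt).
TOY_LOG_SCORES = {
    "Answer briefly.": 0.0,
    "Think step by step.": 2.0,
    "Cite your sources.": 1.0,
}


def toy_propose(current, rng):
    # Stand-in for an LLM rewrite: pick one of the other prompts uniformly.
    return rng.choice([p for p in TOY_LOG_SCORES if p != current])


def toy_log_q(x_to, x_from):
    # Uniform over the two other prompts, so the proposal is symmetric.
    return math.log(0.5)


samples = metropolis_hastings(
    "Answer briefly.", TOY_LOG_SCORES.get, toy_propose, toy_log_q,
    n_steps=5000, rng=random.Random(0),
)
frequencies = Counter(samples)
```

In this toy setting the visit frequencies approximate the normalized scores. With a real LLM proposer, `log_q` would have to be evaluated or estimated for both directions of each move, which is one of the points the referee report below raises.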
If this is right
- Uncertainty can be quantified jointly over the choice of prompt and the downstream model output.
- Prior knowledge about good prompts can be expressed directly in natural language and folded into the inference.
- Existing LLM pipelines gain improved calibration without requiring changes to model weights or access to internals.
- Predictive accuracy rises on standard benchmarks when posterior sampling replaces hand-tuned prompts.
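One common way to turn posterior prompt samples into a single prediction is to average the per-prompt predictive distributions (a Monte Carlo estimate of the posterior predictive). The sketch below assumes this averaging scheme; the `predict` callable and the toy probability table are hypothetical stand-ins for LLM calls, not something taken from the paper.

```python
from collections import Counter


def posterior_predictive(prompt_samples, predict):
    """Average label distributions over sampled prompts.

    prompt_samples: prompts drawn from the (approximate) posterior.
    predict(prompt): returns a dict mapping each label to its probability.
    """
    totals = Counter()
    for prompt in prompt_samples:
        for label, prob in predict(prompt).items():
            totals[label] += prob
    n = len(prompt_samples)
    return {label: total / n for label, total in totals.items()}


# Hypothetical per-prompt predictions for one test question.
TOY_PREDICTIONS = {
    "A": {"yes": 0.9, "no": 0.1},
    "B": {"yes": 0.4, "no": 0.6},
    "C": {"yes": 0.5, "no": 0.5},
}

# Prompt "A" appears twice, so it carries twice the weight of "B" or "C".
averaged = posterior_predictive(["A", "A", "B", "C"], TOY_PREDICTIONS.get)
```

Here `averaged["yes"]` is (0.9 + 0.9 + 0.4 + 0.5) / 4 = 0.675, which is less confident than the single best prompt would suggest.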
Where Pith is reading between the lines
- The same posterior-sampling idea could be applied to other discrete choices such as tool selection or chain-of-thought templates.
- Hybrid systems might combine MHLP with gradient-based methods when partial access to model internals becomes available.
- Empirical studies on prompt-length scaling and mixing time would clarify practical limits of the approach.
Load-bearing premise
The MHLP algorithm can perform effective Bayesian inference over the discrete high-dimensional space of textual prompts even when the underlying LLM is a black box.
What would settle it
On a controlled benchmark where prompt sensitivity is known, if MHLP chains fail to mix or if the resulting uncertainty estimates show no improvement in calibration or accuracy over standard single-prompt baselines, the central claim would be falsified.
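A calibration comparison of this kind needs a concrete metric. The sketch below computes expected calibration error (ECE) with equal-width confidence bins, which is one widely used choice; the summary does not say which calibration metric the paper reports, so treat the metric and the bin count as assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average gap between mean confidence and accuracy per bin.

    confidences: predicted probabilities in [0, 1] for the chosen answers.
    correct: 1 if the corresponding answer was right, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Four answers at 90% confidence, three of them correct: gap of 0.15.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Running the same function on a single-prompt baseline and on the posterior-averaged predictions gives the two numbers that the test described above would compare.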
Original abstract
Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem--one that limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference--a difficult problem even for well-studied data modalities--we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that interpreting prompts as textual parameters in a statistical model enables Bayesian inference over prompts and predictions via a new MCMC algorithm, Metropolis-Hastings through LLM Proposals (MHLP). This allows incorporation of free-form textual priors and yields improvements in predictive accuracy and uncertainty quantification on LLM benchmarks and UQ tasks, even for closed-source models.
Significance. If the central claim holds, the work provides a practical bridge between Bayesian methods and LLM pipelines, enabling principled UQ over prompt sensitivity without requiring white-box access. It could support more calibrated systems in high-stakes settings by leveraging existing prompt optimization techniques within an MCMC framework.
major comments (2)
- [§3] §3 (MHLP algorithm description): The central claim requires that MHLP produces samples from the posterior p(prompt | data). However, the paper provides no proof or diagnostic that the LLM-based proposal kernel is irreducible and aperiodic over the combinatorial space of textual prompts, nor that the Metropolis-Hastings acceptance ratio is correctly evaluated when the likelihood involves a black-box LLM. This undermines the assertion that reported accuracy and calibration gains are Bayesian rather than artifacts of the proposal mechanism.
- [§4] §4 (Experimental evaluation): The reported improvements in predictive accuracy and UQ are presented without convergence diagnostics (e.g., trace plots, effective sample size, or Gelman-Rubin statistics) or details on chain length, burn-in, or proposal tuning. In a discrete space whose size grows exponentially with prompt length, this leaves open the possibility that the sampler has not mixed, rendering the empirical gains non-Bayesian.
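The Gelman-Rubin statistic mentioned in the second major comment can be computed from any scalar summary tracked along each chain, for example the log-posterior of the current prompt. The following is a minimal sketch of the classic (non-split) form of the statistic; the choice of scalar summary and the toy chains are assumptions for illustration only.

```python
def gelman_rubin(chains):
    """Classic potential scale reduction factor (R-hat).

    chains: list of equal-length lists of scalar draws, one list per chain.
    Requires at least two chains and nonzero within-chain variance.
    """
    m = len(chains)
    n = len(chains[0])
    means = [sum(chain) / n for chain in chains]
    grand_mean = sum(means) / m
    # Between-chain variance of the chain means, scaled by chain length.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in means)
    # Average within-chain sample variance.
    within = sum(
        sum((x - mu) ** 2 for x in chain) / (n - 1)
        for chain, mu in zip(chains, means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Two chains exploring the same values versus two chains stuck far apart.
r_hat_mixed = gelman_rubin([[1, 2, 3, 4], [1, 2, 3, 4]])
r_hat_stuck = gelman_rubin([[0, 1, 0, 1], [10, 11, 10, 11]])
```

Values near 1 are consistent with the chains sampling the same distribution, while values well above 1 (as in the second toy case) indicate that the chains have not mixed.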
minor comments (2)
- [Abstract] The abstract states that MHLP is a 'turnkey modification' but does not clarify how the likelihood is approximated when using closed-source models; this should be expanded with a concrete example in the methods.
- [Notation] Notation for the textual prior and the target density should be introduced earlier and used consistently to improve readability for readers unfamiliar with prompt engineering.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
-
Referee: [§3] §3 (MHLP algorithm description): The central claim requires that MHLP produces samples from the posterior p(prompt | data). However, the paper provides no proof or diagnostic that the LLM-based proposal kernel is irreducible and aperiodic over the combinatorial space of textual prompts, nor that the Metropolis-Hastings acceptance ratio is correctly evaluated when the likelihood involves a black-box LLM. This undermines the assertion that reported accuracy and calibration gains are Bayesian rather than artifacts of the proposal mechanism.
Authors: We acknowledge that the manuscript does not include a formal proof of irreducibility or aperiodicity for the LLM proposal kernel, nor explicit diagnostics for the acceptance ratio in the black-box setting. The combinatorial and effectively unbounded nature of the prompt space makes such proofs challenging and non-standard compared to finite-state MCMC. However, MHLP follows the standard Metropolis-Hastings framework: proposals are generated by the LLM (leveraging prompt optimization techniques), and the acceptance ratio is computed using the exact ratio of the unnormalized posterior densities, where the likelihood p(data | prompt) is obtained by querying the (black-box) LLM on the training data. The method is thus correct whenever the proposal kernel allows exploration, which our empirical results across multiple benchmarks support through consistent gains over non-Bayesian baselines. We will add a dedicated discussion subsection on these theoretical considerations and practical limitations in the revision.
Revision: partial
-
Referee: [§4] §4 (Experimental evaluation): The reported improvements in predictive accuracy and UQ are presented without convergence diagnostics (e.g., trace plots, effective sample size, or Gelman-Rubin statistics) or details on chain length, burn-in, or proposal tuning. In a discrete space whose size grows exponentially with prompt length, this leaves open the possibility that the sampler has not mixed, rendering the empirical gains non-Bayesian.
Authors: We agree that the absence of convergence diagnostics is a limitation, particularly given the discrete prompt space. The current experiments report results from multiple independent chains but do not include trace plots, effective sample sizes, Gelman-Rubin statistics, or explicit details on burn-in and tuning. In the revised manuscript we will incorporate these diagnostics (including trace plots for log-posterior and accuracy metrics, ESS values, and chain length/burn-in specifications) to demonstrate adequate mixing and support that the reported gains arise from posterior sampling rather than proposal artifacts.
Revision: yes
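For readers unfamiliar with the effective sample size (ESS) diagnostic named in this exchange, the sketch below estimates it for one chain of scalar draws. It truncates the autocorrelation sum at the first non-positive lag, which is a simple heuristic chosen here for brevity, not necessarily the estimator the authors would use.

```python
def effective_sample_size(draws):
    """Estimate ESS as n / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(draws)
    mean = sum(draws) / n
    variance = sum((x - mean) ** 2 for x in draws) / n
    if variance == 0:
        # A chain that never moves has no defined autocorrelation.
        raise ValueError("chain has zero variance; ESS is undefined")
    tau = 1.0
    for lag in range(1, n):
        autocov = sum(
            (draws[i] - mean) * (draws[i + lag] - mean) for i in range(n - lag)
        ) / n
        rho = autocov / variance
        if rho <= 0:
            break  # Truncate at the first non-positive autocorrelation.
        tau += 2 * rho
    return n / tau


# A chain that alternates every step versus one that switches state only once.
ess_alternating = effective_sample_size([0, 1] * 50)
ess_sticky = effective_sample_size([0] * 50 + [1] * 50)
```

Both toy chains contain 100 draws with the same mean and variance, yet the sticky chain yields an ESS of roughly 3, which illustrates why raw chain length alone says little about how well a prompt sampler has explored.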
Circularity Check
No significant circularity in derivation chain
The paper frames prompts as textual parameters and introduces MHLP as a novel MCMC adaptation that combines prompt optimization with standard Metropolis-Hastings. The abstract and provided description present this as an independent algorithmic contribution enabling Bayesian inference over discrete textual spaces, without any quoted equations or steps that reduce predictions to fitted inputs by construction, self-definitional loops, or load-bearing self-citations. The claimed improvements in accuracy and UQ are positioned as empirical outcomes of the method rather than definitional equivalences. The derivation chain is therefore self-contained against external MCMC theory and benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Prompts can be interpreted as textual parameters in a statistical model suitable for Bayesian inference.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
  A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
A Method Details
In general, MCMC can only be applied to Bayesian inference when g(θ) is calculable, where g(θ) is defined by g(θ) = p(θ) p(D | θ) = p(θ) ∏_{i=1}^{n} p(y...
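The appendix fragment defines the unnormalized posterior g(θ) but is cut off before showing how it is used. For orientation, the textbook Metropolis-Hastings acceptance probability written in terms of g is given below. The proposal kernel symbol q is an assumption introduced here, since the excerpt does not show the authors' notation for it.

```latex
\alpha(\theta' \mid \theta)
  = \min\!\left(1,\;
      \frac{g(\theta')\, q(\theta \mid \theta')}
           {g(\theta)\, q(\theta' \mid \theta)}
    \right),
\qquad
g(\theta) = p(\theta)\, p(D \mid \theta)
```

Only the ratio of g values appears, so the normalizing constant p(D) never has to be computed. This is why the requirement stated in the fragment is that g(θ) be calculable, not the full posterior.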