pith. machine review for the scientific record.

arxiv: 2604.23505 · v1 · submitted 2026-04-26 · 💻 cs.SE · cs.AI


Uncertainty Propagation in LLM-Based Systems


Pith reviewed 2026-05-08 06:06 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords uncertainty propagation · LLM-based systems · systems taxonomy · socio-technical systems · error compounding · propagation mechanisms · large language models · systems engineering

The pith

Uncertainty in LLM-based systems propagates and compounds across model internals, workflows, components, state, and human processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that uncertainty is typically examined only at the output of a single model, yet real LLM applications are compound systems where uncertainty gets transformed and reused across many boundaries. It supplies a conceptual framing to describe these propagated signals and a taxonomy of mechanisms operating at intra-model, system-level, and socio-technical layers. A reader would care because the absence of such treatment allows early errors to spread in ways that are hard to detect or control. The authors also extract engineering insights from the taxonomy and name five open research challenges.

Core claim

Deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. The paper develops a systems-level account by introducing a conceptual framing for characterising propagated uncertainty signals and presenting a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, while synthesising cross-cutting engineering insights and identifying five open research challenges.

What carries the argument

A structured taxonomy of uncertainty propagation mechanisms divided into intra-model (P1), system-level (P2), and socio-technical (P3) categories, supported by a conceptual framing for characterising propagated uncertainty signals.
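
The P1/P2/P3 structure can be sketched as a small lookup table, using only mechanism labels that appear in the paper's figure captions. The `Layer` enum, the `MECHANISMS` dict, and the short descriptions are illustrative paraphrases, not the paper's own notation:

```python
from enum import Enum

class Layer(Enum):
    """Top-level layers of the propagation taxonomy (P1/P2/P3)."""
    P1_INTRA_MODEL = "intra-model"
    P2_SYSTEM_LEVEL = "system-level"
    P3_SOCIO_TECHNICAL = "socio-technical"

# Mechanisms named in the paper's figure captions, keyed by its P-numbering.
MECHANISMS = {
    "P1.1": (Layer.P1_INTRA_MODEL, "uncertainty transition along internal positions"),
    "P1.2": (Layer.P1_INTRA_MODEL, "uncertainty transformation into a new proxy"),
    "P1.3": (Layer.P1_INTRA_MODEL, "uncertainty-conditioned inference control"),
    "P2.1": (Layer.P2_SYSTEM_LEVEL, "uncertainty carried across workflow steps"),
    "P2.2": (Layer.P2_SYSTEM_LEVEL, "uncertainty-guided workflow control"),
    "P2.3": (Layer.P2_SYSTEM_LEVEL, "uncertainty re-expression across a boundary"),
    "P2.4": (Layer.P2_SYSTEM_LEVEL, "cross-run adaptation via persistent state"),
}

def layer_of(mechanism_id: str) -> Layer:
    """Return the taxonomy layer a mechanism belongs to."""
    return MECHANISMS[mechanism_id][0]
```

The paper's P3 mechanisms (human and organisational reuse of uncertain outputs) are not enumerated in the excerpted captions, so they are omitted here.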

If this is right

  • Early errors originating inside models can be carried forward through workflow stages and stored in persistent state, producing compounded downstream effects.
  • System-level mechanisms allow uncertainty to cross component boundaries inside compound applications.
  • Socio-technical mechanisms incorporate how humans and organisations receive and reuse uncertain outputs.
  • Cross-cutting engineering insights can inform practices for tracking and mitigating propagation.
  • Five named open research challenges must be addressed to achieve reliable governance of uncertainty in LLM systems.
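
One way to see why compounded downstream effects matter: under a toy independence assumption (not the paper's model, which allows richer dependencies across boundaries), per-stage reliability multiplies across workflow depth, so small per-stage uncertainty erodes end-to-end reliability geometrically:

```python
def end_to_end_reliability(p: float, k: int) -> float:
    """Probability a k-stage pipeline is error-free, assuming each stage
    succeeds independently with probability p (an illustrative toy model)."""
    return p ** k

# At p = 0.95: one stage is 0.95, five stages ≈ 0.774, ten stages ≈ 0.599.
for k in (1, 5, 10):
    print(k, round(end_to_end_reliability(0.95, k), 3))
```

The paper's P2 mechanisms describe exactly the couplings this toy model ignores, e.g. uncertainty-guided workflow control that can halt or reroute when a stage is uncertain.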

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same propagation lens could be applied to other multi-component AI systems such as agent frameworks or retrieval-augmented setups.
  • System designers might add explicit uncertainty provenance logs that follow signals across workflow and state boundaries.
  • Governance policies for AI could shift from focusing solely on model accuracy to monitoring flows through socio-technical layers.
  • Evaluation suites could include synthetic propagation scenarios to test whether a given architecture allows early errors to remain hidden.
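
The provenance-log idea in the second bullet could be sketched minimally as an append-only trace of uncertainty signals as they cross boundaries. All names here (`ProvenanceLog`, `UncertaintyEvent`, the example stages and signal types) are hypothetical illustrations, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class UncertaintyEvent:
    stage: str    # component or workflow stage that emitted the signal
    signal: str   # kind of proxy, e.g. "mean token entropy"
    value: float  # the signal's value at that boundary

@dataclass
class ProvenanceLog:
    """Append-only record of uncertainty signals crossing system boundaries."""
    events: list = field(default_factory=list)

    def record(self, stage: str, signal: str, value: float) -> None:
        self.events.append(UncertaintyEvent(stage, signal, value))

    def trace(self) -> list:
        """Ordered (stage, signal, value) tuples for audit or debugging."""
        return [(e.stage, e.signal, e.value) for e in self.events]

# Hypothetical RAG-style pipeline emitting one signal per boundary crossing.
log = ProvenanceLog()
log.record("retriever", "retrieval score margin", 0.42)
log.record("generator", "mean token entropy", 1.7)
log.record("verifier", "verbalised confidence", 0.65)
```

A real implementation would also need to record how each downstream signal was derived from upstream ones, which is where the paper's transformation and re-expression mechanisms (P1.2, P2.3) would enter.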

Load-bearing premise

Uncertainty in deployed LLM applications is routinely transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes in ways that require and benefit from a new principled systems-level treatment beyond single-model analysis.

What would settle it

A set of measurements or case studies of deployed LLM applications showing that uncertainty signals do not measurably transform or propagate across the described boundaries in ways that affect detection or governance.

Figures

Figures reproduced from arXiv: 2604.23505 by Boming Xia, Dino Sejdinovic, Erdun Gao, Liming Zhu, Minhui Xue, Qinghua Lu.

Figure 1. Illustrative example of uncertainty propagation in an LLM-based policy and compliance assistant.

Figure 2. Overview of the taxonomy of uncertainty propagation in LLM systems.

Figure 3. Schematic overview of intra-model uncertainty propagation in P1.

Figure 4. Uncertainty transition (P1.1). P1.1 covers within-request propagation in which an uncertainty signal is traced as it evolves along an ordered sequence of internal positions within a single model-facing request, such as generation steps, model depth, branches, or modules. The statement remains at r_int throughout; this distinguishes P1.1 from P1.2, where a signal is transformed into a new proxy.

Figure 5. Uncertainty transformation (P1.2). P1.2 covers within-request propagation in which an uncertainty signal is transformed into a proxy of a different type or scope within a single model-facing request.

Figure 6. Uncertainty-conditioned inference control (P1.3).

Figure 7. Schematic overview of system-level uncertainty propagation in P2.

Figure 8. Uncertainty carried across workflow steps.

Figure 9. Uncertainty-guided workflow control (P2.2).

Figure 10. Uncertainty re-expression (P2.3). P2.3 covers system-level propagation in which an uncertainty signal is re-expressed for consumption by a downstream technical component across a system boundary.

Figure 11. Cross-run adaptation (P2.4). P2.4 covers system-level propagation in which an uncertainty signal observed in one run is retained in persistent system state and later changes how subsequent runs proceed, including both deployed runtime pipelines and iterative training or alignment pipelines where uncertainty shapes what future executions inherit.

Figure 12. Schematic overview of socio-technical uncertainty propagation in P3.
read the original abstract

Uncertainty in large language model (LLM)-based systems is often studied at the level of a single model output, yet deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. This paper develops a systems-level account of uncertainty propagation. It introduces a conceptual framing for characterising propagated uncertainty signals, presents a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, synthesises cross-cutting engineering insights, and identifies five open research challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that uncertainty in LLM-based systems is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and socio-technical processes, necessitating a systems-level account beyond single-model analysis. It introduces a conceptual framing for 'propagated uncertainty signals' and a taxonomy of propagation mechanisms divided into intra-model (P1), system-level (P2), and socio-technical (P3) categories, followed by cross-cutting engineering insights and five open research challenges.

Significance. If the taxonomy holds, the work provides a useful organizing lens for an emerging area, synthesizing observations about compound LLM systems and directing attention to propagation across boundaries. Strengths include the clear distinction among the three levels and the explicit listing of open challenges to guide follow-on research. As a purely conceptual contribution without empirical validation, formal derivations, or data, its significance will depend on community adoption and subsequent testing of the proposed structure.

minor comments (2)
  1. The abstract refers to 'five open research challenges' without enumerating them; including a brief list would improve the summary's standalone value.
  2. The terms 'propagated uncertainty signals' and the P1/P2/P3 labels are central to the taxonomy; ensure they receive explicit, early definitions with concrete examples in the introduction or framing section to aid reader comprehension.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation of minor revision. We appreciate the recognition that the work offers a useful organizing lens through its three-level taxonomy and explicit open challenges.

Circularity Check

0 steps flagged

No significant circularity in conceptual taxonomy and framing

full rationale

The paper develops a systems-level conceptual framing and structured taxonomy for uncertainty propagation across intra-model (P1), system-level (P2), and socio-technical (P3) mechanisms, followed by engineering insights and open challenges. No equations, derivations, fitted parameters, or mathematical reductions appear in the manuscript. The contribution is a synthesis motivated by the observation that uncertainty crosses boundaries in compound LLM systems, with definitions, distinctions, and examples supplied directly rather than derived from prior self-citations or internal fits. This is a standard non-circular outcome for a taxonomy-style proposal whose central claim reduces only to coherent presentation of the framing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper is a conceptual synthesis that introduces new descriptive categories without quantitative fitting or new physical postulates.

axioms (1)
  • domain assumption Uncertainty in LLM-based systems is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes.
    This premise is stated directly in the abstract as the motivation for needing a systems-level account.
invented entities (2)
  • Propagated uncertainty signals no independent evidence
    purpose: Conceptual framing to characterise how uncertainty is carried and reused across boundaries.
    Introduced as the core new descriptive object in the systems-level account.
  • P1 intra-model, P2 system-level, P3 socio-technical propagation mechanisms no independent evidence
    purpose: Structured taxonomy to classify uncertainty propagation.
    New categorization presented as the main contribution.

pith-pipeline@v0.9.0 · 5430 in / 1395 out tokens · 30553 ms · 2026-05-08T06:06:01.134006+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

95 extracted references · 43 canonical work pages · 8 internal anchors

  1. [1]

    Llm-based agentic systems in medicine and healthcare.Nature Machine Intelligence, 6(12):1418–1420, 2024

    Jianing Qiu, Kyle Lam, Guohao Li, Amish Acharya, Tien Yin Wong, Ara Darzi, Wu Yuan, and Eric J Topol. Llm-based agentic systems in medicine and healthcare.Nature Machine Intelligence, 6(12):1418–1420, 2024

  2. [2]

    From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

    Jiayi Chen, Junyi Ye, and Guiling Wang. From standalone LLMs to integrated intelligence: A survey of compound AI systems. arXiv preprint arXiv:2506.04565, 2025

  3. [3]

    Engineering AI Systems: Architecture and DevOps Essentials

    Len Bass, Qinghua Lu, Ingo Weber, and Liming Zhu.Engineering AI systems: architecture and DevOps essentials. Addison-Wesley Professional, 2025

  4. [4]

    Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  5. [5]

    A review of prominent paradigms for llm-based agents: Tool use, planning (including rag), and feedback learning

    Xinzhe Li. A review of prominent paradigms for llm-based agents: Tool use, planning (including rag), and feedback learning. InProceedings of the 31st International Conference on Computational Linguistics, pages 9760–9779, 2025

  6. [6]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  7. [7]

    Agent-as-a-judge: Evaluate agents with agents

    Mingchen Zhuge, Changsheng Zhao, Dylan R Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. InForty-second International Conference on Machine Learning, 2025

  8. [8]

    SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. SoK: Agentic skills – beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867, 2026

  9. [9]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  10. [10]

    Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  11. [11]

    Where llm agents fail and how they can learn from failures,

    Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where llm agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370, 2025

  12. [12]

    Shielda: Structured handling of exceptions in llm-driven agentic workflows.arXiv preprint arXiv:2508.07935, 2025

    Jingwen Zhou, Jieshan Chen, Qinghua Lu, Dehai Zhao, and Liming Zhu. Shielda: Structured handling of exceptions in llm-driven agentic workflows.arXiv preprint arXiv:2508.07935, 2025

  13. [13]

    Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

    Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

  14. [14]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. InThe Twelfth International Conference on Learning Representations, 2024

  15. [15]

    Calibrated language models must hallucinate

    Adam Tauman Kalai and Santosh S Vempala. Calibrated language models must hallucinate. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 160–171, 2024

  16. [16]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  17. [17]

    Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025

    Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025

  18. [18]

    A survey of uncertainty estimation methods on large language models.arXiv preprint arXiv:2503.00172, 2025

    Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models.arXiv preprint arXiv:2503.00172, 2025

  19. [19]

    A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions.ACM Computing Surveys, 2025

    Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions.ACM Computing Surveys, 2025

  20. [20]

    Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions. arXiv preprint arXiv:2510.12040, 2025

    Sungmin Kang, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, and Salman Avestimehr. Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions. arXiv preprint arXiv:2510.12040, 2025

  21. [21]

    A survey of uncertainty estimation in llms: Theory meets practice,

    Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, and Yunfang Wu. A survey of uncertainty estimation in llms: Theory meets practice.arXiv preprint arXiv:2410.15326, 2024

  22. [22]

    Survey of uncertainty estimation in large language models-sources, methods, applications, and challenge

    Jianfeng He, Linlin Yu, Changbin Li, Runing Yang, Fanglan Chen, Kangshuo Li, Min Zhang, Shuo Lei, Xuchao Zhang, Mohammad Beigi, et al. Survey of uncertainty estimation in large language models-sources, methods, applications, and challenge. 2025

  23. [23]

    Uncertainty quantification and confidence calibration in large language models: A survey

    Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6107–6117, 2025

  24. [24]

    Comparing uncertainty measurement and mitigation methods for large language models: A systematic review.arXiv preprint arXiv:2504.18346, 2025

    Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, and Qingsong Wei. Comparing uncertainty measurement and mitigation methods for large language models: A systematic review.arXiv preprint arXiv:2504.18346, 2025

  25. [25]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2024

  26. [26]

    Llm-based agents for tool learning: A survey.Data Science and Engineering, pages 1–31, 2025

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey.Data Science and Engineering, pages 1–31, 2025

  27. [27]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, December 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, December 2024

  28. [28]

    Large language model based multi-agents: A survey of progress and challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, 2024

  29. [29]

    A survey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491–6501, 2024

  30. [30]

    Understanding the planning of LLM agents: A survey

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey.arXiv preprint arXiv:2402.02716, 2024

  31. [31]

    The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

  32. [32]

    Cognitive mirage: A review of hallucinations in large language models

    Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, and Weiqiang Jia. Cognitive mirage: A review of hallucinations in large language models.arXiv preprint arXiv:2309.06794, 2023

  33. [33]

    Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.Computational Linguistics, pages 1–46, 2025

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, and Yulong Chen. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.Computational Linguistics, pages 1–46, 2025

  34. [34]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans. Inf. Syst., 43(2), January 2025

  35. [35]

    Towards reliable large language models: A survey on hallucination detection

    Yao Pan, Linggang Kong, Jiaju Wu, Yonghui Yang, Hongfu Zuo, Ze Xiu, and Xiaodong Wang. Towards reliable large language models: A survey on hallucination detection. InInternational Conference on Intelligent Computing, pages 438–451. Springer, 2025

  36. [36]

    A comprehensive survey of hallucination mitigation techniques in large language models

    S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models.arXiv preprint arXiv:2401.01313, 2024

  37. [37]

    Verbalizing llm’s higher-order uncertainty via imprecise probabilities

    Anita Yang, Krikamol Muandet, Michele Caprio, Siu Lun Chau, and Masaki Adachi. Verbalizing llm’s higher-order uncertainty via imprecise probabilities. 2026

  38. [38]

    Unconditional truthfulness: Learning unconditional uncertainty of large language models

    Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing, Gleb Kuzmin, Ivan Lazichny, Alexander Panchenko, Preslav Nakov, Timothy Baldwin, Maxim Panov, and Artem Shelmanov. Unconditional truthfulness: Learning unconditional uncertainty of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35661–35...

  39. [39]

    Uncertainty-aware attention heads: Efficient unsupervised uncertainty quantification for LLMs

    Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, et al. Uncertainty-aware attention heads: Efficient unsupervised uncertainty quantification for llms.arXiv preprint arXiv:2505.20045, 2025

  40. [40]

    Uncertainty-aware contrastive decoding

    Hakyung Lee, Subeen Park, Joowang Kim, Sungjun Lim, and Kyungwoo Song. Uncertainty-aware contrastive decoding. InFindings of the Association for Computational Linguistics: ACL 2025, pages 26376–26391, 2025

  41. [41]

    Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

    Mikhail L Arbuzov, Alexey A Shvets, and Sisong Beir. Beyond exponential decay: Rethinking error accumulation in large language models.arXiv preprint arXiv:2505.24187, 2025

  42. [42]

    Learned hallucination detection in black-box LLMs using token-level entropy production rate,

    Charles Moslonka, Hicham Randrianarivo, Arthur Garnier, and Emmanuel Malherbe. Learned hallucination detection in black-box llms using token-level entropy production rate.arXiv preprint arXiv:2509.04492, 2025

  43. [43]

    Bottom-up policy optimization: Your language model policy secretly contains internal policies

    Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, and Kang Liu. Bottom-up policy optimization: Your language model policy secretly contains internal policies. arXiv preprint arXiv:2512.19673, 2025

  44. [44]

    Reppl: Recalibrating perplexity by uncertainty in semantic propagation and language generation for explainable qa hallucination detection.arXiv preprint arXiv:2505.15386, 2025

    Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Yi R Fung, and Xinlei He. Reppl: Recalibrating perplexity by uncertainty in semantic propagation and language generation for explainable qa hallucination detection.arXiv preprint arXiv:2505.15386, 2025

  45. [45]

    Numerical error analysis of large language models.arXiv preprint arXiv:2503.10251,

    Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, and Philipp Petersen. Numerical error analysis of large language models.arXiv preprint arXiv:2503.10251, 2025

  46. [46]

    Are language models aware of the road not taken? token-level uncertainty and hidden state dynamics.arXiv preprint arXiv:2511.04527, 2025

    Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, and Eric Bigelow. Are language models aware of the road not taken? token-level uncertainty and hidden state dynamics.arXiv preprint arXiv:2511.04527, 2025

  47. [47]

    Analysis of image-and-text uncertainty propagation in multimodal large language models with cardiac mr-based applications

    Yucheng Tang, Yunguan Fu, Weixi Yi, Yipei Wang, Daniel C Alexander, Rhodri Davies, and Yipeng Hu. Analysis of image-and-text uncertainty propagation in multimodal large language models with cardiac mr-based applications. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 36–45. Springer, 2025

  48. [48]

    Flue: Streamlined uncertainty estimation for large language models

    Shiqi Gao, Tianxiang Gong, Zijie Lin, Runhua Xu, Haoyi Zhou, and Jianxin Li. Flue: Streamlined uncertainty estimation for large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 16745–16753, 2025

  49. [49]

    Can llms detect their confabulations? estimating reliability in uncertainty-aware language models.arXiv preprint arXiv:2508.08139, 2025

    Tianyi Zhou, Johanne Medina, and Sanjay Chawla. Can llms detect their confabulations? estimating reliability in uncertainty-aware language models.arXiv preprint arXiv:2508.08139, 2025

  50. [50]

    CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought, June 2025

    Boxuan Zhang and Ruqi Zhang. CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought, June 2025

  51. [51]

    Halluentity: Benchmarking and understanding entity-level hallucination detection.Transactions on Machine Learning Research, 2025

    Min-Hsuan Yeh, Max Kamachee, Seongheon Park, and Yixuan Li. Halluentity: Benchmarking and understanding entity-level hallucination detection.Transactions on Machine Learning Research, 2025

  52. [52]

    Can linear probes measure LLM uncertainty?

    Ramzi Dakhmouche, Adrien Letellier, and Hossein Gorji. Can linear probes measure LLM uncertainty? In NeurIPS 2025 Workshop MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making, 2025

  53. [53]

    Can llms predict their own failures? self-awareness via internal circuits.arXiv preprint arXiv:2512.20578, 2025

    Amirhosein Ghasemabadi and Di Niu. Can llms predict their own failures? self-awareness via internal circuits. arXiv preprint arXiv:2512.20578, 2025

  54. [54]

    Uncertainty-aware reward model: Teaching reward models to know what is unknown

    Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, and Junge Zhang. Uncertainty-aware reward model: Teaching reward models to know what is unknown.arXiv preprint arXiv:2410.00847, 2024

  55. [55]

    Confident adaptive language modeling.Advances in Neural Information Processing Systems, 35:17456–17472, 2022

    Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling.Advances in Neural Information Processing Systems, 35:17456–17472, 2022

  56. [56]

    Tokur: Token-level uncertainty estimation for large language model reasoning

    Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, et al. Tokur: Token-level uncertainty estimation for large language model reasoning. In First Workshop on Foundations of Reasoning in Language Models, 2025

  57. [57]

    Deep think with confidence

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence.arXiv preprint arXiv:2508.15260, 2025

  58. [58]

    Zip-rc: Zero-overhead inference-time prediction of reward and cost for adaptive and interpretable generation. arXiv preprint arXiv:2512.01457

    Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, and Sergey Levine. Zero-overhead introspection for adaptive test-time compute.arXiv preprint arXiv:2512.01457, 2025

  59. [59]

    Mitigating token-level uncertainty in retrieval-augmented large language models.Authorea Preprints, 2024

    Liz Yarie, Dominic Soriano, Leonard Kaczmarek, Benjamin Wilkinson, and Eduardo Vasquez. Mitigating token-level uncertainty in retrieval-augmented large language models.Authorea Preprints, 2024

  60. [60]

    UProp: Investigating the uncertainty propagation of LLMs in multi-step agentic decision-making

    Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. UProp: Investigating the uncertainty propagation of LLMs in multi-step agentic decision-making. arXiv preprint arXiv:2506.17419, 2025

  61. [61]

    Qiwei Zhao, Dong Li, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Chen Zhao, et al. Uncertainty propagation on llm agent. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6064–6073, 2025

  62. [62]

    Jiaxin Zhang, Caiming Xiong, and Chien-Sheng Wu. Agentic confidence calibration. arXiv preprint arXiv:2601.15778, 2026

  63. [63]

    Ali Razghandi, Seyed Mohammad Hadi Hosseini, and Mahdieh Soleymani Baghshah. CER: confidence enhanced reasoning in llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 7918–7938. Association for Computational Linguistics, 2025

  64. [64]

    Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, and David Clifton. Geometric uncertainty for detecting and correcting hallucinations in llms. arXiv preprint arXiv:2509.13813, 2025

  65. [65]

    Ziwei Deng, Mian Deng, Chenjing Liang, Zeming Gao, Chennan Ma, Chenxing Lin, Haipeng Zhang, Songzhu Mei, Siqi Shen, and Cheng Wang. Planu: Large language model reasoning through planning under uncertainty. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  66. [66]

    Fei Yu, Yingru Li, and Benyou Wang. Robust search with uncertainty-aware value models for language model reasoning. arXiv preprint arXiv:2502.11155, 2025

  67. [67]

    Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, and Tom Rainforth. BED-LLM: Intelligent information gathering with llms and Bayesian experimental design. arXiv preprint arXiv:2508.21184, 2025

  68. [68]

    Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei W Koh, and Bryan Hooi. Uncertainty of thoughts: Uncertainty-aware planning enhances information seeking in llms. Advances in Neural Information Processing Systems, 37:24181–24215, 2024

  69. [69]

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Junqi Dai, Qinyuan Cheng, Xuan-Jing Huang, and Xipeng Qiu. Reasoning in flux: Enhancing large language models reasoning through uncertainty-aware adaptive guidance. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2401–2416, 2024

  70. [70]

    Michael J Zellinger and Matt Thomson. Rational tuning of llm cascades via probabilistic modeling. arXiv preprint arXiv:2501.09345, 2025

  71. [71]

    Shayan Kiyani, Sima Noorani, George Pappas, and Hamed Hassani. When to trust the cheap check: Weak and strong verification for reasoning. arXiv preprint arXiv:2602.17633, 2026

  72. [72]

    Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Tony QS Quek, and Seong-Lyun Kim. Uncertainty-aware hybrid inference with on-device small and remote large language models. In 2025 IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), pages 1–7. IEEE, 2025

  73. [73]

    Yuqi Zhu, Ge Li, Xue Jiang, Jia Li, Hong Mei, Zhi Jin, and Yihong Dong. Uncertainty-guided chain-of-thought for code generation with llms. arXiv preprint arXiv:2503.15341, 2025

  74. [74]

    Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners. In 7th Annual Conference on Robot Learning, 2023

  75. [75]

    Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A Rossi, and Dinesh Manocha. Structured uncertainty guided clarification for llm agents. arXiv preprint arXiv:2511.08798, 2025

  76. [76]

    Panagiotis Lymperopoulos and Vasanth Sarathy. Tools in the loop: Quantifying uncertainty of llm question answering systems that use tools. arXiv preprint arXiv:2505.16113, 2025

  77. [77]

    Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, and Kangwook Lee. How to correctly report llm-as-a-judge evaluations. arXiv preprint arXiv:2511.21140, 2025

  78. [78]

    Yingqing Yuan, Linwei Tao, Haohui Lu, Matloob Khushi, Imran Razzak, Mark Dras, Jian Yang, and Usman Naseem. Kg-uq: Knowledge graph-based uncertainty quantification for long text in large language models. In Companion Proceedings of the ACM on Web Conference 2025, pages 2071–2077, 2025

  79. [79]

    Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, and Zheng Feng. Enhancing uncertainty modeling with semantic graph for hallucination detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23586–23594, 2025

  80. [80]

    Aliakbar Nafar, Kristen Brent Venable, and Parisa Kordjamshidi. Reasoning over uncertain text by generative large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24911–24920, 2025

Showing first 80 references.