pith. machine review for the scientific record.

arxiv: 2604.23505 · v1 · submitted 2026-04-26 · 💻 cs.SE · cs.AI


Uncertainty Propagation in LLM-Based Systems


Pith reviewed 2026-05-08 06:06 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords uncertainty propagation · LLM-based systems · systems taxonomy · socio-technical systems · error compounding · propagation mechanisms · large language models · systems engineering

The pith

Uncertainty in LLM-based systems propagates and compounds across model internals, workflows, components, state, and human processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that uncertainty is typically examined only at the output of a single model, yet real LLM applications are compound systems where uncertainty gets transformed and reused across many boundaries. It supplies a conceptual framing to describe these propagated signals and a taxonomy of mechanisms operating at intra-model, system-level, and socio-technical layers. A reader would care because the absence of such treatment allows early errors to spread in ways that are hard to detect or control. The authors also extract engineering insights from the taxonomy and name five open research challenges.

Core claim

Deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. The paper develops a systems-level account by introducing a conceptual framing for characterising propagated uncertainty signals and presenting a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, while synthesising cross-cutting engineering insights and identifying five open research challenges.

What carries the argument

A structured taxonomy of uncertainty propagation mechanisms divided into intra-model (P1), system-level (P2), and socio-technical (P3) categories, supported by a conceptual framing for characterising propagated uncertainty signals.
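
The P1/P2/P3 structure can be sketched as a small lookup table, using only mechanism labels that appear in the paper's figure captions. The `Layer` enum, the `MECHANISMS` dict, and the short descriptions are illustrative paraphrases, not the paper's own notation:

```python
from enum import Enum

class Layer(Enum):
    """Top-level layers of the propagation taxonomy (P1/P2/P3)."""
    P1_INTRA_MODEL = "intra-model"
    P2_SYSTEM_LEVEL = "system-level"
    P3_SOCIO_TECHNICAL = "socio-technical"

# Mechanisms named in the paper's figure captions, keyed by its P-numbering.
MECHANISMS = {
    "P1.1": (Layer.P1_INTRA_MODEL, "uncertainty transition along internal positions"),
    "P1.2": (Layer.P1_INTRA_MODEL, "uncertainty transformation into a new proxy"),
    "P1.3": (Layer.P1_INTRA_MODEL, "uncertainty-conditioned inference control"),
    "P2.1": (Layer.P2_SYSTEM_LEVEL, "uncertainty carried across workflow steps"),
    "P2.2": (Layer.P2_SYSTEM_LEVEL, "uncertainty-guided workflow control"),
    "P2.3": (Layer.P2_SYSTEM_LEVEL, "uncertainty re-expression across a boundary"),
    "P2.4": (Layer.P2_SYSTEM_LEVEL, "cross-run adaptation via persistent state"),
}

def layer_of(mechanism_id: str) -> Layer:
    """Return the taxonomy layer a mechanism belongs to."""
    return MECHANISMS[mechanism_id][0]
```

The paper's P3 mechanisms (human and organisational reuse of uncertain outputs) are not enumerated in the excerpted captions, so they are omitted here.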

If this is right

  • Early errors originating inside models can be carried forward through workflow stages and stored in persistent state, producing compounded downstream effects.
  • System-level mechanisms allow uncertainty to cross component boundaries inside compound applications.
  • Socio-technical mechanisms incorporate how humans and organisations receive and reuse uncertain outputs.
  • Cross-cutting engineering insights can inform practices for tracking and mitigating propagation.
  • Five named open research challenges must be addressed to achieve reliable governance of uncertainty in LLM systems.
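
One way to see why compounded downstream effects matter: under a toy independence assumption (not the paper's model, which allows richer dependencies across boundaries), per-stage reliability multiplies across workflow depth, so small per-stage uncertainty erodes end-to-end reliability geometrically:

```python
def end_to_end_reliability(p: float, k: int) -> float:
    """Probability a k-stage pipeline is error-free, assuming each stage
    succeeds independently with probability p (an illustrative toy model)."""
    return p ** k

# At p = 0.95: one stage is 0.95, five stages ≈ 0.774, ten stages ≈ 0.599.
for k in (1, 5, 10):
    print(k, round(end_to_end_reliability(0.95, k), 3))
```

The paper's P2 mechanisms describe exactly the couplings this toy model ignores, e.g. uncertainty-guided workflow control that can halt or reroute when a stage is uncertain.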

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same propagation lens could be applied to other multi-component AI systems such as agent frameworks or retrieval-augmented setups.
  • System designers might add explicit uncertainty provenance logs that follow signals across workflow and state boundaries.
  • Governance policies for AI could shift from focusing solely on model accuracy to monitoring flows through socio-technical layers.
  • Evaluation suites could include synthetic propagation scenarios to test whether a given architecture allows early errors to remain hidden.
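
The provenance-log idea in the second bullet could be sketched minimally as an append-only trace of uncertainty signals as they cross boundaries. All names here (`ProvenanceLog`, `UncertaintyEvent`, the example stages and signal types) are hypothetical illustrations, not from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class UncertaintyEvent:
    stage: str    # component or workflow stage that emitted the signal
    signal: str   # kind of proxy, e.g. "mean token entropy"
    value: float  # the signal's value at that boundary

@dataclass
class ProvenanceLog:
    """Append-only record of uncertainty signals crossing system boundaries."""
    events: list = field(default_factory=list)

    def record(self, stage: str, signal: str, value: float) -> None:
        self.events.append(UncertaintyEvent(stage, signal, value))

    def trace(self) -> list:
        """Ordered (stage, signal, value) tuples for audit or debugging."""
        return [(e.stage, e.signal, e.value) for e in self.events]

# Hypothetical RAG-style pipeline emitting one signal per boundary crossing.
log = ProvenanceLog()
log.record("retriever", "retrieval score margin", 0.42)
log.record("generator", "mean token entropy", 1.7)
log.record("verifier", "verbalised confidence", 0.65)
```

A real implementation would also need to record how each downstream signal was derived from upstream ones, which is where the paper's transformation and re-expression mechanisms (P1.2, P2.3) would enter.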

Load-bearing premise

Uncertainty in deployed LLM applications is routinely transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes in ways that require and benefit from a new principled systems-level treatment beyond single-model analysis.

What would settle it

A set of measurements or case studies of deployed LLM applications showing that uncertainty signals do not measurably transform or propagate across the described boundaries in ways that affect detection or governance.

Figures

Figures reproduced from arXiv: 2604.23505 by Boming Xia, Dino Sejdinovic, Erdun Gao, Liming Zhu, Minhui Xue, Qinghua Lu.

Figure 1. Illustrative example of uncertainty propagation in an LLM-based policy and compliance assistant.

Figure 2. Overview of the taxonomy of uncertainty propagation in LLM systems.

Figure 3. Schematic overview of intra-model uncertainty propagation in P1.

Figure 4. Uncertainty transition (P1.1). P1.1 covers within-request propagation in which an uncertainty signal is traced as it evolves along an ordered sequence of internal positions within a single model-facing request, such as generation steps, model depth, branches, or modules. The statement remains at r_int throughout; this distinguishes P1.1 from P1.2, where a signal is transformed into a new proxy.

Figure 5. Uncertainty transformation (P1.2). P1.2 covers within-request propagation in which an uncertainty signal is transformed into a proxy of a different type or scope within a single model-facing request.

Figure 6. Uncertainty-conditioned inference control (P1.3).

Figure 7. Schematic overview of system-level uncertainty propagation in P2.

Figure 8. Uncertainty carried across workflow steps.

Figure 9. Uncertainty-guided workflow control (P2.2).

Figure 10. Uncertainty re-expression (P2.3). P2.3 covers system-level propagation in which an uncertainty signal is re-expressed for consumption by a downstream technical component across a system boundary.

Figure 11. Cross-run adaptation (P2.4). P2.4 covers system-level propagation in which an uncertainty signal observed in one run is retained in persistent system state and later changes how subsequent runs proceed, including both deployed runtime pipelines and iterative training or alignment pipelines where uncertainty shapes what future executions inherit.

Figure 12. Schematic overview of socio-technical uncertainty propagation in P3.
read the original abstract

Uncertainty in large language model (LLM)-based systems is often studied at the level of a single model output, yet deployed LLM applications are compound systems in which uncertainty is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes. Without principled treatment of how uncertainty is carried and reused across these boundaries, early errors can propagate and compound in ways that are difficult to detect and govern. This paper develops a systems-level account of uncertainty propagation. It introduces a conceptual framing for characterising propagated uncertainty signals, presents a structured taxonomy spanning intra-model (P1), system-level (P2), and socio-technical (P3) propagation mechanisms, synthesises cross-cutting engineering insights, and identifies five open research challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that uncertainty in LLM-based systems is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and socio-technical processes, necessitating a systems-level account beyond single-model analysis. It introduces a conceptual framing for 'propagated uncertainty signals' and a taxonomy of propagation mechanisms divided into intra-model (P1), system-level (P2), and socio-technical (P3) categories, followed by cross-cutting engineering insights and five open research challenges.

Significance. If the taxonomy holds, the work provides a useful organizing lens for an emerging area, synthesizing observations about compound LLM systems and directing attention to propagation across boundaries. Strengths include the clear distinction among the three levels and the explicit listing of open challenges to guide follow-on research. As a purely conceptual contribution without empirical validation, formal derivations, or data, its significance will depend on community adoption and subsequent testing of the proposed structure.

minor comments (2)
  1. The abstract refers to 'five open research challenges' without enumerating them; including a brief list would improve the summary's standalone value.
  2. The terms 'propagated uncertainty signals' and the P1/P2/P3 labels are central to the taxonomy; ensure they receive explicit, early definitions with concrete examples in the introduction or framing section to aid reader comprehension.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation of minor revision. We appreciate the recognition that the work offers a useful organizing lens through its three-level taxonomy and explicit open challenges.

Circularity Check

0 steps flagged

No significant circularity in conceptual taxonomy and framing

full rationale

The paper develops a systems-level conceptual framing and structured taxonomy for uncertainty propagation across intra-model (P1), system-level (P2), and socio-technical (P3) mechanisms, followed by engineering insights and open challenges. No equations, derivations, fitted parameters, or mathematical reductions appear in the manuscript. The contribution is a synthesis motivated by the observation that uncertainty crosses boundaries in compound LLM systems, with definitions, distinctions, and examples supplied directly rather than derived from prior self-citations or internal fits. This is a standard non-circular outcome for a taxonomy-style proposal whose central claim reduces only to coherent presentation of the framing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The paper is a conceptual synthesis that introduces new descriptive categories without quantitative fitting or new physical postulates.

axioms (1)
  • domain assumption Uncertainty in LLM-based systems is transformed and reused across model internals, workflow stages, component boundaries, persistent state, and human or organisational processes.
    This premise is stated directly in the abstract as the motivation for needing a systems-level account.
invented entities (2)
  • Propagated uncertainty signals no independent evidence
    purpose: Conceptual framing to characterise how uncertainty is carried and reused across boundaries.
    Introduced as the core new descriptive object in the systems-level account.
  • P1 intra-model, P2 system-level, P3 socio-technical propagation mechanisms no independent evidence
    purpose: Structured taxonomy to classify uncertainty propagation.
    New categorization presented as the main contribution.

pith-pipeline@v0.9.0 · 5430 in / 1395 out tokens · 30553 ms · 2026-05-08T06:06:01.134006+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

95 extracted references · 43 canonical work pages · 8 internal anchors

  1. [1]

    Llm-based agentic systems in medicine and healthcare.Nature Machine Intelligence, 6(12):1418–1420, 2024

    Jianing Qiu, Kyle Lam, Guohao Li, Amish Acharya, Tien Yin Wong, Ara Darzi, Wu Yuan, and Eric J Topol. Llm-based agentic systems in medicine and healthcare.Nature Machine Intelligence, 6(12):1418–1420, 2024

  2. [2]

    From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

    Jiayi Chen, Junyi Ye, and Guiling Wang. From standalone LLMs to integrated intelligence: A survey of compound AI systems. arXiv preprint arXiv:2506.04565, 2025

  3. [3]

    Engineering AI Systems: Architecture and DevOps Essentials

    Len Bass, Qinghua Lu, Ingo Weber, and Liming Zhu.Engineering AI systems: architecture and DevOps essentials. Addison-Wesley Professional, 2025

  4. [4]

    Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  5. [5]

    A review of prominent paradigms for llm-based agents: Tool use, planning (including rag), and feedback learning

    Xinzhe Li. A review of prominent paradigms for llm-based agents: Tool use, planning (including rag), and feedback learning. InProceedings of the 31st International Conference on Computational Linguistics, pages 9760–9779, 2025

  6. [6]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

  7. [7]

    Agent-as-a-judge: Evaluate agents with agents

    Mingchen Zhuge, Changsheng Zhao, Dylan R Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, et al. Agent-as-a-judge: Evaluate agents with agents. InForty-second International Conference on Machine Learning, 2025

  8. [8]

    SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    Yanna Jiang, Delong Li, Haiyu Deng, Baihe Ma, Xu Wang, Qin Wang, and Guangsheng Yu. SoK: Agentic skills – beyond tool use in LLM agents. arXiv preprint arXiv:2602.20867, 2026

  9. [9]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InThe eleventh international conference on learning representations, 2022

  10. [10]

    Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in Neural Information Processing Systems, 36:8634–8652, 2023

  11. [11]

    Where llm agents fail and how they can learn from failures,

    Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, et al. Where llm agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370, 2025

  12. [12]

    Shielda: Structured handling of exceptions in llm-driven agentic workflows.arXiv preprint arXiv:2508.07935, 2025

    Jingwen Zhou, Jieshan Chen, Qinghua Lu, Dehai Zhao, and Liming Zhu. Shielda: Structured handling of exceptions in llm-driven agentic workflows.arXiv preprint arXiv:2508.07935, 2025

  13. [13]

    Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

    Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems?arXiv preprint arXiv:2509.03312, 2025

  14. [14]

    Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. InThe Twelfth International Conference on Learning Representations, 2024

  15. [15]

    Calibrated language models must hallucinate

    Adam Tauman Kalai and Santosh S Vempala. Calibrated language models must hallucinate. InProceedings of the 56th Annual ACM Symposium on Theory of Computing, pages 160–171, 2024

  16. [16]

    Why Language Models Hallucinate

    Adam Tauman Kalai, Ofir Nachum, Santosh S Vempala, and Edwin Zhang. Why language models hallucinate. arXiv preprint arXiv:2509.04664, 2025

  17. [17]

    Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025

    Prateek Chhikara. Mind the confidence gap: Overconfidence, calibration, and distractor effects in large language models.Transactions on Machine Learning Research, 2025

  18. [18]

    A survey of uncertainty estimation methods on large language models.arXiv preprint arXiv:2503.00172, 2025

    Zhiqiu Xia, Jinxuan Xu, Yuqian Zhang, and Hang Liu. A survey of uncertainty estimation methods on large language models.arXiv preprint arXiv:2503.00172, 2025

  19. [19]

    A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions.ACM Computing Surveys, 2025

    Ola Shorinwa, Zhiting Mei, Justin Lidard, Allen Z Ren, and Anirudha Majumdar. A survey on uncertainty quantification of large language models: Taxonomy, open research challenges, and future directions.ACM Computing Surveys, 2025

  20. [20]

    Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions. arXiv preprint arXiv:2510.12040, 2025

    Sungmin Kang, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, and Salman Avestimehr. Uncertainty quantification for hallucination detection in large language models: Foundations, methodology, and future directions. arXiv preprint arXiv:2510.12040, 2025

  21. [21]

    A survey of uncertainty estimation in llms: Theory meets practice,

    Hsiu-Yuan Huang, Yutong Yang, Zhaoxi Zhang, Sanwoo Lee, and Yunfang Wu. A survey of uncertainty estimation in llms: Theory meets practice.arXiv preprint arXiv:2410.15326, 2024

  22. [22]

    Survey of uncertainty estimation in large language models-sources, methods, applications, and challenge

    Jianfeng He, Linlin Yu, Changbin Li, Runing Yang, Fanglan Chen, Kangshuo Li, Min Zhang, Shuo Lei, Xuchao Zhang, Mohammad Beigi, et al. Survey of uncertainty estimation in large language models-sources, methods, applications, and challenge. 2025

  23. [23]

    Uncertainty quantification and confidence calibration in large language models: A survey

    Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. Uncertainty quantification and confidence calibration in large language models: A survey. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6107–6117, 2025

  24. [24]

    Comparing uncertainty measurement and mitigation methods for large language models: A systematic review.arXiv preprint arXiv:2504.18346, 2025

    Toghrul Abbasli, Kentaroh Toyoda, Yuan Wang, Leon Witt, Muhammad Asif Ali, Yukai Miao, Dan Li, and Qingsong Wei. Comparing uncertainty measurement and mitigation methods for large language models: A systematic review.arXiv preprint arXiv:2504.18346, 2025

  25. [25]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2024

  26. [26]

    Llm-based agents for tool learning: A survey.Data Science and Engineering, pages 1–31, 2025

    Weikai Xu, Chengrui Huang, Shen Gao, and Shuo Shang. Llm-based agents for tool learning: A survey.Data Science and Engineering, pages 1–31, 2025

  27. [27]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, December 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, December 2024

  28. [28]

    Large language model based multi-agents: A survey of progress and challenges

    Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V. Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, 2024

  29. [29]

    A survey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A survey on rag meeting llms: Towards retrieval-augmented large language models. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491–6501, 2024

  30. [30]

    Understanding the planning of LLM agents: A survey

    Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of llm agents: A survey.arXiv preprint arXiv:2402.02716, 2024

  31. [31]

    The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

    Zhiheng Xi, Wenxiang Chen, Xin Guo, Wei He, Yiwen Ding, Boyang Hong, Ming Zhang, Junzhe Wang, Senjie Jin, Enyu Zhou, et al. The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

  32. [32]

    Cognitive mirage: A review of hallucinations in large language models

    Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, and Weiqiang Jia. Cognitive mirage: A review of hallucinations in large language models.arXiv preprint arXiv:2309.06794, 2023

  33. [33]

    Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.Computational Linguistics, pages 1–46, 2025

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, and Yulong Chen. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models.Computational Linguistics, pages 1–46, 2025

  34. [34]

    A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans. Inf. Syst., 43(2), January 2025

  35. [35]

    Towards reliable large language models: A survey on hallucination detection

    Yao Pan, Linggang Kong, Jiaju Wu, Yonghui Yang, Hongfu Zuo, Ze Xiu, and Xiaodong Wang. Towards reliable large language models: A survey on hallucination detection. InInternational Conference on Intelligent Computing, pages 438–451. Springer, 2025

  36. [36]

    A comprehensive survey of hallucination mitigation techniques in large language models

    S. M Towhidul Islam Tonmoy, S M Mehedi Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models.arXiv preprint arXiv:2401.01313, 2024

  37. [37]

    Verbalizing llm’s higher-order uncertainty via imprecise probabilities

    Anita Yang, Krikamol Muandet, Michele Caprio, Siu Lun Chau, and Masaki Adachi. Verbalizing llm’s higher-order uncertainty via imprecise probabilities. 2026

  38. [38]

    Unconditional truthfulness: Learning unconditional uncertainty of large language models

    Artem Vazhentsev, Ekaterina Fadeeva, Rui Xing, Gleb Kuzmin, Ivan Lazichny, Alexander Panchenko, Preslav Nakov, Timothy Baldwin, Maxim Panov, and Artem Shelmanov. Unconditional truthfulness: Learning unconditional uncertainty of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35661–35...

  39. [39]

    Uncertainty-aware attention heads: Efficient unsupervised uncertainty quantification for LLMs

    Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Timothy Baldwin, Mrinmaya Sachan, Preslav Nakov, et al. Uncertainty-aware attention heads: Efficient unsupervised uncertainty quantification for llms.arXiv preprint arXiv:2505.20045, 2025

  40. [40]

    Uncertainty-aware contrastive decoding

    Hakyung Lee, Subeen Park, Joowang Kim, Sungjun Lim, and Kyungwoo Song. Uncertainty-aware contrastive decoding. InFindings of the Association for Computational Linguistics: ACL 2025, pages 26376–26391, 2025

  41. [41]

    Beyond Exponential Decay: Rethinking Error Accumulation in Large Language Models

    Mikhail L Arbuzov, Alexey A Shvets, and Sisong Beir. Beyond exponential decay: Rethinking error accumulation in large language models.arXiv preprint arXiv:2505.24187, 2025

  42. [42]

    Learned hallucination detection in black-box LLMs using token-level entropy production rate,

    Charles Moslonka, Hicham Randrianarivo, Arthur Garnier, and Emmanuel Malherbe. Learned hallucination detection in black-box llms using token-level entropy production rate.arXiv preprint arXiv:2509.04492, 2025

  43. [43]

    Bottom-up policy optimization: Your language model policy secretly contains internal policies

    Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, and Kang Liu. Bottom-up policy optimization: Your language model policy secretly contains internal policies. arXiv preprint arXiv:2512.19673, 2025

  44. [44]

    Reppl: Recalibrating perplexity by uncertainty in semantic propagation and language generation for explainable qa hallucination detection.arXiv preprint arXiv:2505.15386, 2025

    Yiming Huang, Junyan Zhang, Zihao Wang, Biquan Bie, Yunzhong Qiu, Yi R Fung, and Xinlei He. Reppl: Recalibrating perplexity by uncertainty in semantic propagation and language generation for explainable qa hallucination detection.arXiv preprint arXiv:2505.15386, 2025

  45. [45]

    Numerical error analysis of large language models.arXiv preprint arXiv:2503.10251,

    Stanislav Budzinskiy, Wenyi Fang, Longbin Zeng, and Philipp Petersen. Numerical error analysis of large language models.arXiv preprint arXiv:2503.10251, 2025

  46. [46]

    Are language models aware of the road not taken? token-level uncertainty and hidden state dynamics.arXiv preprint arXiv:2511.04527, 2025

    Amir Zur, Atticus Geiger, Ekdeep Singh Lubana, and Eric Bigelow. Are language models aware of the road not taken? token-level uncertainty and hidden state dynamics.arXiv preprint arXiv:2511.04527, 2025

  47. [47]

    Analysis of image-and-text uncertainty propagation in multimodal large language models with cardiac mr-based applications

    Yucheng Tang, Yunguan Fu, Weixi Yi, Yipei Wang, Daniel C Alexander, Rhodri Davies, and Yipeng Hu. Analysis of image-and-text uncertainty propagation in multimodal large language models with cardiac mr-based applications. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 36–45. Springer, 2025

  48. [48]

    Flue: Streamlined uncertainty estimation for large language models

    Shiqi Gao, Tianxiang Gong, Zijie Lin, Runhua Xu, Haoyi Zhou, and Jianxin Li. Flue: Streamlined uncertainty estimation for large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 16745–16753, 2025

  49. [49]

    Can llms detect their confabulations? estimating reliability in uncertainty-aware language models.arXiv preprint arXiv:2508.08139, 2025

    Tianyi Zhou, Johanne Medina, and Sanjay Chawla. Can llms detect their confabulations? estimating reliability in uncertainty-aware language models.arXiv preprint arXiv:2508.08139, 2025

  50. [50]

    CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought, June 2025

    Boxuan Zhang and Ruqi Zhang. CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought, June 2025

  51. [51]

    Halluentity: Benchmarking and understanding entity-level hallucination detection.Transactions on Machine Learning Research, 2025

    Min-Hsuan Yeh, Max Kamachee, Seongheon Park, and Yixuan Li. Halluentity: Benchmarking and understanding entity-level hallucination detection.Transactions on Machine Learning Research, 2025

  52. [52]

    Can linear probes measure LLM uncertainty?

    Ramzi Dakhmouche, Adrien Letellier, and Hossein Gorji. Can linear probes measure LLM uncertainty? In NeurIPS 2025 Workshop MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making, 2025

  53. [53]

    Can llms predict their own failures? self-awareness via internal circuits.arXiv preprint arXiv:2512.20578, 2025

    Amirhosein Ghasemabadi and Di Niu. Can llms predict their own failures? self-awareness via internal circuits. arXiv preprint arXiv:2512.20578, 2025

  54. [54]

    Uncertainty-aware reward model: Teaching reward models to know what is unknown

    Xingzhou Lou, Dong Yan, Wei Shen, Yuzi Yan, Jian Xie, and Junge Zhang. Uncertainty-aware reward model: Teaching reward models to know what is unknown.arXiv preprint arXiv:2410.00847, 2024

  55. [55]

    Confident adaptive language modeling.Advances in Neural Information Processing Systems, 35:17456–17472, 2022

    Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling.Advances in Neural Information Processing Systems, 35:17456–17472, 2022

  56. [56]

    Tokur: Token-level uncertainty estimation for large language model reasoning

    Tunyu Zhang, Haizhou Shi, Yibin Wang, Hengyi Wang, Xiaoxiao He, Zhuowei Li, Haoxian Chen, Ligong Han, Kai Xu, Huan Zhang, et al. Tokur: Token-level uncertainty estimation for large language model reasoning. In First Workshop on Foundations of Reasoning in Language Models, 2025

  57. [57]

    Deep think with confidence

    Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence.arXiv preprint arXiv:2508.15260, 2025

  58. [58]

    Zip-rc: Zero-overhead inference-time prediction of reward and cost for adaptive and interpretable generation. arXiv preprint arXiv:2512.01457

    Rohin Manvi, Joey Hong, Tim Seyde, Maxime Labonne, Mathias Lechner, and Sergey Levine. Zero-overhead introspection for adaptive test-time compute.arXiv preprint arXiv:2512.01457, 2025

  59. [59]

    Mitigating token-level uncertainty in retrieval-augmented large language models.Authorea Preprints, 2024

    Liz Yarie, Dominic Soriano, Leonard Kaczmarek, Benjamin Wilkinson, and Eduardo Vasquez. Mitigating token-level uncertainty in retrieval-augmented large language models.Authorea Preprints, 2024

  60. [60]

    UProp: Investigating the uncertainty propagation of LLMs in multi-step agentic decision-making

    Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. UProp: Investigating the uncertainty propagation of LLMs in multi-step agentic decision-making. arXiv preprint arXiv:2506.17419, 2025

  61. [61]

    Qiwei Zhao, Dong Li, Yanchi Liu, Wei Cheng, Yiyou Sun, Mika Oishi, Takao Osaki, Katsushi Matsuda, Huaxiu Yao, Chen Zhao, et al. Uncertainty propagation on llm agent. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6064–6073, 2025

  62. [62]

    Jiaxin Zhang, Caiming Xiong, and Chien-Sheng Wu. Agentic confidence calibration. arXiv preprint arXiv:2601.15778, 2026

  63. [63]

    Ali Razghandi, Seyed Mohammad Hadi Hosseini, and Mahdieh Soleymani Baghshah. CER: confidence enhanced reasoning in llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025, pages 7918–7938. Association for Computational Linguistics, 2025

  64. [64]

    Edward Phillips, Sean Wu, Soheila Molaei, Danielle Belgrave, Anshul Thakur, and David Clifton. Geometric uncertainty for detecting and correcting hallucinations in llms. arXiv preprint arXiv:2509.13813, 2025

  65. [65]

    Ziwei Deng, Mian Deng, Chenjing Liang, Zeming Gao, Chennan Ma, Chenxing Lin, Haipeng Zhang, Songzhu Mei, Siqi Shen, and Cheng Wang. Planu: Large language model reasoning through planning under uncertainty. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  66. [66]

    Fei Yu, Yingru Li, and Benyou Wang. Robust search with uncertainty-aware value models for language model reasoning. arXiv preprint arXiv:2502.11155, 2025

  67. [67]

    Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, and Tom Rainforth. BED-LLM: Intelligent information gathering with llms and Bayesian experimental design. arXiv preprint arXiv:2508.21184, 2025

  68. [68]

    Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei W Koh, and Bryan Hooi. Uncertainty of thoughts: Uncertainty-aware planning enhances information seeking in llms. Advances in Neural Information Processing Systems, 37:24181–24215, 2024

  69. [69]

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Junqi Dai, Qinyuan Cheng, Xuan-Jing Huang, and Xipeng Qiu. Reasoning in flux: Enhancing large language models reasoning through uncertainty-aware adaptive guidance. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2401–2416, 2024

  70. [70]

    Michael J Zellinger and Matt Thomson. Rational tuning of llm cascades via probabilistic modeling. arXiv preprint arXiv:2501.09345, 2025

  71. [71]

    Shayan Kiyani, Sima Noorani, George Pappas, and Hamed Hassani. When to trust the cheap check: Weak and strong verification for reasoning. arXiv preprint arXiv:2602.17633, 2026

  72. [72]

    Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Tony QS Quek, and Seong-Lyun Kim. Uncertainty-aware hybrid inference with on-device small and remote large language models. In 2025 IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), pages 1–7. IEEE, 2025

  73. [73]

    Yuqi Zhu, Ge Li, Xue Jiang, Jia Li, Hong Mei, Zhi Jin, and Yihong Dong. Uncertainty-guided chain-of-thought for code generation with llms. arXiv preprint arXiv:2503.15341, 2025

  74. [74]

    Allen Z Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, Fei Xia, Jake Varley, et al. Robots that ask for help: Uncertainty alignment for large language model planners. In 7th Annual Conference on Robot Learning, 2023

  75. [75]

    Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A Rossi, and Dinesh Manocha. Structured uncertainty guided clarification for llm agents. arXiv preprint arXiv:2511.08798, 2025

  76. [76]

    Panagiotis Lymperopoulos and Vasanth Sarathy. Tools in the loop: Quantifying uncertainty of llm question answering systems that use tools. arXiv preprint arXiv:2505.16113, 2025

  77. [77]

    Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, and Kangwook Lee. How to correctly report llm-as-a-judge evaluations. arXiv preprint arXiv:2511.21140, 2025

  78. [78]

    Yingqing Yuan, Linwei Tao, Haohui Lu, Matloob Khushi, Imran Razzak, Mark Dras, Jian Yang, and Usman Naseem. Kg-uq: Knowledge graph-based uncertainty quantification for long text in large language models. In Companion Proceedings of the ACM on Web Conference 2025, pages 2071–2077, 2025

  79. [79]

    Kedi Chen, Qin Chen, Jie Zhou, Xinqi Tao, Bowen Ding, Jingwen Xie, Mingchen Xie, Peilong Li, and Zheng Feng. Enhancing uncertainty modeling with semantic graph for hallucination detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 23586–23594, 2025

  80. [80]

    Aliakbar Nafar, Kristen Brent Venable, and Parisa Kordjamshidi. Reasoning over uncertain text by generative large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24911–24920, 2025

Showing first 80 references.