pith. machine review for the scientific record. sign in

arxiv: 2604.15109 · v2 · submitted 2026-04-16 · 💻 cs.CL · cs.AI· cs.LG

Recognition: unknown

IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 11:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords uncertainty quantificationlarge language modelslong-form generationinterrogative paradigmconsistencyfaithfulnessclaim-level uncertainty
0
0 comments X

The pith

IUQ quantifies uncertainty in long-form LLM outputs using an interrogate-then-respond process that checks consistency across samples and faithfulness within each response.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Interrogative Uncertainty Quantification to address the difficulty of measuring uncertainty when LLMs generate extended, free-form text that can be coherent yet factually wrong. It proposes an interrogate-then-respond paradigm that elicits multiple answers to evaluate inter-sample consistency for overall uncertainty and intra-sample faithfulness for individual claims. Experimental tests on standard long-form datasets show this approach outperforms existing methods across various model families and sizes. A sympathetic reader would care because reliable uncertainty signals could help users trust or verify extended LLM responses in applications like summarization or question answering.

Core claim

IUQ measures claim-level uncertainty and model faithfulness in long-form generations by first interrogating the model to produce multiple responses and then assessing inter-sample consistency together with intra-sample faithfulness.

What carries the argument

The interrogate-then-respond paradigm, which generates multiple answers to a query and derives uncertainty from consistency across them and faithfulness within each.

If this is right

  • Claim-level uncertainty scores become available for long outputs instead of only token-level or short-answer measures.
  • The same interrogate-then-respond process can flag specific unfaithful claims inside a single extended response.
  • Performance gains hold across model families and sizes on existing long-form benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Users could route uncertain claims to external verification tools or human review while accepting confident parts of the output.
  • The method might extend to iterative self-correction loops that ask follow-up questions only on high-uncertainty claims.
  • Similar interrogation patterns could be tested on other open-ended generation tasks such as code or story writing.

Load-bearing premise

That consistency across multiple model responses and faithfulness within each response reliably indicate the true uncertainty and factual accuracy of long-form generations.

What would settle it

A test set of long-form generations where high IUQ scores align with independent human or fact-checking verification of accuracy, or where low scores flag known hallucinations.

Figures

Figures reproduced from arXiv: 2604.15109 by Haozhi Fan, Jinhao Duan, Kaidi Xu.

Figure 1
Figure 1. Figure 1: An example of LLM generation on biography. The [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The framework of Interrogative Uncertainty Quantification (IUQ): Responses are sampled from LLMs and decomposed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Statistics of the claim-level faithfulness over selected models. (a) The faithfulness scores of all claims over FActScore [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: AUROCs of IUQ and baselines on different numbers [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Model faithfulness on claims within individual generation. Results for FActScore and LongFact are shown with [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
read the original abstract

Despite the rapid advancement of Large Language Models (LLMs), uncertainty quantification in LLM generation is a persistent challenge. Although recent approaches have achieved strong performance by restricting LLMs to produce short or constrained answer sets, many real-world applications require long-form and free-form text generation. A key difficulty in this setting is that LLMs often produce responses that are semantically coherent yet factually inaccurate, while the underlying semantics are multifaceted and the linguistic structure is complex. To tackle this challenge, this paper introduces Interrogative Uncertainty Quantification (IUQ), a novel framework that leverages inter-sample consistency and intra-sample faithfulness to quantify the uncertainty in long-form LLM outputs. By utilizing an interrogate-then-respond paradigm, our method provides reliable measures of claim-level uncertainty and the model's faithfulness. Experimental results across diverse model families and model sizes demonstrate the superior performance of IUQ over two widely used long-form generation datasets. The code is available at https://github.com/louisfanhz/IUQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Interrogative Uncertainty Quantification (IUQ), a framework for uncertainty estimation in long-form, free-form LLM generations. It uses an interrogate-then-respond paradigm to compute inter-sample consistency (across multiple interrogated responses) and intra-sample faithfulness (within a single response) as proxies for claim-level uncertainty and factual reliability. The central claim is that this yields superior performance compared to prior methods on two standard long-form generation datasets, across multiple model families and sizes.

Significance. If the interrogate-then-respond measurements cleanly isolate properties of the original generation without confounding effects, IUQ could meaningfully advance uncertainty quantification beyond short-answer or constrained settings, where current methods are limited. The availability of code supports reproducibility and potential follow-up work on claim-level calibration in open-ended text.

major comments (3)
  1. [Abstract and §4 (Experiments)] The central experimental claim (superior performance on long-form datasets) is asserted in the abstract and §4 but lacks reported metrics, baselines, statistical significance tests, or ablation results in the provided summary information; without these, the superiority cannot be verified and the claim remains ungrounded.
  2. [§3 (Methodology)] The interrogate-then-respond paradigm (described in §3) risks confounding the measured inter-sample consistency and intra-sample faithfulness with interrogation-induced shifts in the model's output distribution or semantic focus. For semantically coherent but factually inaccurate long-form text, this could mean the consistency scores reflect the modified interaction rather than the uncertainty of the original response, directly undermining the claim-level uncertainty quantification.
  3. [§4 (Experiments) and §5 (Discussion)] The weakest assumption—that inter-sample consistency and intra-sample faithfulness reliably proxy true uncertainty and factual accuracy—is not tested against ground-truth factuality annotations or human judgments in the reported experiments; this leaves open whether the measures correlate with actual error rates in multifaceted outputs.
minor comments (2)
  1. [§3] Notation for consistency and faithfulness scores should be defined with explicit formulas (e.g., how inter-sample agreement is aggregated across claims) to allow replication.
  2. [§4] The two datasets used should be named explicitly with citation and statistics (e.g., average response length, number of claims per response) in §4.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below with clarifications from the full paper and indicate where revisions will be made to improve rigor and transparency.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central experimental claim (superior performance on long-form datasets) is asserted in the abstract and §4 but lacks reported metrics, baselines, statistical significance tests, or ablation results in the provided summary information; without these, the superiority cannot be verified and the claim remains ungrounded.

    Authors: The full manuscript reports these details in §4. Tables 1 and 2 present quantitative metrics (AUROC, ECE, factuality F1) on LongForm and ELI5 datasets, with comparisons to baselines including semantic entropy, self-consistency, and verbalized uncertainty across LLaMA, Mistral, and GPT model families. Ablation results on inter-sample and intra-sample components appear in Table 3 and Figure 4. Statistical significance via paired t-tests is included for main results. If the summary provided to the referee was abbreviated, we will ensure all tables and significance values are highlighted more prominently in the revision. revision: partial

  2. Referee: [§3 (Methodology)] The interrogate-then-respond paradigm (described in §3) risks confounding the measured inter-sample consistency and intra-sample faithfulness with interrogation-induced shifts in the model's output distribution or semantic focus. For semantically coherent but factually inaccurate long-form text, this could mean the consistency scores reflect the modified interaction rather than the uncertainty of the original response, directly undermining the claim-level uncertainty quantification.

    Authors: This is a substantive methodological concern. IUQ derives interrogative prompts directly from claims in the original generation and applies identical sampling parameters to preserve the output distribution. We will add a dedicated paragraph in §3.2 with a formal argument for why the paradigm isolates original uncertainty (supported by invariance checks) and include qualitative examples in the appendix showing that low consistency scores align with factual errors in the base response rather than interrogation artifacts. revision: partial

  3. Referee: [§4 (Experiments) and §5 (Discussion)] The weakest assumption—that inter-sample consistency and intra-sample faithfulness reliably proxy true uncertainty and factual accuracy—is not tested against ground-truth factuality annotations or human judgments in the reported experiments; this leaves open whether the measures correlate with actual error rates in multifaceted outputs.

    Authors: We agree that explicit correlation analysis with human factuality judgments would provide stronger validation. The current results rely on the datasets' existing factuality annotations and automatic metrics. We will add a human evaluation study (with inter-annotator agreement and Spearman correlations between IUQ scores and per-claim accuracy ratings) to an expanded §4.5, with discussion of implications in §5. revision: yes

Circularity Check

0 steps flagged

No circularity: IUQ defines uncertainty directly from computed consistency and faithfulness

full rationale

The paper introduces IUQ as a framework that computes claim-level uncertainty and faithfulness via inter-sample consistency and intra-sample faithfulness under an interrogate-then-respond paradigm. These quantities are obtained by direct measurement on generated outputs rather than by fitting parameters to the target result or by self-referential definition. The central claim of superior performance is supported by experiments on external datasets, not by any derivation that reduces to the inputs by construction. No self-citations, uniqueness theorems, ansatz smuggling, or renaming of known results appear as load-bearing steps in the described approach. The derivation chain remains self-contained and independent of the measured quantities themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework relies on standard LLM prompting and consistency metrics without additional postulated components.

pith-pipeline@v0.9.0 · 5472 in / 1066 out tokens · 28160 ms · 2026-05-10T11:41:03.636964+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 32 canonical work pages · 9 internal anchors

  1. [1]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, and 12 others. 2020. https://arxiv.org/abs/2005.14165 Lan...

  4. [4]

    Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. 2024 a . https://openreview.net/forum?id=Zj12nzlQbz INSIDE : LLM s' internal states retain the power of hallucination detection . In The Twelfth International Conference on Learning Representations

  5. [5]

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. 2024 b . https://openreview.net/forum?id=LjsjHF7nAN Universal self-consistency for large language models . In ICML 2024 Workshop on In-Context Learning

  6. [6]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, and 48 others. 2022. https://arxiv.org/abs/2204.02311 Palm: ...

  7. [7]

    Jinhao Duan, Hao Cheng, Shiqi Wang, Alex Zavalny, Chenan Wang, Renjing Xu, Bhavya Kailkhura, and Kaidi Xu. 2024. https://doi.org/10.18653/v1/2024.acl-long.276 Shifting attention to relevance: Towards the predictive uncertainty quantification of free-form large language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational...

  8. [8]

    Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, and Kaidi Xu. 2025. https://arxiv.org/abs/2506.17419 Uprop: Investigating the uncertainty propagation of llms in multi-step agentic decision-making . Preprint, arXiv:2506.17419

  9. [9]

    Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, and Kaidi Xu. 2026. https://arxiv.org/abs/2503.10602 Truthprint: Mitigating large vision-language models object hallucination via latent truthful-guided pre-intervention . Preprint, arXiv:2503.10602

  10. [10]

    Jinhao Duan, Fei Kong, Shiqi Wang, Xiaoshuang Shi, and Kaidi Xu. 2023. https://arxiv.org/abs/2302.01316 Are diffusion models vulnerable to membership inference attacks? Preprint, arXiv:2302.01316

  11. [12]

    Ekaterina Fadeeva, Aleksandr Rubashevskii, Artem Shelmanov, Sergey Petrakov, Haonan Li, Hamdy Mubarak, Evgenii Tsymbalov, Gleb Kuzmin, Alexander Panchenko, Timothy Baldwin, Preslav Nakov, and Maxim Panov. 2024 b . https://arxiv.org/abs/2403.04696 Fact-checking the output of large language models via token-level uncertainty quantification . Preprint, arXiv...

  12. [13]

    Ekaterina Fadeeva, Roman Vashurin, Akim Tsvigun, Artem Vazhentsev, Sergey Petrakov, Kirill Fedyanin, Daniil Vasilev, Elizaveta Goncharova, Alexander Panchenko, Maxim Panov, Timothy Baldwin, and Artem Shelmanov. 2023. https://doi.org/10.18653/v1/2023.emnlp-demo.41 LM -polygraph: Uncertainty estimation for language models . In Proceedings of the 2023 Confer...

  13. [14]

    Marina Fomicheva, Shuo Sun, Lisa Yankovskaya, Frédéric Blain, Francisco Guzmán, Mark Fishel, Nikolaos Aletras, Vishrav Chaudhary, and Lucia Specia. 2020. https://doi.org/10.1162/tacl_a_00330 Unsupervised quality estimation for neural machine translation . Transactions of the Association for Computational Linguistics, 8:539--555

  14. [15]

    Xiang Gao, Jiaxin Zhang, Lalla Mouatadid, and Kamalika Das. 2024. https://aclanthology.org/2024.eacl-long.143/ SPUQ : Perturbation-based uncertainty quantification for large language models . In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2336--2346. Association...

  15. [16]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

  16. [17]

    Yu, and Zhijiang Guo

    Xuming Hu, Junzhe Chen, Xiaochuan Li, Yufei Guo, Lijie Wen, Philip S. Yu, and Zhijiang Guo. 2024. https://openreview.net/forum?id=9OevMUdods Towards understanding factual knowledge of large language models . In The Twelfth International Conference on Learning Representations

  17. [18]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2025. https://doi.org/10.1145/3703155 A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions . ACM Transactions on Information Systems, 43(2):1–55

  18. [19]

    Ziwei Ji, Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, and Kai Chen. 2024. https://doi.org/10.18653/v1/2024.acl-long.442 ANAH : Analytical annotation of hallucinations in large language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8135--8158, Bangkok, Thailand. Associ...

  19. [20]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/abs/2310.0...

  20. [21]

    Che Jiang, Biqing Qi, Xiangyu Hong, Dayuan Fu, Yang Cheng, Fandong Meng, Mo Yu, Bowen Zhou, and Jie Zhou. 2024 a . https://doi.org/10.18653/v1/2024.naacl-long.60 On large language models' hallucination with regard to known facts . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human La...

  21. [22]

    Mingjian Jiang, Yangjun Ruan, Prasanna Sattigeri, Salim Roukos, and Tatsunori Hashimoto. 2024 b . https://openreview.net/forum?id=YgJPQW0lkO Graph-based uncertainty metrics for long-form language model generations . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  22. [23]

    Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei. 2023. https://doi.org/10.18653/v1/2023.acl-long.307 Evaluating open-domain question answering in the era of large language models . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591--5606. Association for Computation...

  23. [24]

    Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org

  24. [25]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. https://openreview.net/forum?id=VD-AYtP0dve Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation . In The Eleventh International Conference on Learning Representations

  25. [26]

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. 2024. https://openreview.net/forum?id=DWkJCSxKU5 Generating with confidence: Uncertainty quantification for black-box large language models . Transactions on Machine Learning Research

  26. [27]

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/v1/2023.acl-long.546 When not to trust language models: Investigating effectiveness of parametric and non-parametric memories . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

  27. [28]

    Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.557 S elf C heck GPT : Zero-resource black-box hallucination detection for generative large language models . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004--9017. Association for Computational Linguistics

  28. [29]

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.741 FA ct S core: Fine-grained atomic evaluation of factual precision in long form text generation . In Proceedings of the 2023 Conference on Empirical Methods in Natural Langua...

  29. [30]

    Christopher Mohri and Tatsunori Hashimoto. 2024. https://arxiv.org/abs/2402.10978 Language models with conformal factuality guarantees . Preprint, arXiv:2402.10978

  30. [31]

    arXiv preprint arXiv:2112.07899 , year=

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. 2021. https://arxiv.org/abs/2112.07899 Large dual encoders are generalizable retrievers . Preprint, arXiv:2112.07899

  31. [32]

    Alexander V Nikitin, Jannik Kossen, Yarin Gal, and Pekka Marttinen. 2024. https://openreview.net/forum?id=j2wCrWmgMX Kernel language entropy: Fine-grained uncertainty quantification for LLM s from semantic similarities . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  32. [33]

    OpenAI, :, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, and 401 others. 2024. https://arxiv.org/abs/2410.21276 Gpt-4o system card . Preprint, arXiv:2410.21276

  33. [34]

    Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2024. https://doi.org/10.1162/tacl_a_00660 Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies . Transactions of the Association for Computational Linguistics, 12:484--506

  34. [35]

    Ruiyang Ren, Yuhao Wang, Yingqi Qu, Wayne Xin Zhao, Jing Liu, Hua Wu, Ji-Rong Wen, and Haifeng Wang. 2025. https://aclanthology.org/2025.coling-main.250/ Investigating the factual knowledge boundary of large language models with retrieval augmentation . In Proceedings of the 31st International Conference on Computational Linguistics, pages 3697--3715. Ass...

  35. [36]

    Mauricio Rivera, Jean-Fran c ois Godbout, Reihaneh Rabbany, and Kellin Pelrine. 2024. https://aclanthology.org/2024.uncertainlp-1.12/ Combining confidence elicitation and sample-based methods for uncertainty quantification in misinformation mitigation . In Proceedings of the 1st Workshop on Uncertainty-Aware NLP (UncertaiNLP 2024), pages 114--126. Associa...

  36. [37]

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. 2023. https://openreview.net/forum?id=vAElhFcKW6 Reflexion: language agents with verbal reinforcement learning . In Thirty-seventh Conference on Neural Information Processing Systems

  37. [38]

    Yixiao Song, Yekyung Kim, and Mohit Iyyer. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.552 V eri S core: Evaluating the factuality of verifiable claims in long-form text generation . In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9447--9474. Association for Computational Linguistics

  38. [39]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://arxiv.org/abs/2503.19786...

  39. [40]

    Francesco Tonolini, Nikolaos Aletras, Jordan Massiah, and Gabriella Kazai. 2024. https://doi.org/10.18653/v1/2024.findings-acl.728 B ayesian prompt ensembles: Model uncertainty estimation for black-box large language models . In Findings of the Association for Computational Linguistics: ACL 2024, pages 12229--12272. Association for Computational Linguistics

  40. [41]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. https://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models . Preprint, arXiv:2302.13971

  41. [42]

    Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. https://openreview.net/forum?id=1PL1NIMMrw Self-consistency improves chain of thought reasoning in language models . In The Eleventh International Conference on Learning Representations

  42. [43]

    Zhiyuan Wang, Jinhao Duan, Qingni Wang, Xiaofeng Zhu, Tianlong Chen, Xiaoshuang Shi, and Kaidi Xu. 2025. https://arxiv.org/abs/2506.20178 Coin: Uncertainty-guarding selective question answering for foundation models with provable risk guarantees . Preprint, arXiv:2506.20178

  43. [44]

    Zhiyuan Wang, Jinhao Duan, Chenxi Yuan, Qingyu Chen, Tianlong Chen, Yue Zhang, Ren Wang, Xiaoshuang Shi, and Kaidi Xu. 2024. https://arxiv.org/abs/2402.14259 Word-sequence entropy: Towards uncertainty estimation in free-form medical question answering applications and beyond . Preprint, arXiv:2402.14259

  44. [45]

    Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Zixia Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V Le. 2024. https://openreview.net/forum?id=4M9f8VMt2C Long-form factuality in large language models . In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  45. [46]

    Florian Wellmann and Klaus Regenauer-Lieb

    J. Florian Wellmann and Klaus Regenauer-Lieb. 2012. https://doi.org/10.1016/j.tecto.2011.05.001 Uncertainties have a meaning: Information entropy as a quality measure for 3-d geological models . Tectonophysics, 526-529:207--216. Modelling in Geosciences

  46. [47]

    Wikipedia . 2026. https://en.wikipedia.org/wiki/Rory_Byrne Rory byrne --- wikipedia , the free encyclopedia . [Online; accessed 14-April-2026]

  47. [48]

    Miao Xiong, Zhiyuan Hu, Xinyang Lu, YIFEI LI, Jie Fu, Junxian He, and Bryan Hooi. 2024. https://openreview.net/forum?id=gjeQKFxFpZ Can LLM s express their uncertainty? an empirical evaluation of confidence elicitation in LLM s . In The Twelfth International Conference on Learning Representations

  48. [49]

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, and 43 others. 2024. https://arxiv.org/abs/2407.10671 Qwen2 technical report . Preprint, arXiv:2407.10671

  49. [50]

    Caiqi Zhang, Fangyu Liu, Marco Basaldella, and Nigel Collier. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.299 LUQ : Long-text uncertainty quantification for LLM s . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5244--5262. Association for Computational Linguistics

  50. [51]

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023. https://arxiv.org/abs/2309.01219 Siren's song in the ai ocean: A survey on hallucination in large language models . Preprint, arXiv:2309.01219