pith. machine review for the scientific record.

arxiv: 2605.01749 · v1 · submitted 2026-05-03 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords calibration-aware generation · long-form factuality · exploration-commitment decoupling · hallucination reduction · reliable reasoning · large language models · factuality benchmarks

The pith

By decoupling exploration from commitment, language models can generate more factual long-form responses using calibrated reliability estimates for each reasoning step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current methods for factuality in large reasoning models link every intermediate reasoning step directly to the final output, which allows errors to build up across long generations. The paper proposes an Exploration-Commitment Decoupling paradigm that separates free exploration of knowledge from cautious commitment to an answer. It implements this through Calibration-Aware Generation, which adds reliability estimates to reasoning steps and selects only the reliable parts for the output. A sympathetic reader would care because long-form generation is where hallucinations most damage trustworthiness in AI systems.

Core claim

Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches follow a coupled exploration-commitment paradigm that unconditionally propagates intermediate reasoning to the final output and limits fine-grained control. We propose an Exploration-Commitment Decoupling paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with Calibration-Aware Generation, a framework that equips models with end-to-end calibration-aware generation by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs.

What carries the argument

Calibration-Aware Generation (CAG): a framework that augments each intermediate reasoning step with a calibrated reliability estimate and uses those estimates to prioritize reliable content when forming the final output.
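
To make the mechanism concrete, here is a minimal sketch of the commitment stage, assuming each explored step arrives with a calibrated reliability score in [0, 1]. The names (ReasoningStep, commit, tau) and the plain thresholding rule are illustrative, not the paper's API; CAG's actual selection procedure may be more elaborate.

```python
# Minimal sketch of exploration-commitment decoupling at decode time:
# exploration produces candidate steps with attached reliability estimates,
# and commitment filters them instead of propagating everything.
from dataclasses import dataclass

@dataclass
class ReasoningStep:
    text: str
    reliability: float  # calibrated estimate in [0, 1] from the exploration phase

def commit(steps: list[ReasoningStep], tau: float = 0.7) -> str:
    """Keep only steps whose reliability clears the threshold tau,
    preserving order, and join them into the final answer."""
    return " ".join(s.text for s in steps if s.reliability >= tau)

steps = [
    ReasoningStep("Paris is the capital of France.", 0.98),
    ReasoningStep("Its population is exactly 2,102,650.", 0.35),  # shaky claim, dropped
]
print(commit(steps))  # -> "Paris is the capital of France."
```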

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of exploration from commitment could be tested in code generation or mathematical proof tasks where partial errors also compound.
  • Models trained under this paradigm might need fewer external fact-checking steps after generation because unreliable content is filtered during decoding.
  • Extending the reliability estimates to include uncertainty from retrieved external documents could further strengthen long-form outputs that draw on outside sources.

Load-bearing premise

That models can produce accurate, calibrated reliability estimates for their own intermediate reasoning steps, precise enough to prioritize reliable content correctly without dropping necessary information or introducing selection biases.
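
One way this premise could be audited, sketched below under assumptions the extract does not confirm: compute the expected calibration error (ECE) of the step-level reliability estimates against independently verified correctness labels. The binning scheme and the toy data are hypothetical.

```python
# Hedged sketch: ECE of step-level reliability estimates, given 0/1
# ground-truth correctness labels from an external verifier.
import numpy as np

def expected_calibration_error(scores: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Bin-mass-weighted mean |confidence - accuracy| over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (scores >= lo) & ((scores < hi) | (hi == 1.0))
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - correct[mask].mean())
    return ece

scores = np.array([0.90, 0.80, 0.60, 0.30, 0.95])   # model's reliability estimates
correct = np.array([1, 1, 0, 0, 1])                 # verified step correctness
print(f"step-level ECE: {expected_calibration_error(scores, correct):.3f}")
```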

What would settle it

Apply Calibration-Aware Generation to the same long-form factuality benchmarks used in the paper: observing no gain, or a loss, in factuality scores relative to standard generation baselines would refute the core claim.

Figures

Figures reproduced from arXiv: 2605.01749 by Feifan Song, Furu Wei, Guangyue Peng, Houfeng Wang, Liang Wang, Nan Yang, Shaohang Wei, Wei Li, Wen Luo, Yuhan Song.

Figure 1. Comparison between standard generation (left) and the proposed Exploration–Commitment …
Figure 2. Performance comparison across PopQA, GPQA, and Vicuna QA benchmarks. CASS and …
Figure 3. Calibration performance and its effect on answer organization. (a) Calibration quality …
Figure 4. Ablation on the bucketing threshold τ. Performance is measured by VeriScore across different datasets and model families. Moderate thresholds achieve the best performance, reflecting a factuality–helpfulness trade-off: lower thresholds improve coverage but introduce hallucinations, while higher thresholds increase factuality at the cost of discarding useful information. The optimal threshold range is stab…
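
The extraction also preserves fragments of the paper's Appendix A, which gives a decision-theoretic justification for this bucketing: thresholding posterior correctness probabilities is Bayes-optimal under asymmetric utility, and thresholding an ε-accurate estimate of those probabilities is near-optimal. Reassembled from those fragments (u₁ and u₂ are the paper's asymmetric utilities for the two error types; their exact definitions are not recoverable here):

```latex
% Reassembled from the paper's Appendix A fragments (eqs. 22-23).
If $|p_i - s_i| \le \epsilon$ and thresholding the calibrated score $s_i$ at
$\tau^*$ disagrees with thresholding the true posterior $p_i$, then $p_i$ and
$s_i$ lie on opposite sides of $\tau^*$, which implies
\begin{equation}
  |p_i - \tau^*| \le |p_i - s_i| \le \epsilon. \tag{22}
\end{equation}
Therefore, the per-step utility gap satisfies
\begin{equation}
  \Delta_i = |V_i| = (u_1 + u_2)\,|p_i - \tau^*| \le (u_1 + u_2)\,\epsilon, \tag{23}
\end{equation}
so thresholding an $\epsilon$-accurate estimate of the posterior yields an
$O(\epsilon)$-optimal decision rule.
```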
read the original abstract

Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a \emph{coupled exploration-commitment} paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an \textbf{Exploration-Commitment Decoupling} paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with \textbf{Calibration-Aware Generation (CAG)}, a framework that equips models with end-to-end, calibration-aware generation capabilities, by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that large reasoning models suffer from hallucinations in long-form generation due to a coupled exploration-commitment paradigm in existing factuality methods. It proposes an Exploration-Commitment Decoupling paradigm instantiated via Calibration-Aware Generation (CAG), which augments intermediate reasoning steps with calibrated reliability estimates to prioritize reliable content during final output generation. Across five long-form factuality benchmarks and multiple model families, CAG is reported to improve factuality by up to 13% while reducing decoding time by up to 37%.

Significance. If the empirical claims hold under rigorous evaluation, the decoupling paradigm could represent a meaningful advance for trustworthy long-form generation in LLMs. By separating exploration from commitment and incorporating calibration, the approach offers a principled alternative to abstention or optimization-based methods, with potential for more fine-grained control over factuality and efficiency gains that could aid practical deployment of reliable generative systems.

major comments (2)
  1. [Abstract and Experiments] The abstract asserts specific percentage improvements (up to 13% factuality and 37% decoding time) but provides no information on evaluation protocols, baseline comparisons, statistical testing, or how factuality was quantified. This leaves the central empirical claim unsupported by visible evidence and requires detailed reporting in the results to substantiate the CAG framework's effectiveness.
  2. [§3, CAG framework description] The core mechanism relies on the assumption that model-generated calibrated reliability estimates enable accurate prioritization of reliable content without losing critical information or introducing selection biases. However, no separate verification or ablation is described to confirm the accuracy of these estimates on intermediate steps, which is load-bearing for the reported gains given that the estimates come from the same hallucination-prone model family.
minor comments (2)
  1. [Method] Clarify the exact definition and computation of 'calibrated reliability estimates' with an equation or pseudocode in the method section to improve reproducibility (a hedged sketch of one possible form follows this list).
  2. [Related Work] Ensure the related work section comprehensively cites prior calibration techniques for LLMs and long-form factuality benchmarks for proper context.
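
As a gloss on minor comment 1, here is what such pseudocode might look like. This is one plausible construction, not the paper's definition: raw step confidence from the mean token log-probability, passed through a Platt-style calibrator fit on held-out verified steps. All names and parameters are illustrative.

```python
# Hypothetical sketch of a "calibrated reliability estimate" for one reasoning
# step; the paper's actual construction is not specified in the extract.
import math

def raw_step_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability of the step under the model."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def calibrate(conf: float, a: float, b: float) -> float:
    """Platt scaling: a logistic map whose parameters (a, b) are fit so that
    calibrated scores track empirical correctness rates on held-out steps."""
    logit = math.log(conf / (1.0 - conf))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))

# (a, b) would come from fitting on (raw confidence, verified correctness)
# pairs; the values here are placeholders.
reliability = calibrate(raw_step_confidence([-0.05, -0.20, -0.10]), a=1.3, b=-0.4)
print(f"calibrated reliability: {reliability:.2f}")
```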

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback and the recommendation for major revision. We address each major comment below, providing clarifications on the existing content in the manuscript and outlining targeted revisions to enhance the visibility of evaluation details and the validation of the framework's core components.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The abstract asserts specific percentage improvements (up to 13% factuality and 37% decoding time) but provides no information on evaluation protocols, baseline comparisons, statistical testing, or how factuality was quantified. This leaves the central empirical claim unsupported by visible evidence and requires detailed reporting in the results to substantiate the CAG framework's effectiveness.

    Authors: We agree that the abstract would benefit from additional context to immediately support the reported gains. While Section 4 (Experiments) already details the five benchmarks, baseline comparisons (including coupled exploration-commitment methods), factuality quantification via atomic fact verification, and efficiency via decoding time, along with statistical reporting over multiple seeds, we will revise the abstract to concisely reference the evaluation protocol (e.g., 'across five benchmarks using automatic factuality metrics and standard baselines'). We will also add an explicit summary paragraph at the start of the results subsection reiterating the metrics, baselines, and testing procedures. These changes will make the empirical support more self-contained without expanding the abstract length substantially. revision: yes

  2. Referee: [§3, CAG framework description] The core mechanism relies on the assumption that model-generated calibrated reliability estimates enable accurate prioritization of reliable content without losing critical information or introducing selection biases. However, no separate verification or ablation is described to confirm the accuracy of these estimates on intermediate steps, which is load-bearing for the reported gains given that the estimates come from the same hallucination-prone model family.

    Authors: This concern about validating the intermediate reliability estimates is well-taken, as they are central to the decoupling paradigm. Although the end-to-end benchmark results demonstrate the overall benefits of prioritization, we will add a dedicated ablation and analysis in Section 4 (or an appendix) that directly assesses the calibration accuracy of the estimates on intermediate reasoning steps. This will include correlation analysis against ground-truth factuality labels on held-out data and an examination of selection biases or information loss. Such additions will explicitly address the use of estimates from the same model family by highlighting how the calibration process improves their reliability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical framework with independent experimental validation

full rationale

The paper proposes Exploration-Commitment Decoupling instantiated as Calibration-Aware Generation (CAG) as a practical framework for long-form factuality, validated empirically across five benchmarks and multiple model families with reported gains in factuality and decoding efficiency. No equations, derivations, or first-principles predictions are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The core contribution is an architectural and prompting approach whose claims rest on external benchmark results rather than renaming known patterns or smuggling ansatzes via self-citation. This is the common case of a self-contained empirical method paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, mathematical axioms, or new physical entities are identifiable; the bucketing threshold τ ablated in Figure 4 would qualify as a free parameter, but it is not visible from the abstract alone. The contribution centers on a procedural framework rather than new theoretical primitives.

pith-pipeline@v0.9.0 · 5523 in / 1114 out tokens · 88131 ms · 2026-05-10T15:45:46.139730+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

34 extracted references · 22 canonical work pages · 7 internal anchors
