Recognition: 2 theorem links · Lean Theorem
Only Say What You Know: Calibration-Aware Generation for Long-Form Factuality
Pith reviewed 2026-05-10 15:45 UTC · model grok-4.3
The pith
By decoupling exploration from commitment, language models can generate more factual long-form responses using calibrated reliability estimates for each reasoning step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches follow a coupled exploration-commitment paradigm that unconditionally propagates intermediate reasoning to the final output and limits fine-grained control. We propose an Exploration-Commitment Decoupling paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with Calibration-Aware Generation, a framework that equips models with end-to-end calibration-aware generation by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs.
What carries the argument
Calibration-Aware Generation (CAG): a framework that augments each intermediate reasoning step with a calibrated reliability estimate and uses those estimates to prioritize reliable content when forming the final output.
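To make the described mechanism concrete, here is a minimal sketch of a decoupled exploration/commitment loop. It is not the paper's implementation: the helper methods (generate_reasoning_steps, estimate_reliability, generate_final) and the threshold value are assumptions for illustration; only the bucketing rule mirrors the <reliable>/<unreliable> tagging quoted later in this review.

```python
from dataclasses import dataclass

RELIABLE, UNRELIABLE = "<reliable>", "<unreliable>"


@dataclass
class Step:
    text: str     # one intermediate reasoning step
    score: float  # calibrated estimate that the step is factual, in [0, 1]
    tag: str      # RELIABLE or UNRELIABLE


def bucket(score: float, tau: float = 0.7) -> str:
    """Bucketing rule: <reliable> if the score clears the threshold, else <unreliable>."""
    return RELIABLE if score >= tau else UNRELIABLE


def explore(model, prompt: str) -> list[Step]:
    """Exploration phase: draft reasoning steps and attach reliability estimates."""
    steps = []
    for text in model.generate_reasoning_steps(prompt):   # hypothetical API
        score = model.estimate_reliability(text)          # hypothetical API
        steps.append(Step(text, score, bucket(score)))
    return steps


def commit(model, prompt: str, steps: list[Step]) -> str:
    """Commitment phase: condition the final answer on the tagged steps,
    so the model can prefer reliable content rather than propagate everything."""
    context = "\n".join(f"{step.tag} {step.text}" for step in steps)
    return model.generate_final(prompt, context)          # hypothetical API
```

The point of the sketch is the separation itself: exploration yields scored, tagged steps, and commitment sees those tags instead of inheriting every step unconditionally.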
Where Pith is reading between the lines
- The same separation of exploration from commitment could be tested in code generation or mathematical proof tasks where partial errors also compound.
- Models trained under this paradigm might need fewer external fact-checking steps after generation because unreliable content is filtered during decoding.
- Extending the reliability estimates to include uncertainty from retrieved external documents could further strengthen long-form outputs that draw on outside sources.
Load-bearing premise
That models can produce accurate, calibrated reliability estimates for their own intermediate reasoning steps, in a way that permits correct prioritization of reliable content without dropping necessary information or introducing selection biases.
What would settle it
Apply Calibration-Aware Generation to the same long-form factuality benchmarks used in the paper; observing no gain, or a loss, in factuality scores relative to standard generation baselines would settle it against the core claim.
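A hedged sketch of what that check could look like: generate paired outputs on the same benchmark prompts, score each with an atomic-fact-style metric (in the spirit of FactScore or VeriScore), and inspect the mean difference. The scorer interface (extract_claims, is_supported) and the generator callables are assumptions for illustration, not any specific library's API.

```python
import statistics


def factuality_score(response: str, scorer) -> float:
    """Stand-in for an atomic-fact metric: the fraction of extracted claims
    that the (assumed) scorer judges supported."""
    claims = scorer.extract_claims(response)          # hypothetical API
    if not claims:
        return 0.0
    return sum(scorer.is_supported(c) for c in claims) / len(claims)


def compare_factuality(prompts, baseline_generate, cag_generate, scorer):
    """Paired comparison on identical prompts; a mean delta at or below zero
    is the outcome that would count against the paper's claim."""
    deltas = []
    for prompt in prompts:
        base = factuality_score(baseline_generate(prompt), scorer)
        cag = factuality_score(cag_generate(prompt), scorer)
        deltas.append(cag - base)
    return statistics.mean(deltas), statistics.stdev(deltas)
```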
read the original abstract
Large Reasoning Models achieve strong performance on complex tasks but remain prone to hallucinations, particularly in long-form generation where errors compound across reasoning steps. Existing approaches to improving factuality, including abstention and factuality-driven optimization, follow a coupled exploration-commitment paradigm, in which intermediate reasoning is unconditionally propagated to the final output, limiting fine-grained control over information selection and integration. In this paper, we propose an Exploration-Commitment Decoupling paradigm that disentangles knowledge exploration from final commitment, enabling models to explore with awareness while answering cautiously. We instantiate the paradigm with Calibration-Aware Generation (CAG), a framework that equips models with end-to-end, calibration-aware generation capabilities by augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs. Across five long-form factuality benchmarks and multiple model families, CAG improves factuality by up to 13%, while reducing decoding time by up to 37%. Overall, our work highlights decoupling as a principled approach for more reliable long-form generation, offering directions for trustworthy and self-aware generative systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that large reasoning models suffer from hallucinations in long-form generation due to a coupled exploration-commitment paradigm in existing factuality methods. It proposes an Exploration-Commitment Decoupling paradigm instantiated via Calibration-Aware Generation (CAG), which augments intermediate reasoning steps with calibrated reliability estimates to prioritize reliable content during final output generation. Across five long-form factuality benchmarks and multiple model families, CAG is reported to improve factuality by up to 13% while reducing decoding time by up to 37%.
Significance. If the empirical claims hold under rigorous evaluation, the decoupling paradigm could represent a meaningful advance for trustworthy long-form generation in LLMs. By separating exploration from commitment and incorporating calibration, the approach offers a principled alternative to abstention or optimization-based methods, with potential for more fine-grained control over factuality and efficiency gains that could aid practical deployment of reliable generative systems.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The abstract asserts specific percentage improvements (up to 13% factuality and 37% decoding time) but provides no information on evaluation protocols, baseline comparisons, statistical testing, or how factuality was quantified. This leaves the central empirical claim unsupported by visible evidence and requires detailed reporting in the results to substantiate the CAG framework's effectiveness.
- [§3] §3 (CAG Framework description): The core mechanism relies on the assumption that model-generated calibrated reliability estimates enable accurate prioritization of reliable content without losing critical information or introducing selection biases. However, no separate verification or ablation is described to confirm the accuracy of these estimates on intermediate steps, which is load-bearing for the reported gains given that the estimates come from the same hallucination-prone model family.
minor comments (2)
- [Method] Clarify the exact definition and computation of 'calibrated reliability estimates' with an equation or pseudocode in the method section to improve reproducibility (one possible formalization is sketched after this list).
- [Related Work] Ensure the related work section comprehensively cites prior calibration techniques for LLMs and long-form factuality benchmarks for proper context.
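One possible formalization of the requested definition, with notation that is ours rather than the paper's: read each score as an estimated probability that a reasoning step is factual, call the estimates calibrated when empirical accuracy matches the stated score, and bucket against a threshold for commitment.

```latex
% Hypothetical formalization; the symbols s_i, r_i, c_i, \tau are illustrative, not the paper's notation.
s_i \approx \Pr\bigl(\text{step } r_i \text{ is factual} \mid r_1,\dots,r_i\bigr),
\qquad
\Pr\bigl(\text{factual} \mid s_i = p\bigr) = p \ \text{ for all } p \in [0,1],
\qquad
c_i =
\begin{cases}
  \langle\text{reliable}\rangle   & \text{if } s_i \ge \tau,\\
  \langle\text{unreliable}\rangle & \text{otherwise.}
\end{cases}
```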
Simulated Author's Rebuttal
Thank you for the constructive feedback and the recommendation for major revision. We address each major comment below, providing clarifications on the existing content in the manuscript and outlining targeted revisions to enhance the visibility of evaluation details and the validation of the framework's core components.
read point-by-point responses
- Referee: [Abstract and Experiments] Abstract and Experiments section: The abstract asserts specific percentage improvements (up to 13% factuality and 37% decoding time) but provides no information on evaluation protocols, baseline comparisons, statistical testing, or how factuality was quantified. This leaves the central empirical claim unsupported by visible evidence and requires detailed reporting in the results to substantiate the CAG framework's effectiveness.
Authors: We agree that the abstract would benefit from additional context to immediately support the reported gains. While Section 4 (Experiments) already details the five benchmarks, baseline comparisons (including coupled exploration-commitment methods), factuality quantification via atomic fact verification, and efficiency via decoding time, along with statistical reporting over multiple seeds, we will revise the abstract to concisely reference the evaluation protocol (e.g., 'across five benchmarks using automatic factuality metrics and standard baselines'). We will also add an explicit summary paragraph at the start of the results subsection reiterating the metrics, baselines, and testing procedures. These changes will make the empirical support more self-contained without expanding the abstract length substantially. revision: yes
- Referee: [§3] §3 (CAG Framework description): The core mechanism relies on the assumption that model-generated calibrated reliability estimates enable accurate prioritization of reliable content without losing critical information or introducing selection biases. However, no separate verification or ablation is described to confirm the accuracy of these estimates on intermediate steps, which is load-bearing for the reported gains given that the estimates come from the same hallucination-prone model family.
Authors: This concern about validating the intermediate reliability estimates is well-taken, as they are central to the decoupling paradigm. Although the end-to-end benchmark results demonstrate the overall benefits of prioritization, we will add a dedicated ablation and analysis in Section 4 (or an appendix) that directly assesses the calibration accuracy of the estimates on intermediate reasoning steps. This will include correlation analysis against ground-truth factuality labels on held-out data and an examination of selection biases or information loss. Such additions will explicitly address the use of estimates from the same model family by highlighting how the calibration process improves their reliability. revision: yes
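The promised ablation could be run with very little machinery, assuming each intermediate step comes with a reliability score and a ground-truth factuality label. The equal-width binning for calibration error and the point-biserial correlation below are our choices for illustration, not details taken from the paper.

```python
import numpy as np


def expected_calibration_error(scores, labels, n_bins: int = 10) -> float:
    """Bin step-level reliability scores and compare each bin's mean score
    to the empirical fraction of factually correct steps in that bin."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)  # 1.0 if the step is factual, else 0.0
    bins = np.clip((scores * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - labels[mask].mean())
    return ece


def score_label_correlation(scores, labels) -> float:
    """Point-biserial correlation between continuous scores and binary labels,
    one simple version of the correlation analysis the authors propose."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.corrcoef(scores, labels)[0, 1])
```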
Circularity Check
No significant circularity; empirical framework with independent experimental validation
full rationale
The paper proposes Exploration-Commitment Decoupling instantiated as Calibration-Aware Generation (CAG) as a practical framework for long-form factuality, validated empirically across five benchmarks and multiple model families with reported gains in factuality and decoding efficiency. No equations, derivations, or first-principles predictions are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. The core contribution is an architectural and prompting approach whose claims rest on external benchmark results rather than renaming known patterns or smuggling ansatzes via self-citation. This is the common case of a self-contained empirical method paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We propose an Exploration-Commitment Decoupling paradigm... augmenting intermediate reasoning with calibrated reliability estimates and prioritizing reliable content in final outputs.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative (tagged: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
c_i = Bucket(s_i) ... <reliable> if s_i ≥ τ, <unreliable> otherwise
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.