pith. machine review for the scientific record

arxiv: 2604.19656 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

Pause or Fabricate? Training Language Models for Grounded Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords: grounded reasoning · inferential boundary awareness · reinforcement learning · hallucination · premise detection · insufficient information · multi-turn RL

The pith

LLMs can be trained to detect missing information and pause rather than fabricate answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often generate confident but unreliable conclusions when inputs lack necessary facts, a failure the paper attributes to absent awareness of inferential boundaries rather than weak reasoning ability. GRIL addresses this by using multi-turn reinforcement learning to split the process into an initial stage that checks whether premises are sufficient and a later stage that only performs grounded reasoning once gaps are filled. Stage-specific rewards penalize hallucinations in both stages, training the model to stop proactively and request clarification instead of continuing with invented details. On math reasoning datasets constructed with insufficient information, this produces substantially higher rates of correct premise detection and task completion alongside shorter responses. The training also proves robust when user clarifications contain noise and transfers to problems outside the original training distribution.
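
A minimal sketch of what such a two-stage interaction loop could look like at inference time, written against a hypothetical generate/ask_user interface; the paper's actual prompts, output format, and turn limits are not given here, so every name below is an assumption rather than the authors' method.

    # Hedged sketch of a GRIL-style clarify-then-reason loop, not the authors' code.
    # `generate` and `ask_user` are hypothetical callables standing in for the policy
    # model and the (real or simulated) user who supplies missing premises.
    def clarify_then_reason(generate, ask_user, problem, max_turns=3):
        """Stage 1 pauses and asks when premises are missing; stage 2 answers."""
        history = []
        for turn in range(max_turns):
            reply = generate(problem, history)      # policy emits "ASK: ..." or "ANSWER: ..."
            if reply.startswith("ASK:"):
                question = reply[len("ASK:"):].strip()
                history.append((question, ask_user(question)))  # pause, then resume
            elif reply.startswith("ANSWER:"):
                return {"answer": reply[len("ANSWER:"):].strip(),
                        "turns": turn + 1, "history": history}
        return {"answer": None, "paused": True, "history": history}  # stop, don't fabricate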

Core claim

The central claim is that decomposing the reasoning process into a clarify-and-pause stage followed by a grounded reasoning stage, then applying stage-specific rewards within an interactive reinforcement learning loop, allows language models to acquire inferential boundary awareness so they reliably identify when premises are missing, refrain from fabrication, and achieve higher success on tasks with incomplete inputs.

What carries the argument

GRIL, a multi-turn reinforcement learning framework that decomposes reasoning into clarify-and-pause and grounded reasoning stages guided by rewards that penalize hallucinations.

Load-bearing premise

That custom stage-specific rewards can reliably instill inferential boundary awareness in LLMs without the model learning to game the rewards or overfitting to the artificially constructed insufficient-information datasets.

What would settle it

A new collection of incomplete reasoning problems on which GRIL-trained models still fabricate information at rates comparable to untrained models or show no gain in task success.

Original abstract

Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs exhibit ungrounded reasoning by fabricating information under incomplete inputs due to lacking inferential boundary awareness. It proposes GRIL, a multi-turn RL framework that decomposes reasoning into a clarify/pause stage (to detect missing premises) and a grounded reasoning stage, using stage-specific rewards to penalize hallucinations. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient report up to 45% gains in premise detection, 30% higher task success, and >20% shorter responses, with added robustness and generalization checks.

Significance. If the empirical gains are robust and not artifacts of the training setup, GRIL offers a concrete mechanism to instill boundary awareness via interactive RL, which could meaningfully improve reliability in reasoning tasks with incomplete information. The multi-stage decomposition and reward design are novel relative to standard RLHF or chain-of-thought approaches.

major comments (3)
  1. [Experiments / Dataset Construction] The abstract and experimental sections report quantitative gains on GSM8K-Insufficient and MetaMATH-Insufficient without detailing dataset construction (e.g., how premises are removed or insufficiency is verified). This is load-bearing because selection effects or artificial cues in these variants could inflate premise-detection and success metrics without demonstrating general boundary awareness. (A construction sketch illustrating this concern follows after the minor comments.)
  2. [GRIL Framework / Reward Design] Stage-specific reward formulations are referenced but not specified exactly (e.g., the precise penalty for hallucination in the clarify stage or the reward for resuming after clarification). Without the equations or pseudocode, it is impossible to assess whether the reported 45% premise-detection lift could arise from simple policies that maximize reward without learning true inferential boundaries.
  3. [Experiments / Baselines and Metrics] The paper compares GRIL to baselines but does not report the exact baseline implementations, hyperparameter matching, or statistical significance tests for the 30% task-success and 20% length reductions. This undermines the claim that the gains are attributable to the clarify/pause mechanism rather than training differences.
minor comments (2)
  1. [Method] Notation for the two stages (clarify/pause vs. grounded reasoning) is introduced in the abstract but should be formalized with consistent symbols or diagrams in §3 to aid reproducibility.
  2. [Additional Analyses] The generalization claim to out-of-distribution tasks is stated but lacks a dedicated table or figure showing per-task breakdowns; adding one would strengthen the evidence.
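
To make major comment 1 concrete: the paper does not spell out how the GSM8K-Insufficient and MetaMATH-Insufficient variants were built, so the following is one plausible construction, not the authors' documented procedure. It withholds a numeric premise from a word problem and records which clarification would restore sufficiency; a real pipeline would also need a verification step (e.g., confirming that a solver fails on the variant), which this sketch omits.

    # One possible (assumed, not documented) construction of an "insufficient" variant:
    # drop a numeric premise and keep it aside as the clarification target.
    import re

    def make_insufficient(problem: str):
        """Remove the last sentence containing a number; return the variant and the withheld premise."""
        sentences = [s.strip() for s in problem.split(".") if s.strip()]
        numeric = [i for i, s in enumerate(sentences) if re.search(r"\d", s)]
        if len(numeric) < 2:
            return None  # too few numeric premises to withhold one safely
        drop = numeric[-1]
        variant = ". ".join(s for i, s in enumerate(sentences) if i != drop)
        if not variant.endswith(("?", ".", "!")):
            variant += "."
        return {"insufficient_problem": variant, "withheld_premise": sentences[drop]}

    example = ("Tom buys 3 boxes of pens. Each box holds 12 pens. "
               "He gives away 5 pens. How many pens does he keep?")
    print(make_insufficient(example))  # withholds "He gives away 5 pens"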

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments identify areas where additional detail will improve clarity and reproducibility. We address each point below and commit to incorporating the requested information in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments / Dataset Construction] The abstract and experimental sections report quantitative gains on GSM8K-Insufficient and MetaMATH-Insufficient without detailing dataset construction (e.g., how premises are removed or insufficiency is verified). This is load-bearing because selection effects or artificial cues in these variants could inflate premise-detection and success metrics without demonstrating general boundary awareness.

    Authors: We agree that a more explicit description of dataset construction is necessary for full reproducibility and to rule out potential artifacts. While the current manuscript briefly introduces the insufficient variants, it does not provide step-by-step procedures for premise removal or verification. In the revised version we will add a dedicated subsection (likely in Section 4 or an appendix) that specifies the exact removal criteria, the verification protocol used to confirm insufficiency, and representative examples of original and modified problems. This addition will directly address concerns about selection effects and strengthen the interpretation of the reported gains as evidence of boundary awareness. revision: yes

  2. Referee: [GRIL Framework / Reward Design] Stage-specific reward formulations are referenced but not specified exactly (e.g., the precise penalty for hallucination in the clarify stage or the reward for resuming after clarification). Without the equations or pseudocode, it is impossible to assess whether the reported 45% premise-detection lift could arise from simple policies that maximize reward without learning true inferential boundaries.

    Authors: We acknowledge that the precise mathematical forms of the stage-specific rewards were described at a high level but not formalized with equations or pseudocode. The clarify-stage reward penalizes fabrication while the reasoning-stage reward encourages resumption only after clarification; however, without the explicit expressions it is difficult for readers to evaluate whether the observed improvements reflect genuine boundary learning. In the revision we will insert the exact reward equations and a short pseudocode block (either in the main text or a new appendix) so that the reward design can be scrutinized and replicated. (An illustrative sketch of one possible reward scheme appears after this list.) revision: yes

  3. Referee: [Experiments / Baselines and Metrics] The paper compares GRIL to baselines but does not report the exact baseline implementations, hyperparameter matching, or statistical significance tests for the 30% task-success and 20% length reductions. This undermines the claim that the gains are attributable to the clarify/pause mechanism rather than training differences.

    Authors: We concur that greater transparency on baseline implementations and statistical rigor is required. The current manuscript references standard RLHF and CoT baselines but omits full hyperparameter tables and significance testing. In the revised manuscript we will (i) document the precise baseline codebases and training configurations, (ii) confirm that hyperparameters were matched to the extent possible, and (iii) add statistical significance results (e.g., paired t-tests or bootstrap confidence intervals) for the key metrics. These additions will allow readers to attribute performance differences more confidently to the GRIL framework. revision: yes
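
Since the exact reward equations are deferred to the revision, here is one illustrative stage-specific scheme, offered purely as an assumption for discussion: the tunable weights mirror the "stage-specific reward weights" flagged as free parameters in the ledger below, and none of the terms or values should be read as the paper's formulation.

    # Assumed, illustrative stage-specific rewards for a GRIL-like setup;
    # the terms and weights are guesses, not the paper's published equations.
    from dataclasses import dataclass

    @dataclass
    class RewardWeights:
        correct_pause: float = 1.0      # asked for clarification when a premise was truly missing
        missed_gap: float = -1.0        # answered although a premise was missing (fabrication risk)
        needless_ask: float = -0.5      # asked although the premises were already sufficient
        task_success: float = 1.0       # correct final answer in the grounded-reasoning stage
        hallucination: float = -1.0     # final answer relies on invented, unsupported facts
        length_penalty: float = -0.001  # small per-token cost to discourage padded responses

    def clarify_stage_reward(asked: bool, premise_missing: bool, w: RewardWeights) -> float:
        if premise_missing:
            return w.correct_pause if asked else w.missed_gap
        return w.needless_ask if asked else 0.0

    def reasoning_stage_reward(correct: bool, hallucinated: bool, n_tokens: int, w: RewardWeights) -> float:
        reward = w.task_success if correct else 0.0
        if hallucinated:
            reward += w.hallucination
        return reward + w.length_penalty * n_tokens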

Circularity Check

0 steps flagged

No circularity: GRIL is an empirical RL training method evaluated on held-out data

Full rationale

The paper defines GRIL as a multi-turn RL framework that decomposes reasoning into clarify/pause and grounded stages with custom rewards penalizing hallucinations. Claims of improved premise detection (up to 45%) and task success (30%) rest on experimental results from training and testing on GSM8K-Insufficient and MetaMATH-Insufficient, which are external benchmarks. No equations, derivations, or self-citations reduce the reported gains to fitted parameters or self-definitions by construction. The method and metrics are independently specified; success is measured via standard task accuracy and response length on held-out instances rather than tautologically.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the design of stage-specific rewards that penalize fabrication and the assumption that RL can instill premise detection; these are tuned elements not derived from first principles.

free parameters (1)
  • stage-specific reward weights
    Rewards for pausing, hallucination penalties, and task success are custom-designed and implicitly tuned for the reported gains.
axioms (1)
  • domain assumption: LLMs can acquire inferential boundary awareness through interactive RL with targeted rewards
    Foundational premise for GRIL's effectiveness on incomplete inputs.

pith-pipeline@v0.9.0 · 5538 in / 1128 out tokens · 89070 ms · 2026-05-10T02:58:40.765017+00:00 · methodology

