pith. machine review for the scientific record

arxiv: 2604.19656 · v1 · submitted 2026-04-21 · 💻 cs.CL

Recognition: unknown

Pause or Fabricate? Training Language Models for Grounded Reasoning

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords: grounded reasoning · inferential boundary awareness · reinforcement learning · hallucination · premise detection · insufficient information · multi-turn RL

The pith

LLMs can be trained to detect missing information and pause rather than fabricate answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often generate confident but unreliable conclusions when inputs lack necessary facts, a failure the paper attributes to absent awareness of inferential boundaries rather than weak reasoning ability. GRIL addresses this by using multi-turn reinforcement learning to split the process into an initial stage that checks whether premises are sufficient and a later stage that only performs grounded reasoning once gaps are filled. Stage-specific rewards penalize hallucinations in both stages, training the model to stop proactively and request clarification instead of continuing with invented details. On math reasoning datasets constructed with insufficient information, this produces substantially higher rates of correct premise detection and task completion alongside shorter responses. The training also proves robust when user clarifications contain noise and transfers to problems outside the original training distribution.
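
A minimal sketch of what such a two-stage interaction loop could look like at inference time, written against a hypothetical generate/ask_user interface; the paper's actual prompts, output format, and turn limits are not given here, so every name below is an assumption rather than the authors' method.

    # Hedged sketch of a GRIL-style clarify-then-reason loop, not the authors' code.
    # `generate` and `ask_user` are hypothetical callables standing in for the policy
    # model and the (real or simulated) user who supplies missing premises.
    def clarify_then_reason(generate, ask_user, problem, max_turns=3):
        """Stage 1 pauses and asks when premises are missing; stage 2 answers."""
        history = []
        for turn in range(max_turns):
            reply = generate(problem, history)      # policy emits "ASK: ..." or "ANSWER: ..."
            if reply.startswith("ASK:"):
                question = reply[len("ASK:"):].strip()
                history.append((question, ask_user(question)))  # pause, then resume
            elif reply.startswith("ANSWER:"):
                return {"answer": reply[len("ANSWER:"):].strip(),
                        "turns": turn + 1, "history": history}
        return {"answer": None, "paused": True, "history": history}  # stop, don't fabricate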

Core claim

The central claim is that decomposing the reasoning process into a clarify-and-pause stage followed by a grounded reasoning stage, then applying stage-specific rewards within an interactive reinforcement learning loop, allows language models to acquire inferential boundary awareness so they reliably identify when premises are missing, refrain from fabrication, and achieve higher success on tasks with incomplete inputs.

What carries the argument

GRIL, a multi-turn reinforcement learning framework that decomposes reasoning into clarify-and-pause and grounded reasoning stages guided by rewards that penalize hallucinations.

Load-bearing premise

That custom stage-specific rewards can reliably instill inferential boundary awareness in LLMs without the model learning to game the rewards or overfitting to the artificially constructed insufficient-information datasets.

What would settle it

A new collection of incomplete reasoning problems on which GRIL-trained models still fabricate information at rates comparable to untrained models or show no gain in task success.

Original abstract

Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that LLMs exhibit ungrounded reasoning by fabricating information under incomplete inputs due to lacking inferential boundary awareness. It proposes GRIL, a multi-turn RL framework that decomposes reasoning into a clarify/pause stage (to detect missing premises) and a grounded reasoning stage, using stage-specific rewards to penalize hallucinations. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient report up to 45% gains in premise detection, 30% higher task success, and >20% shorter responses, with added robustness and generalization checks.

Significance. If the empirical gains are robust and not artifacts of the training setup, GRIL offers a concrete mechanism to instill boundary awareness via interactive RL, which could meaningfully improve reliability in reasoning tasks with incomplete information. The multi-stage decomposition and reward design are novel relative to standard RLHF or chain-of-thought approaches.

major comments (3)
  1. [Experiments / Dataset Construction] The abstract and experimental sections report quantitative gains on GSM8K-Insufficient and MetaMATH-Insufficient without detailing dataset construction (e.g., how premises are removed or insufficiency is verified). This is load-bearing because selection effects or artificial cues in these variants could inflate premise-detection and success metrics without demonstrating general boundary awareness. (A construction sketch illustrating this concern follows after the minor comments.)
  2. [GRIL Framework / Reward Design] Stage-specific reward formulations are referenced but not specified exactly (e.g., the precise penalty for hallucination in the clarify stage or the reward for resuming after clarification). Without the equations or pseudocode, it is impossible to assess whether the reported 45% premise-detection lift could arise from simple policies that maximize reward without learning true inferential boundaries.
  3. [Experiments / Baselines and Metrics] The paper compares GRIL to baselines but does not report the exact baseline implementations, hyperparameter matching, or statistical significance tests for the 30% task-success and 20% length reductions. This undermines the claim that the gains are attributable to the clarify/pause mechanism rather than training differences.
minor comments (2)
  1. [Method] Notation for the two stages (clarify/pause vs. grounded reasoning) is introduced in the abstract but should be formalized with consistent symbols or diagrams in §3 to aid reproducibility.
  2. [Additional Analyses] The generalization claim to out-of-distribution tasks is stated but lacks a dedicated table or figure showing per-task breakdowns; adding one would strengthen the evidence.
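
To make major comment 1 concrete: the paper does not spell out how the GSM8K-Insufficient and MetaMATH-Insufficient variants were built, so the following is one plausible construction, not the authors' documented procedure. It withholds a numeric premise from a word problem and records which clarification would restore sufficiency; a real pipeline would also need a verification step (e.g., confirming that a solver fails on the variant), which this sketch omits.

    # One possible (assumed, not documented) construction of an "insufficient" variant:
    # drop a numeric premise and keep it aside as the clarification target.
    import re

    def make_insufficient(problem: str):
        """Remove the last sentence containing a number; return the variant and the withheld premise."""
        sentences = [s.strip() for s in problem.split(".") if s.strip()]
        numeric = [i for i, s in enumerate(sentences) if re.search(r"\d", s)]
        if len(numeric) < 2:
            return None  # too few numeric premises to withhold one safely
        drop = numeric[-1]
        variant = ". ".join(s for i, s in enumerate(sentences) if i != drop)
        if not variant.endswith(("?", ".", "!")):
            variant += "."
        return {"insufficient_problem": variant, "withheld_premise": sentences[drop]}

    example = ("Tom buys 3 boxes of pens. Each box holds 12 pens. "
               "He gives away 5 pens. How many pens does he keep?")
    print(make_insufficient(example))  # withholds "He gives away 5 pens"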

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The comments identify areas where additional detail will improve clarity and reproducibility. We address each point below and commit to incorporating the requested information in the revised manuscript.

Point-by-point responses
  1. Referee: [Experiments / Dataset Construction] The abstract and experimental sections report quantitative gains on GSM8K-Insufficient and MetaMATH-Insufficient without detailing dataset construction (e.g., how premises are removed or insufficiency is verified). This is load-bearing because selection effects or artificial cues in these variants could inflate premise-detection and success metrics without demonstrating general boundary awareness.

    Authors: We agree that a more explicit description of dataset construction is necessary for full reproducibility and to rule out potential artifacts. While the current manuscript briefly introduces the insufficient variants, it does not provide step-by-step procedures for premise removal or verification. In the revised version we will add a dedicated subsection (likely in Section 4 or an appendix) that specifies the exact removal criteria, the verification protocol used to confirm insufficiency, and representative examples of original and modified problems. This addition will directly address concerns about selection effects and strengthen the interpretation of the reported gains as evidence of boundary awareness. revision: yes

  2. Referee: [GRIL Framework / Reward Design] Stage-specific reward formulations are referenced but not specified exactly (e.g., the precise penalty for hallucination in the clarify stage or the reward for resuming after clarification). Without the equations or pseudocode, it is impossible to assess whether the reported 45% premise-detection lift could arise from simple policies that maximize reward without learning true inferential boundaries.

    Authors: We acknowledge that the precise mathematical forms of the stage-specific rewards were described at a high level but not formalized with equations or pseudocode. The clarify-stage reward penalizes fabrication while the reasoning-stage reward encourages resumption only after clarification; however, without the explicit expressions it is difficult for readers to evaluate whether the observed improvements reflect genuine boundary learning. In the revision we will insert the exact reward equations and a short pseudocode block (either in the main text or a new appendix) so that the reward design can be scrutinized and replicated. (An illustrative sketch of one possible reward scheme appears after this list.) revision: yes

  3. Referee: [Experiments / Baselines and Metrics] The paper compares GRIL to baselines but does not report the exact baseline implementations, hyperparameter matching, or statistical significance tests for the 30% task-success and 20% length reductions. This undermines the claim that the gains are attributable to the clarify/pause mechanism rather than training differences.

    Authors: We concur that greater transparency on baseline implementations and statistical rigor is required. The current manuscript references standard RLHF and CoT baselines but omits full hyperparameter tables and significance testing. In the revised manuscript we will (i) document the precise baseline codebases and training configurations, (ii) confirm that hyperparameters were matched to the extent possible, and (iii) add statistical significance results (e.g., paired t-tests or bootstrap confidence intervals) for the key metrics. These additions will allow readers to attribute performance differences more confidently to the GRIL framework. revision: yes
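
Since the exact reward equations are deferred to the revision, here is one illustrative stage-specific scheme, offered purely as an assumption for discussion: the tunable weights mirror the "stage-specific reward weights" flagged as free parameters in the ledger below, and none of the terms or values should be read as the paper's formulation.

    # Assumed, illustrative stage-specific rewards for a GRIL-like setup;
    # the terms and weights are guesses, not the paper's published equations.
    from dataclasses import dataclass

    @dataclass
    class RewardWeights:
        correct_pause: float = 1.0      # asked for clarification when a premise was truly missing
        missed_gap: float = -1.0        # answered although a premise was missing (fabrication risk)
        needless_ask: float = -0.5      # asked although the premises were already sufficient
        task_success: float = 1.0       # correct final answer in the grounded-reasoning stage
        hallucination: float = -1.0     # final answer relies on invented, unsupported facts
        length_penalty: float = -0.001  # small per-token cost to discourage padded responses

    def clarify_stage_reward(asked: bool, premise_missing: bool, w: RewardWeights) -> float:
        if premise_missing:
            return w.correct_pause if asked else w.missed_gap
        return w.needless_ask if asked else 0.0

    def reasoning_stage_reward(correct: bool, hallucinated: bool, n_tokens: int, w: RewardWeights) -> float:
        reward = w.task_success if correct else 0.0
        if hallucinated:
            reward += w.hallucination
        return reward + w.length_penalty * n_tokens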

Circularity Check

0 steps flagged

No circularity: GRIL is an empirical RL training method evaluated on held-out data

Full rationale

The paper defines GRIL as a multi-turn RL framework that decomposes reasoning into clarify/pause and grounded stages with custom rewards penalizing hallucinations. Claims of improved premise detection (up to 45%) and task success (30%) rest on experimental results from training and testing on GSM8K-Insufficient and MetaMATH-Insufficient, which are external benchmarks. No equations, derivations, or self-citations reduce the reported gains to fitted parameters or self-definitions by construction. The method and metrics are independently specified; success is measured via standard task accuracy and response length on held-out instances rather than tautologically.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the design of stage-specific rewards that penalize fabrication and the assumption that RL can instill premise detection; these are tuned elements not derived from first principles.

free parameters (1)
  • stage-specific reward weights
    Rewards for pausing, hallucination penalties, and task success are custom-designed and implicitly tuned for the reported gains.
axioms (1)
  • domain assumption: LLMs can acquire inferential boundary awareness through interactive RL with targeted rewards
    Foundational premise for GRIL's effectiveness on incomplete inputs.

pith-pipeline@v0.9.0 · 5538 in / 1128 out tokens · 89070 ms · 2026-05-10T02:58:40.765017+00:00 · methodology

