Pause or Fabricate? Training Language Models for Grounded Reasoning
Pith reviewed 2026-05-10 02:58 UTC · model grok-4.3
The pith
LLMs can be trained to detect missing information and pause rather than fabricate answers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that decomposing reasoning into a clarify-and-pause stage followed by a grounded reasoning stage, and applying stage-specific rewards within an interactive reinforcement learning loop, gives language models inferential boundary awareness: they reliably identify when premises are missing, refrain from fabrication, and succeed more often on tasks with incomplete inputs.
What carries the argument
GRIL, a multi-turn reinforcement learning framework that decomposes reasoning into clarify-and-pause and grounded reasoning stages guided by rewards that penalize hallucinations.
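The two-stage loop the framework describes can be sketched in miniature. Everything below (the function names, the reward values, the toy premise representation) is illustrative and assumed, not the paper's actual implementation:

```python
# Illustrative sketch of a GRIL-style two-stage episode with a toy reward
# scheme; the paper's actual reward formulation is not reproduced here.

def clarify_stage(question, known_facts):
    """Stage 1: decide whether the premises suffice; pause and ask if not."""
    missing = [p for p in question["required_premises"] if p not in known_facts]
    if missing:
        return {"action": "pause", "ask_for": missing}
    return {"action": "proceed"}

def grounded_reasoning_stage(question, known_facts):
    """Stage 2: solve only once all required premises are established."""
    return sum(known_facts[p] for p in question["required_premises"])

def run_episode(question, known_facts, oracle):
    """Multi-turn loop: pause, obtain clarification, resume, with a reward
    credited to each stage separately."""
    reward = 0.0
    decision = clarify_stage(question, known_facts)
    if decision["action"] == "pause":
        reward += 1.0                               # clarify-stage reward: correct pause
        for premise in decision["ask_for"]:
            known_facts[premise] = oracle(premise)  # simulated user reply
    answer = grounded_reasoning_stage(question, known_facts)
    if answer == question["gold"]:
        reward += 1.0                               # reasoning-stage reward: correct answer
    return answer, reward

# Toy problem missing one premise ("b"); the oracle stands in for the user.
question = {"required_premises": ["a", "b"], "gold": 7}
answer, reward = run_episode(question, {"a": 3}, oracle=lambda p: 4)
```

The point of the sketch is only the control flow: the episode earns reward at two distinct points, which is what makes the reward "stage-specific" rather than a single terminal signal.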
Load-bearing premise
That custom stage-specific rewards can reliably instill inferential boundary awareness in LLMs without the model learning to game the rewards or overfitting to the artificially constructed insufficient-premise datasets.
What would settle it
A new collection of incomplete reasoning problems on which GRIL-trained models still fabricate information at rates comparable to untrained models or show no gain in task success.
read the original abstract
Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs exhibit ungrounded reasoning by fabricating information under incomplete inputs due to lacking inferential boundary awareness. It proposes GRIL, a multi-turn RL framework that decomposes reasoning into a clarify/pause stage (to detect missing premises) and a grounded reasoning stage, using stage-specific rewards to penalize hallucinations. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient report up to 45% gains in premise detection, 30% higher task success, and >20% shorter responses, with added robustness and generalization checks.
Significance. If the empirical gains are robust and not artifacts of the training setup, GRIL offers a concrete mechanism to instill boundary awareness via interactive RL, which could meaningfully improve reliability in reasoning tasks with incomplete information. The multi-stage decomposition and reward design are novel relative to standard RLHF or chain-of-thought approaches.
major comments (3)
- [Experiments / Dataset Construction] The abstract and experimental sections report quantitative gains on GSM8K-Insufficient and MetaMATH-Insufficient without detailing dataset construction (e.g., how premises are removed or insufficiency is verified). This is load-bearing because selection effects or artificial cues in these variants could inflate premise-detection and success metrics without demonstrating general boundary awareness.
- [GRIL Framework / Reward Design] Stage-specific reward formulations are referenced but not specified exactly (e.g., the precise penalty for hallucination in the clarify stage or the reward for resuming after clarification). Without the equations or pseudocode, it is impossible to assess whether the reported 45% premise-detection lift could arise from simple policies that maximize reward without learning true inferential boundaries.
- [Experiments / Baselines and Metrics] The paper compares GRIL to baselines but does not report the exact baseline implementations, hyperparameter matching, or statistical significance tests for the 30% task-success and 20% length reductions. This undermines the claim that the gains are attributable to the clarify/pause mechanism rather than training differences.
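The reward-design concern above can be made concrete: even a crude clarify-stage reward of the following shape would produce the qualitative behavior the abstract describes, which is exactly why the exact formulation matters. The function and its weights are a hypothetical stand-in, not the paper's reward:

```python
def clarify_stage_reward(premises_sufficient, model_paused, model_fabricated,
                         pause_bonus=1.0, fabrication_penalty=-1.0):
    """Hypothetical clarify-stage reward: pay for pausing exactly when
    premises are missing, penalize fabricating an answer in that case, and
    pay for proceeding when premises are sufficient."""
    if not premises_sufficient:
        if model_fabricated:
            return fabrication_penalty
        return pause_bonus if model_paused else 0.0
    # Premises sufficient: pausing is an unnecessary hedge and earns nothing.
    return 0.0 if model_paused else pause_bonus
```

Note that under these weights a policy that pauses on every input earns the bonus whenever inputs are insufficient and merely forfeits it otherwise, so it is never penalized; ruling out such degenerate policies requires the precise weights and conditioning that the manuscript does not yet report.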
minor comments (2)
- [Method] Notation for the two stages (clarify/pause vs. grounded reasoning) is introduced in the abstract but should be formalized with consistent symbols or diagrams in §3 to aid reproducibility.
- [Additional Analyses] The generalization claim to out-of-distribution tasks is stated but lacks a dedicated table or figure showing per-task breakdowns; adding one would strengthen the evidence.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. The comments identify areas where additional detail will improve clarity and reproducibility. We address each point below and commit to incorporating the requested information in the revised manuscript.
read point-by-point responses
Referee: [Experiments / Dataset Construction] The abstract and experimental sections report quantitative gains on GSM8K-Insufficient and MetaMATH-Insufficient without detailing dataset construction (e.g., how premises are removed or insufficiency is verified). This is load-bearing because selection effects or artificial cues in these variants could inflate premise-detection and success metrics without demonstrating general boundary awareness.
Authors: We agree that a more explicit description of dataset construction is necessary for full reproducibility and to rule out potential artifacts. While the current manuscript briefly introduces the insufficient variants, it does not provide step-by-step procedures for premise removal or verification. In the revised version we will add a dedicated subsection (likely in Section 4 or an appendix) that specifies the exact removal criteria, the verification protocol used to confirm insufficiency, and representative examples of original and modified problems. This addition will directly address concerns about selection effects and strengthen the interpretation of the reported gains as evidence of boundary awareness. revision: yes
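For concreteness, one plausible construction of an "insufficient" variant (the paper's actual procedure is not described, so this is purely an assumed sketch) deletes the sentence carrying one numeric premise:

```python
import re

def make_insufficient(problem_text, drop_index=0):
    """Hypothetical premise-removal: delete the sentence containing the
    drop_index-th number, yielding a problem missing one needed premise."""
    sentences = [s.strip() for s in problem_text.split(".") if s.strip()]
    numbered = [i for i, s in enumerate(sentences) if re.search(r"\d", s)]
    removed = sentences.pop(numbered[drop_index])
    return ". ".join(sentences), removed

problem = "Tom has 3 apples. He buys 4 more. How many apples does he have?"
insufficient, removed = make_insufficient(problem, drop_index=1)
# Drops "He buys 4 more", leaving a question that cannot be answered.
```

A construction subsection of the kind the authors promise would need to go beyond this: verifying that the residual problem is genuinely unanswerable (not recoverable from context) and that the deletion leaves no surface cue a model could exploit.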
Referee: [GRIL Framework / Reward Design] Stage-specific reward formulations are referenced but not specified exactly (e.g., the precise penalty for hallucination in the clarify stage or the reward for resuming after clarification). Without the equations or pseudocode, it is impossible to assess whether the reported 45% premise-detection lift could arise from simple policies that maximize reward without learning true inferential boundaries.
Authors: We acknowledge that the precise mathematical forms of the stage-specific rewards were described at a high level but not formalized with equations or pseudocode. The clarify-stage reward penalizes fabrication while the reasoning-stage reward encourages resumption only after clarification; however, without the explicit expressions it is difficult for readers to evaluate whether the observed improvements reflect genuine boundary learning. In the revision we will insert the exact reward equations and a short pseudocode block (either in the main text or a new appendix) so that the reward design can be scrutinized and replicated. revision: yes
Referee: [Experiments / Baselines and Metrics] The paper compares GRIL to baselines but does not report the exact baseline implementations, hyperparameter matching, or statistical significance tests for the 30% task-success and 20% length reductions. This undermines the claim that the gains are attributable to the clarify/pause mechanism rather than training differences.
Authors: We concur that greater transparency on baseline implementations and statistical rigor is required. The current manuscript references standard RLHF and CoT baselines but omits full hyperparameter tables and significance testing. In the revised manuscript we will (i) document the precise baseline codebases and training configurations, (ii) confirm that hyperparameters were matched to the extent possible, and (iii) add statistical significance results (e.g., paired t-tests or bootstrap confidence intervals) for the key metrics. These additions will allow readers to attribute performance differences more confidently to the GRIL framework. revision: yes
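A minimal version of the bootstrap check the authors commit to, run here on hypothetical per-problem outcomes (the pairing and the solve counts are invented for illustration):

```python
import random

def bootstrap_ci(treatment, baseline, n_resamples=2000, seed=0):
    """95% bootstrap CI for the difference in success rates between paired
    per-problem outcome vectors (1 = solved, 0 = not solved)."""
    rng = random.Random(seed)
    n = len(treatment)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample problems
        diffs.append(sum(treatment[i] - baseline[i] for i in idx) / n)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

# Hypothetical outcomes: treated model solves 70/100, baseline 40/100.
gril = [1] * 70 + [0] * 30
base = [1] * 40 + [0] * 60
lo, hi = bootstrap_ci(gril, base)
# If the interval excludes 0, the gain is unlikely to be resampling noise.
```

Reporting an interval of this kind (or a paired t-test over the same per-problem differences) is cheap and would directly support the 30% task-success claim.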
Circularity Check
No circularity: GRIL is an empirical RL training method evaluated on held-out data
full rationale
The paper defines GRIL as a multi-turn RL framework that decomposes reasoning into clarify/pause and grounded stages, with custom rewards penalizing hallucinations. Claims of improved premise detection (up to 45%) and task success (30%) rest on experimental results from training and testing on GSM8K-Insufficient and MetaMATH-Insufficient, which are external benchmarks. The reported gains do not reduce to fitted parameters or definitional artifacts: the method and the evaluation metrics are specified independently, and success is measured via standard task accuracy and response length on held-out instances rather than by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- stage-specific reward weights
axioms (1)
- domain assumption: LLMs can acquire inferential boundary awareness through interactive RL with targeted rewards