Process Supervision via Verbal Critique Improves Reasoning in Large Language Models
Pith reviewed 2026-05-09 21:33 UTC · model grok-4.3
The pith
Structured verbal critiques from stronger models let weaker LLMs reach 94.9% accuracy on GPQA Diamond and lift AIME 2025 scores by up to 63.3 points through iterative refinement, all without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Verbal Process Supervision (VPS) uses structured natural-language critique from a stronger supervisor to drive an iterative generate-critique-refine loop, establishing critique granularity as a fourth axis of inference-time scaling that improves reasoning performance across closed and open models on GPQA Diamond, AIME 2025, and LiveCodeBench without gradient updates or model training.
What carries the argument
The generate-critique-refine loop of Verbal Process Supervision (VPS), in which a stronger supervisor supplies structured natural-language critiques to guide the actor model's successive refinements within a round budget R.
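A minimal sketch of that loop, assuming a generic chat-completion client; the `call_model` stub, the prompt wording, and the APPROVED stopping convention are illustrative assumptions rather than the paper's released implementation.

```python
# Minimal sketch of a generate-critique-refine loop in the spirit of VPS.
# `call_model` is a stub for any chat-completion API; model names, prompt
# wording, and the APPROVED stopping rule are illustrative assumptions.

def call_model(model: str, prompt: str) -> str:
    """Send `prompt` to `model` and return the text reply (wire to your API)."""
    raise NotImplementedError

def vps_solve(problem: str, actor: str, supervisor: str, rounds: int = 4) -> str:
    # Initial generation by the (weaker) actor model.
    answer = call_model(actor, f"Solve step by step:\n{problem}")
    for _ in range(rounds):  # round budget R
        # Stronger supervisor produces a structured verbal critique.
        critique = call_model(
            supervisor,
            "Critique the solution step by step. Identify the first flawed "
            "step and explain why it is wrong. Reply APPROVED if correct.\n\n"
            f"Problem: {problem}\n\nSolution: {answer}",
        )
        if critique.strip().upper().startswith("APPROVED"):
            break  # supervisor accepts before the round budget is spent
        # Actor refines its answer conditioned on the critique.
        answer = call_model(
            actor,
            f"Revise the solution using the critique.\n\nProblem: {problem}\n\n"
            f"Previous solution: {answer}\n\nCritique: {critique}",
        )
    return answer
```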
If this is right
- VPS turns low-performing actor models into high performers on hard math and science benchmarks when a stronger supervisor is available.
- Performance gains increase as the capability gap between supervisor and actor widens, with an observed Pearson correlation of r = 0.90 (the sketch after this list shows how such a correlation is computed).
- At matched compute budgets VPS exceeds Reflexion by 8.5-12.1 points, and Self-Consistency@5 by 5.0 points on GPQA and 8.3 points on LiveCodeBench.
- The method fails to deliver gains when errors cannot be expressed in words, as seen in some code tasks, pointing to the need for hybrid verbal-executable supervision.
- VPS works across both closed and open models and requires no parameter updates, making it immediately usable with existing stronger models.
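The scaling claim is a plain Pearson correlation between per-configuration capability gaps and VPS gains. A minimal sketch of the computation, with placeholder numbers rather than the paper's data:

```python
# How a scaling claim like "Pearson r = 0.90" is computed: correlate the
# supervisor-actor capability gap with the VPS gain across configurations.
# The numbers below are illustrative placeholders, not the paper's data.
from statistics import correlation  # Python 3.10+

gap  = [5.0, 12.0, 20.0, 31.0, 45.0]   # hypothetical capability gaps (points)
gain = [1.2,  4.0,  9.5, 15.0, 24.0]   # hypothetical VPS gains (points)

r = correlation(gap, gain)             # Pearson product-moment correlation
print(f"Pearson r = {r:.2f}")
```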
Where Pith is reading between the lines
- Future systems could combine VPS with executable checkers to handle domains where verbal description is insufficient.
- The strong correlation with supervisor strength suggests that investing in better critique models may yield larger returns than scaling the actor alone.
- The loop structure could be adapted to non-reasoning tasks such as planning or creative generation where iterative feedback is useful.
- Because VPS is training-free, it lowers the barrier for organizations that lack the resources to fine-tune models but can access stronger ones at inference time.
Load-bearing premise
Reasoning errors must be expressible in natural language so that a stronger supervisor can supply critiques that reliably steer the actor toward better answers.
What would settle it
Run VPS on a task where the dominant errors are not linguistically describable, such as certain code-synthesis problems, and check whether performance gains disappear or reverse.
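A sketch of what that hybrid check could look like, under assumptions: the `run_tests` harness, the prompts, and the control flow below are hypothetical, not the paper's method, with unit-test execution supplying the error signal where verbal critique falls short.

```python
# Hypothetical hybrid verbal-executable supervision for code synthesis:
# unit tests catch errors that resist verbal description, and the failing
# log is fed back alongside the supervisor's critique. The harness, prompts,
# and control flow are illustrative assumptions, not the paper's method.
import os
import subprocess
import tempfile

def run_tests(candidate: str, tests: str) -> tuple[bool, str]:
    """Execute candidate code plus its tests; return (passed, combined log)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=30
        )
        return proc.returncode == 0, proc.stdout + proc.stderr
    finally:
        os.unlink(path)

def hybrid_round(problem, code, tests, actor, supervisor, call_model):
    """One round: executable check first, verbal critique only on failure."""
    passed, log = run_tests(code, tests)
    if passed:
        return code, True
    critique = call_model(
        supervisor,
        f"Critique this solution given the failing test log.\n"
        f"Problem: {problem}\nCode:\n{code}\nTest log:\n{log}",
    )
    revised = call_model(
        actor,
        f"Fix the code using the critique and the test log.\n"
        f"Problem: {problem}\nCode:\n{code}\nCritique: {critique}\nLog:\n{log}",
    )
    return revised, False
```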
Original abstract
Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Verbal Process Supervision (VPS), a training-free framework that applies structured natural-language critiques from a stronger supervisor LLM within an iterative generate-critique-refine loop (up to round budget R) to improve reasoning in weaker actor models. It reports three main empirical results: on GPQA Diamond, a GPT-5.4 (High) supervisor with GPT-5.4 (Low) actor reaches 94.9% at R=4, exceeding the prior 94.1% SOTA without any gradient updates; on AIME 2025, VPS rescues weak actors by lifting scores from 11.7-26.7% to 63.3-90.0%; and at matched compute, VPS outperforms Reflexion (+8.5 to +12.1 points) and Self-Consistency@5 (+5.0 pp on GPQA, +8.3 pp on LiveCodeBench). Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible, motivating hybrid methods.
Significance. If the results hold under full experimental scrutiny, the work is significant for establishing critique granularity as a distinct, practical axis of inference-time scaling alongside chain depth, sample breadth, and learned process reward models. The training-free rescue of weak actors, matched-compute gains over strong baselines, and explicit scaling correlation provide actionable evidence that external verbal supervision can be leveraged effectively, while the acknowledgment of limitations for non-verbalizable errors (e.g., code synthesis) offers a balanced foundation for future hybrid verbal-executable approaches.
minor comments (4)
- [Abstract] The claim that VPS 'surpasses the 94.1% state of the art' should explicitly name the prior work and confirm whether that baseline operates under comparable inference-time compute and model access.
- [Results] Reported accuracies (e.g., 94.9%, 63.3-90.0%) lack error bars, run counts, or statistical tests; adding these would strengthen the claims of outperforming Reflexion and Self-Consistency.
- [Methods] The exact prompt templates and formatting rules for generating structured verbal critiques are central to reproducibility yet appear only in passing; they should be provided verbatim in the main text or an appendix.
- [Discussion] The Pearson r=0.90 scaling result is cited without the underlying data points or a figure; including the scatter plot and the supervisor/actor model pairs used would make the correlation claim more transparent.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of Verbal Process Supervision (VPS), the reported results on GPQA Diamond, AIME 2025, and LiveCodeBench, and the recommendation for minor revision. The assessment correctly identifies the contribution as establishing critique granularity as a new inference-time scaling axis and notes the scaling behavior with supervisor-actor gap as well as the limitation for non-linguistically expressible errors.
Circularity Check
No significant circularity
full rationale
The paper is entirely empirical, reporting experimental results from a training-free generate-critique-refine loop on standard benchmarks (GPQA Diamond, AIME 2025, LiveCodeBench). It compares VPS against Reflexion and Self-Consistency at matched compute and notes scaling with supervisor-actor gap. No equations, derivations, fitted parameters, or self-referential definitions appear; all claims rest on observable performance metrics against external baselines. This satisfies the self-contained-against-benchmarks criterion for a score of 0.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Reasoning errors in LLMs are linguistically expressible and correctable via structured natural-language feedback from a stronger model.
Reference graph
Works this paper leans on
- [1] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
- [2] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
- [3]
- [4] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. arXiv preprint arXiv:2310.01798, 2023.
- [5] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024.
- [6] Dancheng Liu, Amir Nassereldine, Ziming Yang, Chenhui Xu, Yuting Hu, Jiajie Li, Utkarsh Kumar, Changjae Lee, and Jinjun Xiong. Large language models have intrinsic self-correction ability. arXiv preprint arXiv:2406.15673, 2024.
- [7] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
- [8] Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, and Shujian Huang. R-PRM: Reasoning-driven process reward modeling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 13438–13451, Suzhou, China, November 2025. Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.679.
- [9]
[10]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao
URLhttps://aclanthology.org/2025.emnlp-main.679/. Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neu- ral Information Processing Systems, volume 36, pages 8634–8652...
2025
- [11] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. doi: 10.18653/v1/2024.acl-long.510. arXiv preprint arXiv:2312.08935.
- [12] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023.
- [13] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, 2023. arXiv preprint arXiv:2305.10601.
- [14] Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. ProcessBench: Identifying process errors in mathematical reasoning. arXiv preprint arXiv:2412.06559, 2024.
Example supervisor critique from the paper: "Your final answer is incorrect. Revisit your case enumeration and check whether all parities are accounted for. Try a different counting approach."

Table 4. Accuracy (%) by round budget R for each supervisor|actor configuration:

| Bench. | Config | R=1 | R=2 | R=3 | R=4 | Peak | Peak R | Std. Dev. |
|--------|--------|-----|-----|-----|-----|------|--------|-----------|
| GPQA | GPT-5.4 (H\|L) | 87.9 | 94.4 | 93.9 | 94.9 | 94.9 | 4 | 3.0 |
| GPQA | GLM-5.1\|Nemotron | 56.6 | 57.1 | 56.6 | 61.1 | 61.1 | 4 | 1.9 |
| GPQA | Gemma 4\|GPT-OSS 20B | 68.2 | 71.7 | 73.2 | 67.7 | 73.2 | 3 | 2.3 |
| GPQA | GPT-OSS 120B\|20B | 67.7 | 63.1 | 71.1 | 72.2 | 72.2 | 4 | 3.5 |
| AIME | GPT-5.4 Nano (H\|L) | 63.3 | 83.3 | 90.0 | 90.0 | 90.0 | 3 | 10.9 |
| AIME | GLM-5.1\|Nemotron | 63.3 | 80.0 | 70.0 | … | | | |