pith. machine review for the scientific record.

arxiv: 2604.21611 · v1 · submitted 2026-04-23 · 💻 cs.CL · cs.AI


Process Supervision via Verbal Critique Improves Reasoning in Large Language Models


Pith reviewed 2026-05-09 21:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: verbal process supervision · inference-time scaling · LLM reasoning · critique refinement · process supervision · GPQA · AIME

The pith

Structured verbal critiques from stronger models let weaker LLMs reach 94.9 percent accuracy on GPQA Diamond and lift AIME scores by up to 63 points through iterative refinement without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Verbal Process Supervision as a training-free method that adds a new dimension to inference-time scaling for LLM reasoning. A stronger supervisor provides natural-language critiques that steer a weaker actor through repeated generate-critique-refine cycles up to a fixed round budget. On GPQA Diamond this reaches 94.9 percent, beating the prior state of the art, while on AIME 2025 it rescues weak models from single-digit or low-teen scores to 63.3-90.0 percent. At matched compute the method beats Reflexion by 8.5 to 12.1 points and Self-Consistency@5 by 5.0 points on GPQA and 8.3 points on LiveCodeBench, showing that the granularity of external verbal feedback is the active ingredient. Gains scale with the supervisor-actor capability gap and drop when errors cannot be expressed clearly in language.

Core claim

Verbal Process Supervision (VPS) uses structured natural-language critique from a stronger supervisor to drive an iterative generate-critique-refine loop, establishing critique granularity as a fourth axis of inference-time scaling that improves reasoning performance across closed and open models on GPQA Diamond, AIME 2025, and LiveCodeBench without gradient updates or model training.

What carries the argument

The generate-critique-refine loop of Verbal Process Supervision (VPS), in which a stronger supervisor supplies structured natural-language critiques to guide the actor model's successive refinements within a round budget R.
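In outline, this loop can be sketched as follows. `actor` and `supervisor` are placeholders for the two model calls, and the acceptance check and call signatures are assumptions for illustration, not the paper's exact prompt templates:

```python
def vps_loop(question, actor, supervisor, rounds=4):
    """Run up to `rounds` generate-critique-refine cycles.

    actor(question, critique) -> candidate answer (critique is None in round 1)
    supervisor(question, answer) -> (accepted, verbal critique)
    """
    critique = None
    answer = None
    for _ in range(rounds):
        # Generate in round 1; refine against the latest critique afterwards.
        answer = actor(question, critique)
        # The stronger model renders a structured natural-language critique.
        accepted, critique = supervisor(question, answer)
        if accepted:
            break  # supervisor signals no remaining errors
    return answer
```

The round budget R bounds cost: each round is one actor call plus one supervisor call, which is what makes the matched-compute comparisons against Reflexion and self-consistency possible.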

If this is right

  • VPS turns low-performing actor models into high performers on hard math and science benchmarks when a stronger supervisor is available.
  • Performance gains increase as the capability gap between supervisor and actor widens, with observed Pearson correlation of 0.90.
  • At equal compute budgets VPS exceeds Reflexion by 8.5-12.1 points, and Self-Consistency@5 by 5.0 points on GPQA and 8.3 points on LiveCodeBench.
  • The method fails to deliver gains when errors cannot be expressed in words, as seen in some code tasks, pointing to the need for hybrid verbal-executable supervision.
  • VPS works across both closed and open models and requires no parameter updates, making it immediately usable with existing stronger models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future systems could combine VPS with executable checkers to handle domains where verbal description is insufficient.
  • The strong correlation with supervisor strength suggests that investing in better critique models may yield larger returns than scaling the actor alone.
  • The loop structure could be adapted to non-reasoning tasks such as planning or creative generation where iterative feedback is useful.
  • Because VPS is training-free, it lowers the barrier for organizations that lack the resources to fine-tune models but can access stronger ones at inference time.

Load-bearing premise

Reasoning errors must be expressible in natural language so that a stronger supervisor can supply critiques that reliably steer the actor toward better answers.

What would settle it

Run VPS on a task where the dominant errors are not linguistically describable, such as certain code-synthesis problems, and check whether performance gains disappear or reverse.
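One way to score such an experiment, assuming per-task (baseline, VPS) accuracies are available. The task splits and numbers below are illustrative, loosely echoing the paper's AIME and LiveCodeBench figures rather than reproducing them:

```python
def mean_gain(results):
    """Average (VPS - baseline) gain in points over a task set."""
    return sum(vps - base for base, vps in results) / len(results)

# Hypothetical per-task (baseline, VPS) accuracies in percent.
verbal_tasks = [(26.7, 90.0), (11.7, 63.3)]     # e.g. AIME-style math
nonverbal_tasks = [(38.5, 41.0), (40.0, 38.0)]  # e.g. code synthesis

# The load-bearing premise predicts a large gap between these two means;
# a near-zero or negative mean on the non-verbalizable split would
# support the paper's call for hybrid verbal-executable supervision.
gap = mean_gain(verbal_tasks) - mean_gain(nonverbal_tasks)
```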

Figures

Figures reproduced from arXiv: 2604.21611 by Hao-Yuan Chen.

Figure 1. Three-way matched-compute baseline comparison across all three benchmarks.

Figure 2. Round count ablation on GPQA Diamond under VPS.

Figure 3. Round count ablation on AIME 2025.

Figure 4. Round count ablation on LiveCodeBench V6. GPT-5.4 Mini improves from 38.5% at …

Figure 5. Panel (b): Best VPS score vs. actor baseline. Points above y = x indicate improvement; weak-actor configs rise most clearly above the diagonal. Panel (c): Supervisor headroom vs. VPS gain (r = +0.90); headroom is the primary predictor of benefit. Panel (f): Round-to-round accuracy std. dev.: AIME rescue configs are most volatile, suggesting the most productive regime is also the least stable.

Figure 6. Headroom–gain summary by regime. Green: rescue (H > 50 pp); blue: marginal; red: degradation; grey: code domain boundary. Gain ∆ = best VPS − actor.
read the original abstract

Inference-time scaling for LLM reasoning has focused on three axes: chain depth, sample breadth, and learned step-scorers (PRMs). We introduce a fourth axis, granularity of external verbal supervision, via Verbal Process Supervision (VPS), a training-free framework that uses structured natural-language critique from a stronger supervisor to guide an iterative generate-critique-refine loop up to a round budget R. Across GPQA Diamond, AIME 2025, and LiveCodeBench V6 (covering both closed and open models), VPS yields three key results. First, on GPQA Diamond, GPT-5.4 (High) | GPT-5.4 (Low) reaches 94.9% at R=4, surpassing the 94.1% state of the art without gradient updates. Second, on AIME 2025, VPS enables strong weak-actor rescue, boosting scores from 11.7-26.7% to 63.3-90.0% (up to +63.3 points). Third, at matched compute, VPS outperforms Reflexion by +8.5 to +12.1 points and Self-Consistency@5 by +5.0 pp (GPQA) and +8.3 pp (LiveCodeBench), isolating critique granularity as the key driver. Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible (e.g., code synthesis), motivating hybrid verbal-executable methods. These results establish critique granularity as a new axis of inference-time scaling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript introduces Verbal Process Supervision (VPS), a training-free framework that applies structured natural-language critiques from a stronger supervisor LLM within an iterative generate-critique-refine loop (up to round budget R) to improve reasoning in weaker actor models. It reports three main empirical results: on GPQA Diamond, a GPT-5.4 (High) supervisor with GPT-5.4 (Low) actor reaches 94.9% at R=4, exceeding the prior 94.1% SOTA without any gradient updates; on AIME 2025, VPS rescues weak actors by lifting scores from 11.7-26.7% to 63.3-90.0%; and at matched compute, VPS outperforms Reflexion (+8.5 to +12.1 points) and Self-Consistency@5 (+5.0 pp on GPQA, +8.3 pp on LiveCodeBench). Performance scales with the supervisor-actor capability gap (Pearson r=0.90) and degrades when errors are not linguistically expressible, motivating hybrid methods.

Significance. If the results hold under full experimental scrutiny, the work is significant for establishing critique granularity as a distinct, practical axis of inference-time scaling alongside chain depth, sample breadth, and learned process reward models. The training-free rescue of weak actors, matched-compute gains over strong baselines, and explicit scaling correlation provide actionable evidence that external verbal supervision can be leveraged effectively, while the acknowledgment of limitations for non-verbalizable errors (e.g., code synthesis) offers a balanced foundation for future hybrid verbal-executable approaches.

minor comments (4)
  1. [Abstract] Abstract: the statement that VPS 'surpasses the 94.1% state of the art' should explicitly name the prior work and confirm whether that baseline operates under comparable inference-time compute or model access.
  2. [Results] Results section: reported accuracies (e.g., 94.9%, 63.3-90.0%) lack error bars, number of runs, or statistical tests; adding these would strengthen claims of outperformance over Reflexion and Self-Consistency.
  3. [Methods] Experimental protocol: the exact prompt templates and formatting rules for generating structured verbal critiques are central to reproducibility yet appear only in passing; they should be provided verbatim in the main text or appendix.
  4. [Discussion] Discussion: the Pearson r=0.90 scaling result is cited without the underlying data points or figure; including the scatter plot and supervisor/actor model pairs used would make the correlation claim more transparent.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of Verbal Process Supervision (VPS), the reported results on GPQA Diamond, AIME 2025, and LiveCodeBench, and the recommendation for minor revision. The assessment correctly identifies the contribution as establishing critique granularity as a new inference-time scaling axis and notes the scaling behavior with supervisor-actor gap as well as the limitation for non-linguistically expressible errors.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is entirely empirical, reporting experimental results from a training-free generate-critique-refine loop on standard benchmarks (GPQA Diamond, AIME 2025, LiveCodeBench). It compares VPS against Reflexion and Self-Consistency at matched compute and notes scaling with supervisor-actor gap. No equations, derivations, fitted parameters, or self-referential definitions appear; all claims rest on observable performance metrics against external baselines. This satisfies the self-contained-against-benchmarks criterion for a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No mathematical derivations present. The central claim rests on the untested domain assumption that verbal critiques can effectively express and correct reasoning errors.

axioms (1)
  • domain assumption Reasoning errors in LLMs are linguistically expressible and correctable via structured natural-language feedback from a stronger model.
    Paper notes degradation when errors are not linguistically expressible, making this assumption load-bearing for the method's applicability.

pith-pipeline@v0.9.0 · 5583 in / 1292 out tokens · 42242 ms · 2026-05-09T21:33:22.350880+00:00 · methodology

