Step-wise Rubric Rewards for LLM Reasoning

Baobao Chang; Haozhe Zhao; Jiaqi Wang; Kean Shi; Liang Chen; Minghao Ye; Nan Duan; Ruoyu Wu; Shuai Dong; Weichu Xie

arxiv: 2605.17291 · v1 · pith:DQNRXODZnew · submitted 2026-05-17 · 💻 cs.LG

Weichu Xie, Haozhe Zhao, Wenpu Liu, Yongfu Zhu, Liang Chen, Minghao Ye, Zirong Chen, Yuqi Xu, Shuai Dong, Ziyue Wang, Xinbo Xu, Kean Shi, Ruoyu Wu, Xiaoying Zhang, Wenqi Shao, Baobao Chang, Nan Duan, Jiaqi Wang

Pith reviewed 2026-05-20 13:26 UTC · model grok-4.3

classification 💻 cs.LG

keywords reinforcement learning · LLM reasoning · rubric rewards · step-wise supervision · mathematical reasoning · faithful reasoning · self-correction

The pith

Step-wise Rubric Rewards assign supervision to individual reasoning steps to fix aggregation problems in LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for language models typically rewards only whether the final answer is correct, leaving intermediate reasoning steps without targeted feedback. Rubric-based methods score entire responses against structured criteria but still collapse everything into one scalar reward, which can reward incorrect steps inside correct answers and penalize correct steps inside wrong answers. This paper introduces Step-wise Rubrics as Rewards (SRaR), which uses an LLM judge to link each rubric item to a particular reasoning step, normalizes the per-step scores across multiple rollouts so only steps whose quality actually varies drive learning, and combines the step rewards with the final outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. The approach is supported by a 16K-problem rubric dataset created through contrastive distillation from correct and flawed reasoning paths. Experiments across six mathematical reasoning benchmarks show higher accuracy, increased faithful reasoning, and reduced self-correction looping compared with prior rubric aggregation methods.

Core claim

SRaR uses an LLM judge to attribute each rubric item to a specific reasoning step, normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable.

What carries the argument

The LLM-judge attribution of rubric items to individual reasoning steps, combined with per-step score normalization across rollouts and a decoupled advantage estimator for merging with outcome rewards.
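
As a rough illustration of the normalization idea, the sketch below z-scores each step's rubric score across the rollouts that contain that step and emits zero when the scores do not vary. The alignment of steps by index, the use of the population standard deviation, and the names `normalize_step_scores` and `eps` are assumptions made for illustration; the summary above does not give the paper's exact formula.

```python
from statistics import mean, pstdev


def normalize_step_scores(step_scores, eps=1e-8):
    """Normalize rubric scores per step across a group of rollouts.

    step_scores[i][k] is the rubric score of step k in rollout i.
    Rollouts may have different numbers of steps.
    Assumption: steps are aligned by index and z-scored within the
    group of rollouts that actually contain step k.
    """
    normalized = [[0.0] * len(scores) for scores in step_scores]
    max_steps = max((len(scores) for scores in step_scores), default=0)
    for k in range(max_steps):
        group = [i for i, scores in enumerate(step_scores) if len(scores) > k]
        values = [step_scores[i][k] for i in group]
        mu = mean(values)
        sigma = pstdev(values)
        if sigma < eps:
            # Every rollout scored the same on this step: no learning signal.
            continue
        for i in group:
            normalized[i][k] = (step_scores[i][k] - mu) / sigma
    return normalized


# Toy group of three rollouts (hypothetical scores).
rollouts = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0],
]
norm = normalize_step_scores(rollouts)
```

In this toy group, step 0 and step 2 receive zero signal because all rollouts agree on them, while step 1 gets a positive value for the two rollouts that scored 1.0 and a negative value for the one that scored 0.0.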

If this is right

Improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B across six mathematical reasoning benchmarks.
Raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%.
Reduces self-correction looping from 48.1% to 26.5%.
Avoids the observed pattern where 18.2% of steps in correct-answer responses are wrong yet positively rewarded and 49.9% of steps in incorrect-answer responses are correct yet penalized.
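
The two diagnostic percentages in the last item can be read as simple counting statistics. The sketch below shows one way to compute them, assuming each response carries a final-answer flag and a list of per-step correctness labels (for example from a judge). The function name, the data layout, and the toy data are hypothetical and are not taken from the paper.

```python
def step_mismatch_rates(responses):
    """Return (rewarded_wrong, penalized_right).

    responses: list of (answer_correct, step_labels) pairs, where
    step_labels is a list of booleans (True means the step is correct).
    Assumption: a response-level scalar reward is positive for every step
    of a correct-answer response and negative for every step of an
    incorrect-answer response.
    """
    wrong_in_correct = total_in_correct = 0
    right_in_incorrect = total_in_incorrect = 0
    for answer_correct, step_labels in responses:
        if answer_correct:
            total_in_correct += len(step_labels)
            wrong_in_correct += sum(1 for ok in step_labels if not ok)
        else:
            total_in_incorrect += len(step_labels)
            right_in_incorrect += sum(1 for ok in step_labels if ok)
    rewarded_wrong = wrong_in_correct / total_in_correct if total_in_correct else 0.0
    penalized_right = right_in_incorrect / total_in_incorrect if total_in_incorrect else 0.0
    return rewarded_wrong, penalized_right


# Toy data, invented for illustration only.
toy = [
    (True, [True, True, False, True]),
    (False, [True, False, True, False]),
    (False, [True, True]),
]
rewarded_wrong, penalized_right = step_mismatch_rates(toy)
```

On the toy data, one of four steps in the correct-answer response is wrong, and four of six steps in the incorrect-answer responses are correct.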

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The contrastive distillation process used to build the 16K-problem rubric dataset could be reused to create targeted training signals for other multi-step tasks such as code generation or scientific proof construction.
Decoupling step-wise and outcome advantages may stabilize training in reinforcement learning settings that combine several different reward components beyond reasoning.
Applying the same attribution technique outside mathematical domains could test whether step-level supervision reduces reward hacking in areas like long-form writing or multi-turn dialogue.
If future judge models become more accurate at step attribution, the framework would automatically deliver stronger per-step signals without requiring changes to the training pipeline.

Load-bearing premise

The LLM judge can reliably map each rubric item to the correct reasoning step without frequent or systematic attribution errors.

What would settle it

A direct measurement of the LLM judge's attribution accuracy on a sample of rollouts. An error rate high enough to make the per-step reward signal too noisy would undercut the claimed gains over standard rubric aggregation; a low error rate would support them.
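
A measurement like this usually comes with an uncertainty estimate. The sketch below computes a Wilson score interval for the judge's attribution accuracy from a count of attributions that human checkers marked as correct. The sample size and the count are hypothetical placeholders, and the choice of the Wilson interval is an assumption rather than something the paper specifies.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical audit: 170 of 200 sampled attributions judged correct.
low, high = wilson_interval(successes=170, n=200)
```

If the lower bound of such an interval were still well above the level at which per-step rewards become mostly noise, the load-bearing premise would look much safer.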

Original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large language models, but rewards only final-answer correctness with no supervision over intermediate steps. Rubric-based methods such as Rubrics as Rewards (RaR) introduce finer-grained supervision by scoring rollouts against structured criteria, yet the rubric scores are still aggregated into a single scalar applied to the entire response, causing three weaknesses: loss of multi-criterion structure, uniform supervision of correct and incorrect steps, and reward hacking through unbounded self-correction. On 1,000 problems, we find 18.2% of steps in correct-answer responses are wrong yet positively rewarded, while 49.9% of steps in incorrect-answer responses are correct yet penalized. We introduce Step-wise Rubrics as Rewards (SRaR), an RLVR framework that (i) uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes per-step rubric scores across rollouts so only steps whose quality varies produce a learning signal, and (iii) combines the per-step reward with the outcome reward through a decoupled advantage estimator that keeps the outcome baseline stable. We further build a 16K-problem rubric dataset by contrastively distilling rubric items from correct and flawed reasoning paths sampled from a strong model. Across six mathematical reasoning benchmarks, SRaR improves average accuracy over RaR by 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B, raises the Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7%, and reduces self-correction looping from 48.1% to 26.5%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SRaR adds LLM-based step attribution to rubric rewards plus normalization and a decoupled advantage estimator, delivering modest gains over RaR on math benchmarks, but the attribution step lacks validation.

SRaR's main contribution is using an LLM judge to attribute rubric scores to individual reasoning steps, then normalizing those across rollouts and feeding them into a decoupled advantage estimator that doesn't disturb the outcome reward baseline. This setup is new compared to RaR and standard RLVR. The authors also release a 16K-problem rubric dataset built by distilling from correct and flawed paths. They clearly document the issues with aggregated rewards, showing that many wrong steps get rewarded and correct ones penalized in current setups.

The results indicate modest but consistent lifts: 3.57 points average accuracy gain on Qwen3-8B across six math benchmarks, plus better faithful reasoning and fewer looping issues on AIME.

The soft spot is the reliance on the LLM judge for attribution. The paper gives no human agreement numbers or error analysis on whether rubric items are mapped to the right steps. If attribution is off, the per-step signals could be noisy, and the normalization might not help as intended. The abstract also skips ablations on the three components and any stats on run variance or significance.

This paper is for people working on reward design for LLM reasoning agents. Readers focused on math problem solving or reducing hallucinations in chains would find the diagnostics and the proposed fixes useful. It has enough concrete results and a clear problem statement to merit peer review, though referees will likely press on the judge validation and component contributions.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Step-wise Rubrics as Rewards (SRaR), extending Rubrics as Rewards (RaR) within RLVR for LLM reasoning. SRaR uses an LLM judge to attribute each rubric item to a specific reasoning step in a rollout, normalizes per-step scores across rollouts so that only varying-quality steps generate a signal, and combines the resulting per-step reward with the outcome reward through a decoupled advantage estimator. The authors construct a 16K-problem rubric dataset via contrastive distillation from correct and flawed paths and report accuracy gains over RaR of 3.57 points on Qwen3-8B and 2.75 points on Qwen3-32B across six math benchmarks, plus an increase in Faithful Reasoning Rate on AIME 2025 from 34.5% to 46.7% and a drop in self-correction looping from 48.1% to 26.5%.

Significance. If the LLM-judge attribution step is shown to be reliable, SRaR would constitute a clear technical advance by restoring multi-criterion structure and step-level supervision while mitigating uniform reward of incorrect steps and self-correction hacking. The contrastive construction of the 16K rubric dataset and the explicit diagnostic statistics on rewarded wrong steps are concrete strengths that future work can build upon.

major comments (2)

[Abstract and §3 (Method)] Abstract and method description of the attribution procedure: the central diagnostic (18.2% of steps in correct-answer trajectories are wrong yet positively rewarded) and all per-step learning signals rest on the LLM judge correctly mapping rubric items to individual steps. No human-agreement rates, error analysis, or validation of this attribution step are reported. If attribution errors are systematic, the variance-based normalization produces noisy or biased advantages, so the claimed 3.57-point gain over RaR cannot be confidently attributed to step-wise supervision rather than incidental changes in training dynamics.
[§5 (Experiments)] Experimental results (benchmark tables and AIME 2025 numbers): concrete accuracy lifts and the reduction in self-correction looping are presented without statistical significance tests, standard deviations across seeds, or ablations that isolate the three new components (attribution, normalization, decoupled estimator). This makes it difficult to assess whether the reported improvements are robust or load-bearing for the central claim.

minor comments (2)

[§3.2] The normalization formula and the precise form of the decoupled advantage estimator would benefit from an explicit equation or pseudocode block for reproducibility.
[§3.1] Clarify whether the LLM judge is drawn from the same model family as the policy or trained on overlapping data; this bears on the circularity concern noted in the review.
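
For orientation, one plausible explicit form of the two quantities named in the first minor comment is written out below. It reuses the symbols that appear in the paper's appendix fragments ($A_{\text{base},i}$, $\bar d_{k,i}$, $G_k$); the z-score form, the mean baseline, and the mixing weight $\lambda$ are assumptions, not the authors' stated equations.

```latex
% Assumed notation: d_{k,i} is the rubric score of step k in rollout i,
% G_k is the set of rollouts containing step k, r_i is the outcome reward,
% n is the number of rollouts, and \lambda is a hypothetical mixing weight.
\mu_k = \frac{1}{|G_k|} \sum_{i \in G_k} d_{k,i},
\qquad
\sigma_k = \sqrt{\frac{1}{|G_k|} \sum_{i \in G_k} \left(d_{k,i} - \mu_k\right)^2}

\bar d_{k,i} =
\begin{cases}
  \dfrac{d_{k,i} - \mu_k}{\sigma_k} & \text{if } \sigma_k > 0, \\[6pt]
  0 & \text{otherwise}
\end{cases}

A_{\text{base},i} = r_i - \frac{1}{n} \sum_{j=1}^{n} r_j

A_{i,t} = A_{\text{base},i} + \lambda \, \bar d_{k(t),i}
\quad \text{for token } t \text{ in step } k(t) \text{ of rollout } i
```

Under these assumed definitions, $\sum_{i=1}^{n} A_{\text{base},i} = 0$ and $\sum_{i \in G_k} \bar d_{k,i} = 0$ for every step $k$, and the baseline in $A_{\text{base},i}$ depends only on outcome rewards.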

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to specific revisions that directly strengthen the manuscript's claims.

Referee: [Abstract and §3 (Method)] Abstract and method description of the attribution procedure: the central diagnostic (18.2% of steps in correct-answer trajectories are wrong yet positively rewarded) and all per-step learning signals rest on the LLM judge correctly mapping rubric items to individual steps. No human-agreement rates, error analysis, or validation of this attribution step are reported. If attribution errors are systematic, the variance-based normalization produces noisy or biased advantages, so the claimed 3.57-point gain over RaR cannot be confidently attributed to step-wise supervision rather than incidental changes in training dynamics.

Authors: We agree that the absence of explicit validation for the LLM-judge attribution step is a limitation that weakens the ability to fully attribute gains to step-wise supervision. The manuscript does report diagnostic statistics (18.2% wrong steps positively rewarded in correct trajectories and 49.9% correct steps penalized in incorrect ones) that were obtained via the same judge, providing indirect evidence that attribution errors exist and that the method still yields net benefits. However, without human agreement rates or error analysis, systematic biases cannot be ruled out. In the revised version we will add a dedicated validation subsection: we will sample 200 attributions, have two human annotators label them for correctness, report inter-annotator agreement and judge accuracy, and analyze the most frequent error categories. These results will be used to qualify the strength of the central claim.

revision: yes

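
Inter-annotator agreement of the kind promised in this response is commonly reported as Cohen's kappa. The sketch below computes it for two annotators who each label a sample of attributions. The choice of kappa, the label names, and the toy labels are assumptions for illustration; the response does not say which agreement statistic would be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["ok", "ok", "wrong", "ok", "wrong", "ok", "ok", "wrong"]
annotator_2 = ["ok", "ok", "wrong", "wrong", "wrong", "ok", "ok", "ok"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items, but chance agreement is already 34/64, so kappa comes out at 7/15, noticeably lower than the raw agreement rate.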
Referee: [§5 (Experiments)] Experimental results (benchmark tables and AIME 2025 numbers): concrete accuracy lifts and the reduction in self-correction looping are presented without statistical significance tests, standard deviations across seeds, or ablations that isolate the three new components (attribution, normalization, decoupled estimator). This makes it difficult to assess whether the reported improvements are robust or load-bearing for the central claim.

Authors: We concur that the experimental presentation would be more convincing with statistical rigor and component-wise ablations. The current results are reported from single runs without error bars or significance tests, and the three novel elements (attribution, per-step normalization, and decoupled advantage) are not isolated. In the revision we will (i) rerun the main experiments on Qwen3-8B and Qwen3-32B with at least three random seeds, reporting mean accuracy, standard deviation, and paired t-test p-values against RaR; (ii) add an ablation table that cumulatively enables attribution, normalization, and the decoupled estimator while keeping all other factors fixed; and (iii) include the same statistical treatment for the AIME 2025 Faithful Reasoning Rate and self-correction looping metrics. These additions will allow readers to assess which components drive the observed gains.

revision: yes
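
The paired comparison across seeds described in this response reduces to a t statistic on per-seed differences. The sketch below computes that statistic and its degrees of freedom using only the standard library. The accuracy values are made up for illustration and are not results from the paper; a p-value would additionally need a t-distribution CDF, which this sketch deliberately leaves out.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-seed accuracies (three seeds each), for illustration only.
srar_acc = [61.0, 62.5, 60.5]
rar_acc = [58.0, 58.5, 58.0]
t_stat, dof = paired_t_statistic(srar_acc, rar_acc)
```

With three seeds there are only 2 degrees of freedom, so the two-sided 5% critical value is about 4.30. Small seed counts therefore demand large and consistent differences before a gain counts as significant.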

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces SRaR as an RLVR extension that attributes rubric items via an LLM judge, normalizes per-step scores across rollouts, and combines them with outcome rewards using a decoupled advantage estimator. All reported gains (accuracy lifts, Faithful Reasoning Rate, reduced looping) are framed as empirical outcomes from training and evaluation on external benchmarks. No equations reduce the final performance metric to the judge outputs or normalization by algebraic identity, no fitted parameter is relabeled as a prediction, and no load-bearing premise rests on a self-citation chain. The 18.2 % statistic is an observational motivation rather than a definitional input, and the method remains externally falsifiable on held-out math problems.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the accuracy of an external LLM judge for step attribution and on the assumption that cross-rollout normalization isolates genuine quality variation; both are introduced without independent verification in the abstract.

free parameters (1)

per-step reward weight
The decoupled advantage estimator must combine the normalized per-step signal with the outcome reward; the mixing coefficient is not stated as fixed by prior literature.
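
To make the role of this mixing coefficient concrete, the sketch below combines an outcome advantage with already-normalized per-step signals. The baseline is computed from outcome rewards alone, which is one reading of "decoupled"; the function name, the `step_weight` parameter, its default value, and the mean baseline are all assumptions rather than the paper's stated estimator.

```python
from statistics import mean


def decoupled_advantages(outcome_rewards, step_signals, step_weight=0.5):
    """Per-step advantages for a group of rollouts.

    outcome_rewards[i]: scalar outcome reward of rollout i.
    step_signals[i][k]: already-normalized rubric signal for step k of
        rollout i (zero-mean across rollouts for each step).
    step_weight: hypothetical mixing coefficient for the step signal.
    """
    # The baseline uses outcome rewards only, so step signals cannot shift it.
    baseline = mean(outcome_rewards)
    advantages = []
    for reward, signals in zip(outcome_rewards, step_signals):
        base = reward - baseline
        advantages.append([base + step_weight * s for s in signals])
    return advantages


# Toy group of four rollouts with two steps each (hypothetical values).
outcome = [1.0, 0.0, 1.0, 0.0]
signals = [[0.0, 1.0], [0.0, -1.0], [0.0, 1.0], [0.0, -1.0]]
adv = decoupled_advantages(outcome, signals)
```

The alternative would be to add step rewards into a single total reward before computing the baseline, in which case noisy step scores would move the baseline that every rollout is compared against.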

axioms (1)

domain assumption: An LLM judge can map rubric criteria to specific tokens or sentences in a reasoning rollout with high fidelity.
This premise is required for the per-step reward to be meaningful and is invoked when the abstract describes the attribution step.

invented entities (1)

step-wise rubric attribution via LLM judge · no independent evidence
purpose: To convert a single scalar rubric score into per-step learning signals.
This is a new mechanism introduced in the paper; no independent evidence (such as human validation of attributions) is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5905 in / 1511 out tokens · 44541 ms · 2026-05-20T13:26:14.451113+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

uses an LLM judge to attribute each rubric item to a specific reasoning step, (ii) normalizes the per-step rubric scores across rollouts so that only steps whose quality varies produce a learning signal, and (iii) combines the resulting per-step reward with the standard outcome reward through a decoupled advantage estimator

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Appendix fragments (extracted entries [27]-[34])

Step-correctness judge prompt (partial):
1. Are the mathematical operations correct?
2. Is the logic/deduction valid?
3. Are there any errors in the step?
Respond with EXACTLY one of:
• "CORRECT" if the step is logically and mathematically correct
• "INCORRECT" if the step contains any error

C Training Hyperparameters

Table 4: Training Hyperparameters.

Hyperparameter        Value
Max Prompt Length     2048
Max Response Length   8192
Train Batch Size      128
Learning Rate         1×10^-6
Rollout Group Si...

Rubric-item validation prompt (partial):
1. Is the content mathematically correct?
2. Is it relevant to this specific problem?
3. Is the type label (SUGGEST/PITFALL/BONUS/ANSWER) appropriate?
Respond with ONLY a JSON object (no markdown, no extra text): {"valid": true or false, "reason": "brief explanation"}

[Figure: training curves plotted against Training Step; panel (a) Suggest Score (Step); a further panel whose label is truncated ("Bonus...")]

Zero-sum property of the combined advantage (partial):
1. The outcome component: $\sum_{i=1}^{n} A_{\text{base},i} = 0$ (Proposition O.1).
2. The rubric component: for each step $k$, $\sum_{i \in G_k} \bar{d}_{k,i} = 0$ (Proposition O.2).

Proof. Follows directly from Propositions O.1 and O.2. The outcome advantage is a scalar broadcast to all tokens, so it sums to zero over rollouts at every token position. The rubric reward at any token $t$ belonging to step $k$ in rollout $i$ equals $\bar{d}_{k,i}$, which sums to zero over $G_k$ by Proposition ...
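
The rubric validation prompt quoted above asks the judge to reply with a bare JSON object containing `valid` and `reason`. A minimal sketch of how such a reply might be parsed defensively is shown below. The function name and the decision to treat malformed replies as invalid are assumptions; the source does not describe how the authors handle parsing failures.

```python
import json


def parse_validation_reply(reply):
    """Return (valid, reason) from a judge reply.

    Assumption: malformed replies are treated as invalid rather than
    retried; a real pipeline might re-prompt the judge instead.
    """
    try:
        data = json.loads(reply.strip())
    except json.JSONDecodeError as exc:
        return False, f"unparseable reply: {exc.msg}"
    if not isinstance(data, dict) or not isinstance(data.get("valid"), bool):
        return False, "missing or non-boolean 'valid' field"
    return data["valid"], str(data.get("reason", ""))


good = parse_validation_reply('{"valid": true, "reason": "matches the problem"}')
bad = parse_validation_reply("Sure! Here is the JSON you asked for.")
wrong_type = parse_validation_reply('{"valid": "yes"}')
```

Rejecting a string such as `"yes"` in the `valid` field keeps a loosely formatted judge reply from being silently counted as an accepted rubric item.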

[1] [1]

Reward and guidance through rubrics: Promoting exploration to improve multi-domain reasoning.arXiv preprint arXiv:2511.12344, 2025

Baolong Bi, Shenghua Liu, Yiwei Wang, Siqian Tong, Lingrui Mei, Yuyao Ge, Yilong Xu, Jiafeng Guo, and Xueqi Cheng. Reward and guidance through rubrics: Promoting exploration to improve multi-domain reasoning.arXiv preprint arXiv:2511.12344, 2025

work page arXiv 2025

[2] [2]

Babyvision: Visual reasoning beyond language

Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language.arXiv preprint arXiv:2601.06521, 2026

work page arXiv 2026

[3] [3]

DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.Nature, 645: 633–638, 2025

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.Nature, 645: 633–638, 2025

work page 2025

[4] [4]

Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, and Zhongyu Wei. Interleaved latent visual reasoning with selective perceptual modeling.arXiv preprint arXiv:2512.05665, 2025

work page arXiv 2025

[5] [5]

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

LLM-Rubric: A multidi- mensional, calibrated approach to automated evaluation of natural language texts

Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie. LLM-Rubric: A multidi- mensional, calibrated approach to automated evaluation of natural language texts. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume1: Long Papers), pages 13806–13834, 2024

work page 2024

[7] [7]

OlympiadBench: A challenging benchmark for promoting AGI with olympiad- level bilingual multimodal scientific problems

Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad- level bilingual multimodal scientific problems. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024

work page 2024

[8] [8]

Measuring mathematical problem solving with the MATH dataset.Advancesin Neural Information Processing Systems, 34:7294–7305, 2021

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset.Advancesin Neural Information Processing Systems, 34:7294–7305, 2021

work page 2021

[9] Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024

[10] Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, et al. Solving quantitative reasoning problems with language models. Advances in Neural Information Processing Systems, 35:3843–3857, 2022

[11] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In International Conference on Learning Representations, 2024

[12] Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning in language models by automated process supervision. arXiv preprint arXiv:2406.06592, 2024

[13] OpenAI. Learning to reason with LLMs. https://openai.com/index/learning-to-reason-with-llms/, 2024

[14] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[15] Shuzheng Si, Haozhe Zhao, Yu Lei, Qingyi Wang, Dingwei Chen, Zhitong Wang, Zhenhailong Wang, Kangyang Luo, Zheng Wang, Gang Chen, Fanchao Qi, Minjia Zhang, and Maosong Sun. From context to skills: Can language models learn from context skillfully?, 2026. URL https://arxiv.org/abs/2604.27660

[16] Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng, Hang Yu, Hao Yang, Haonan Yan, Haoze Sun, Haozhe Zhao, Hong Liu, Hui Su, Jiaqi Zhang, Jiawei Wang, Jing Li, Kefeng Zhang, Manyuan Zhang, Minhao Jing, Peng Pei, Quan Chen, Taofeng Xue, Tongxin Pan, Xiaotong Li, Xiaoyang Li, Xiaoyu Zhao, Xing Hu, Xinyang Lin, Xunliang Cai, Yan ... Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026

[17] Jingyi Wang, Lei Zhu, Tengjin Weng, Song-Li Wu, Haochen Tan, Jierun Chen, Chaofan Tao, Haoli Bai, Lu Hou, Lifeng Shang, and Xiao-Ping Zhang. GRPO-VPS: Enhancing group relative policy optimization with verifiable process supervision for effective reasoning. arXiv preprint arXiv:2604.20659, 2026

[18] Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. arXiv preprint arXiv:2312.08935, 2024

[19] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations, 2023

[20] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

[21] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3–4):229–256, 1992

[22] Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, and Kun Yuan. Grouter: Decoupling routing from representation for accelerated moe training. arXiv preprint arXiv:2603.06626, 2026

[23] An Yang, Anfeng Yang, Baosong Yang, Beichen Bi, Binyuan Hui, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[24] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

[25] Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, and Junjie Hu. Mentor: Efficient multimodal-conditioned tuning for autoregressive vision generation models. arXiv preprint arXiv:2507.09574, 2025

[26] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2023

Appendix

A Limitations

We identify the following limitations of our work:

1. Dependence on the LLM ...

Are the mathematical operations correct?

Is the logic/deduction valid?

Are there any errors in the step?

Respond with EXACTLY one of:
• “CORRECT” if the step is logically and mathematically correct
• “INCORRECT” if the step contains any error

C Training Hyperparameters

Table 4 Training Hyperparameters.

Hyperparameter         Value
Max Prompt Length      2048
Max Response Length    8192
Train Batch Size       128
Learning Rate          $1 \times 10^{-6}$
Rollout Group Si...
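The step-verification prompt above asks the judge to reply with exactly one of "CORRECT" or "INCORRECT". A minimal sketch of how such a reply could be mapped to a boolean is given below. The function name `parse_step_verdict` and the choice to return `None` for anything unparseable are illustrative assumptions, not part of the paper.

```python
from typing import Optional


def parse_step_verdict(judge_output: str) -> Optional[bool]:
    """Map a judge reply to True (CORRECT), False (INCORRECT), or None.

    Hypothetical helper: the paper does not specify its parsing logic.
    """
    # Drop surrounding whitespace, quote characters, and a trailing period,
    # then normalise case so that '"correct"' and 'CORRECT.' both match.
    cleaned = judge_output.strip().strip('"\'“”.').strip().upper()
    if cleaned == "CORRECT":
        return True
    if cleaned == "INCORRECT":
        return False
    # Anything else (explanations, empty replies) is treated as unparseable.
    return None


if __name__ == "__main__":
    for reply in ['"CORRECT"', "incorrect", "The step is CORRECT"]:
        print(repr(reply), "->", parse_step_verdict(reply))
```

Exact matching is used rather than a substring test because "INCORRECT" contains "CORRECT", so a naive `"CORRECT" in reply` check would misclassify every negative verdict.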

Is the content mathematically correct?

Is it relevant to this specific problem?

Is the type label (SUGGEST/PITFALL/BONUS/ANSWER) appropriate?

Respond with ONLY a JSON object (no markdown, no extra text):

{"valid": true or false, "reason": "brief explanation"}

[Figure: training curves. (a) Suggest Score: x-axis Training Step, y-axis Suggest Score (Step). Next panel: x-axis Training Step, y-axis Bonus...]
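The rubric-validation prompt above requests a bare JSON object with a boolean `valid` field and a string `reason` field. The sketch below shows one way to check a reply against that shape. The function name `parse_rubric_validation` and the policy of rejecting malformed replies by returning `None` are assumptions made for illustration.

```python
import json
from typing import Optional, Tuple


def parse_rubric_validation(judge_output: str) -> Optional[Tuple[bool, str]]:
    """Return (valid, reason) if the reply matches the requested JSON shape.

    Hypothetical helper: the paper does not describe how replies are parsed.
    """
    try:
        payload = json.loads(judge_output.strip())
    except json.JSONDecodeError:
        # Markdown fences or extra prose violate the prompt and fail here.
        return None
    if not isinstance(payload, dict):
        return None
    valid = payload.get("valid")
    reason = payload.get("reason", "")
    # Require a real JSON boolean; strings such as "yes" are rejected.
    if not isinstance(valid, bool) or not isinstance(reason, str):
        return None
    return valid, reason


if __name__ == "__main__":
    print(parse_rubric_validation('{"valid": true, "reason": "relevant"}'))
    print(parse_rubric_validation('{"valid": "yes"}'))
```

Rejecting non-boolean values keeps a loosely formatted reply from being silently counted as a valid rubric item.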

The outcome component: $\sum_{i=1}^{n} A_{\text{base},i} = 0$ (Proposition O.1).

The rubric component: for each step $k$, $\sum_{i \in G_k} \bar{d}_{k,i} = 0$ (Proposition O.2).

Proof. Follows directly from Propositions O.1 and O.2. The outcome advantage is a scalar broadcast to all tokens, so it sums to zero over rollouts at every token position. The rubric reward at any token $t$ belonging to step $k$ in rollout $i$ equals $\bar{d}_{k,i}$, which sums to zero over $G_k$ by Proposition ...
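The two zero-sum properties above can be checked numerically. The sketch below assumes, without confirmation from this excerpt, that $A_{\text{base},i}$ is a group-standardized outcome reward and that $\bar{d}_{k,i}$ is the score of step $k$ in rollout $i$ minus its mean over $G_k$. Both definitions and all function names are illustrative and may differ from the paper's exact estimator.

```python
from statistics import mean, pstdev


def outcome_advantages(rewards, eps=1e-8):
    """Assumed form of A_base,i: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def centered_step_scores(scores_in_group):
    """Assumed form of d-bar_{k,i}: raw step score minus the mean over G_k.

    scores_in_group maps rollout index i (only rollouts in G_k) to the
    raw score of step k in that rollout.
    """
    mu = mean(scores_in_group.values())
    return {i: s - mu for i, s in scores_in_group.items()}


if __name__ == "__main__":
    # Five rollouts with binary outcome rewards.
    a_base = outcome_advantages([1, 0, 0, 1, 1])
    # Step k appears only in rollouts 0, 2 and 3, so G_k = {0, 2, 3}.
    d_bar = centered_step_scores({0: 1.0, 2: 0.0, 3: 0.5})
    print(abs(sum(a_base)) < 1e-9)
    print(abs(sum(d_bar.values())) < 1e-9)
```

Under these assumptions both sums vanish up to floating-point error because each quantity is a deviation from its own group mean, and dividing by a constant does not change that.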