pith. machine review for the scientific record.

arxiv: 2603.26535 · v3 · submitted 2026-03-27 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

PAPO: Stabilizing Rubric Integration Training via Decoupled Advantage Normalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords PAPO · decoupled advantage normalization · process reward model · outcome reward model · GRPO · reasoning optimization · reinforcement learning · rubric evaluation

The pith

Decoupled normalization of outcome and process advantages integrates rubric rewards into policy optimization without reward hacking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PAPO to fix two problems in training reasoning models: outcome rewards stop providing signal once most answers are correct, while direct use of process rewards leads models to add unnecessary words to inflate scores and lose accuracy. It solves this by splitting the advantage into an outcome part normalized over every response and a process part normalized only among correct responses. This keeps correctness as the anchor while letting process signals improve step quality. A reader would care because the approach yields higher accuracy on hard math benchmarks and sustains gains after standard methods plateau.

Core claim

PAPO composes the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses, so that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal.
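
To make the mechanism concrete, here is a minimal sketch of the decoupled normalization as the claim describes it. The z-score form, the stability epsilon, and zeroing Aproc for incorrect responses are illustrative assumptions; the paper's exact formulas are not reproduced in this review.

```python
# Minimal sketch of decoupled advantage normalization (illustrative
# assumptions: z-score normalization, stability epsilon, Aproc = 0 for
# incorrect responses; the paper's exact formulas may differ).
import numpy as np

def decoupled_advantages(r_out, r_proc, eps=1e-8):
    """r_out: binary ORM rewards for one GRPO group of responses.
    r_proc: rubric-PRM scores for the same group."""
    r_out = np.asarray(r_out, dtype=float)
    r_proc = np.asarray(r_proc, dtype=float)

    # Outcome advantage: normalized over the whole group, GRPO-style.
    a_out = (r_out - r_out.mean()) / (r_out.std() + eps)

    # Process advantage: normalized only among ORM-correct responses,
    # zero elsewhere, so incorrect trajectories carry no process signal.
    a_proc = np.zeros_like(r_proc)
    correct = r_out > 0.5
    if correct.sum() >= 2:  # need at least two samples to normalize over
        sub = r_proc[correct]
        a_proc[correct] = (sub - sub.mean()) / (sub.std() + eps)

    # Correctness anchors the total; process quality differentiates within it.
    return a_out + a_proc

# Example: two of four responses are correct; only those two receive
# a nonzero process advantage.
adv = decoupled_advantages(r_out=[1, 0, 1, 0], r_proc=[0.9, 0.7, 0.4, 0.8])
```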

What carries the argument

Decoupled advantage normalization separating global outcome advantage from correctness-restricted process advantage.

Load-bearing premise

The rubric-based process reward model measures genuine reasoning quality and does not reward superficial features such as length among responses that are already correct.

What would settle it

If PAPO training causes models to produce longer responses while accuracy on the benchmarks falls, the claim that the decoupled normalization prevents reward hacking would be falsified.
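
One way to run that test is a training-time monitor for the verbosity-up, accuracy-down signature. A minimal sketch, assuming per-step logs of mean response length and evaluation accuracy; the window size and the simple linear-trend test are illustrative choices, not from the paper.

```python
# Sketch of a reward-hacking monitor for the falsification test above.
# Window length and the linear-trend test are illustrative assumptions.
import numpy as np

def reward_hacking_signature(lengths, accuracies, window=50):
    """Flag if mean response length trends up while accuracy trends down
    over the most recent `window` training steps."""
    if len(lengths) < window or len(accuracies) < window:
        return False
    steps = np.arange(window)
    len_slope = np.polyfit(steps, np.asarray(lengths[-window:], float), 1)[0]
    acc_slope = np.polyfit(steps, np.asarray(accuracies[-window:], float), 1)[0]
    return bool(len_slope > 0 and acc_slope < 0)
```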

Original abstract

We propose Process-Aware Policy Optimization (PAPO), a method that integrates process-level evaluation into Group Relative Policy Optimization (GRPO) through decoupled advantage normalization, to address two limitations of existing reward designs. Outcome reward models (ORM) evaluate only final-answer correctness, treating all correct responses identically regardless of reasoning quality, and gradually lose the advantage signal as groups become uniformly correct. Process reward models (PRM) offer richer supervision, but directly using PRM scores causes reward hacking, where models exploit verbosity to inflate scores while accuracy collapses. PAPO resolves both by composing the advantage from an outcome component Aout, derived from ORM and normalized over all responses, and a process component Aproc, derived from a rubric-based PRM and normalized exclusively among correct responses. This decoupled design ensures that Aout anchors training on correctness while Aproc differentiates reasoning quality without distorting the outcome signal. Experiments across multiple model scales and six benchmarks demonstrate that PAPO consistently outperforms ORM, reaching 51.3% vs. 46.3% on OlympiadBench while continuing to improve as ORM plateaus and declines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Process-Aware Policy Optimization (PAPO) to integrate rubric-based process rewards into Group Relative Policy Optimization (GRPO) via decoupled advantage normalization. It decomposes the advantage into an outcome component Aout (derived from ORM scores and normalized over the full group) and a process component Aproc (derived from a rubric-based PRM and normalized only over ORM-labeled correct responses). This design aims to preserve the correctness signal while allowing differentiation of reasoning quality, avoiding both signal collapse in uniform-correct groups and reward hacking from verbosity. Experiments across model scales and six benchmarks report consistent gains over ORM baselines, including 51.3% vs. 46.3% on OlympiadBench, with PAPO continuing to improve after ORM plateaus.

Significance. If the empirical gains prove robust, the decoupled normalization offers a lightweight, parameter-free way to incorporate process supervision without distorting outcome-driven training. This could meaningfully advance reasoning-model training on math and olympiad-style tasks where standard ORM rewards saturate.

major comments (2)
  1. [Abstract/Experiments] The reported benchmark gains (e.g., 51.3% vs. 46.3% on OlympiadBench) are presented without error bars, number of random seeds, or statistical significance tests, leaving the central claim of consistent outperformance only moderately supported.
  2. [Method] Decoupled advantage definition: Aproc is normalized exclusively over responses labeled correct by the ORM. No ORM error-rate measurement on the training distribution or oracle-correctness ablation is provided, so the claim that this partition cleanly isolates reasoning quality without leaking superficial features rests on an unverified assumption.
minor comments (1)
  1. Notation for Aout and Aproc is introduced without an explicit equation block showing the exact normalization formulas; adding this would improve reproducibility.
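
For concreteness, one plausible reconstruction of the missing equation block, following standard GRPO z-score conventions; these are this review's assumptions, not the paper's verbatim definitions. Let G be the sampled group and C ⊆ G the ORM-correct subset:

```latex
% Plausible reconstruction under standard GRPO conventions (an assumption,
% not the paper's verbatim definitions).
A^{\mathrm{out}}_i =
  \frac{r^{\mathrm{out}}_i - \operatorname{mean}_{j \in \mathcal{G}} r^{\mathrm{out}}_j}
       {\operatorname{std}_{j \in \mathcal{G}} r^{\mathrm{out}}_j},
\qquad
A^{\mathrm{proc}}_i =
  \begin{cases}
    \dfrac{r^{\mathrm{proc}}_i - \operatorname{mean}_{j \in \mathcal{C}} r^{\mathrm{proc}}_j}
          {\operatorname{std}_{j \in \mathcal{C}} r^{\mathrm{proc}}_j} & i \in \mathcal{C} \\[1.5ex]
    0 & \text{otherwise,}
  \end{cases}
\qquad
A_i = A^{\mathrm{out}}_i + A^{\mathrm{proc}}_i .
```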

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on empirical robustness and methodological assumptions. We address each major comment below and indicate where revisions will be made to the manuscript.

Point-by-point responses
  1. Referee: [Abstract/Experiments] The reported benchmark gains (e.g., 51.3% vs. 46.3% on OlympiadBench) are presented without error bars, number of random seeds, or statistical significance tests, leaving the central claim of consistent outperformance only moderately supported.

    Authors: We agree that error bars and seed information would strengthen the presentation. In the revised manuscript we will report results from three random seeds for the primary benchmarks, include standard deviation error bars on the main tables and figures, and update the abstract to note that gains were consistent across seeds. Full statistical significance testing across all six benchmarks is computationally intensive and was not performed in the original experiments; we will add a limitations note acknowledging this while emphasizing the uniform improvements observed across model scales. revision: yes

  2. Referee: [Method] Decoupled advantage definition: Aproc is normalized exclusively over responses labeled correct by the ORM. No ORM error-rate measurement on the training distribution or oracle-correctness ablation is provided, so the claim that this partition cleanly isolates reasoning quality without leaking superficial features rests on an unverified assumption.

    Authors: The design intentionally relies on the ORM only for binary correctness labeling, which is its standard and well-validated use case in reasoning RL; Aproc is then computed solely within that subset to differentiate reasoning quality. We will expand the method section with additional discussion of this choice, citing prior work on ORM reliability for final-answer verification in math domains, and clarify that the normalization scope prevents process signals from incorrect trajectories. An explicit ORM error-rate measurement on the training distribution or oracle ablation is not present in the current experiments; we view the absence of reward hacking in the reported results as supporting evidence for clean isolation, but acknowledge this remains an assumption. revision: partial
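
To make response 2 concrete: the missing measurement is the rate at which ORM mislabels let incorrect trajectories leak into the Aproc normalization set. A minimal sketch against oracle correctness labels on a held-out sample; the function and variable names are illustrative, not from the paper.

```python
# Sketch of the ORM error-rate measurement discussed in response 2.
# Assumed inputs: oracle correctness labels (e.g., exact answer matching)
# and ORM labels on a held-out sample of training responses.
import numpy as np

def orm_leakage_rate(oracle_correct, orm_correct):
    """Fraction of ORM-labeled-correct responses that are actually wrong,
    i.e. incorrect trajectories leaking into the Aproc normalization set."""
    oracle = np.asarray(oracle_correct, dtype=bool)
    orm = np.asarray(orm_correct, dtype=bool)
    flagged = int(orm.sum())
    return 0.0 if flagged == 0 else float((orm & ~oracle).sum()) / flagged
```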

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The PAPO construction composes Aout (ORM-derived, globally normalized) and Aproc (rubric-PRM-derived, normalized only over ORM-labeled correct responses) via standard advantage formulas with no fitted parameters, self-referential equations, or load-bearing self-citations. The abstract and description present an explicit compositional design whose central claims rest on external benchmarks rather than reducing to the inputs by definition. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method rests on the standard policy-gradient assumptions of GRPO plus the assumption that rubric PRM scores are meaningful when conditioned on correctness. No free parameters are introduced, and the single invented entity below is a compositional device rather than a physical posit.

axioms (1)
  • domain assumption: Policy gradient methods remain stable when advantages are additively composed from separately normalized components.
    Invoked when defining the total advantage as Aout + Aproc.
invented entities (1)
  • Decoupled advantage components Aout and Aproc (no independent evidence)
    purpose: Separate global correctness signal from within-correct reasoning differentiation
    New compositional device introduced to avoid reward hacking and signal collapse

pith-pipeline@v0.9.0 · 5528 in / 1151 out tokens · 30276 ms · 2026-05-14T22:52:03.500132+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.