AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin

Dejing Dou; Jian Xiong; Jingbo Zhou; Jingyong Ye; Qiang Huang

arxiv: 2505.14264 · v3 · submitted 2025-05-20 · 💻 cs.LG · cs.CL

AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin

Jian Xiong , Jingbo Zhou , Jingyong Ye , Qiang Huang , Dejing Dou This is my paper

Pith reviewed 2026-05-22 13:32 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords reinforcement learningLLMsreasoning capabilitiesadvantage estimationpolicy optimizationmathematical reasoningcross-entropy loss

0 comments

The pith

AAPO enhances LLM reasoning by using margin-enhanced advantages in reinforcement learning optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Advantage-Augmented Policy Optimization (AAPO) as a new method to train large language models for better reasoning using reinforcement learning. It addresses inefficiencies in group relative advantage estimation, especially when advantages near zero, by adding a margin to those advantages. This allows optimization of the cross-entropy loss with stronger signals. The result is improved performance on mathematical reasoning benchmarks compared to previous approaches like GRPO.

Core claim

AAPO mitigates the inefficiencies associated with group relative advantage estimation by optimizing the cross-entropy loss using advantages enhanced through a margin-based estimation scheme, leading to superior performance on mathematical reasoning benchmarks.

What carries the argument

The margin-based estimation scheme that augments group-relative advantages to provide stronger training signals in the policy optimization process.

If this is right

Training of LLMs for reasoning tasks becomes more efficient without relying on a value model.
Performance improves on mathematical reasoning tasks when advantages are small or zero.
Cross-entropy loss optimization benefits from the enhanced advantage estimates.
The method simplifies RL-based post-training for LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The margin scheme could potentially be combined with other RL methods beyond group relative estimation.
Results on math benchmarks suggest applicability to other reasoning domains like logic or coding problems.
Further experiments could test if the margin value requires task-specific adjustment.

Load-bearing premise

Adding a margin to group-relative advantages will consistently produce a stronger training signal without introducing new instabilities or requiring extensive hyperparameter tuning.

What would settle it

A direct comparison experiment showing no improvement or even degradation in reasoning accuracy when using AAPO versus standard GRPO on the same benchmarks would falsify the central claim.

read the original abstract

Reinforcement learning (RL) has emerged as an effective approach for enhancing the reasoning capabilities of large language models (LLMs), especially in scenarios where supervised fine-tuning (SFT) falls short due to limited chain-of-thought (CoT) data. Among RL-based post-training methods, group relative advantage estimation, as exemplified by Group Relative Policy Optimization (GRPO), has attracted considerable attention for eliminating the dependency on the value model, thereby simplifying training compared to traditional approaches like Proximal Policy Optimization (PPO). However, we observe that exsiting group relative advantage estimation method still suffers from training inefficiencies, particularly when the estimated advantage approaches zero. To address this limitation, we propose Advantage-Augmented Policy Optimization (AAPO), a novel RL algorithm that optimizes the cross-entropy (CE) loss using advantages enhanced through a margin-based estimation scheme. This approach effectively mitigates the inefficiencies associated with group relative advantage estimation. Experimental results on multiple mathematical reasoning benchmarks demonstrate the superior performance of AAPO. Code is available at https://github.com/JianxXiong/AAPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AAPO adds a uniform margin to GRPO advantages to fix near-zero signals, but this likely injects an unanalyzed constant term into the policy gradient that the paper does not derive or test.

read the letter

The main point is that AAPO modifies group-relative advantage estimation in GRPO by adding a margin to the advantages before using them in the cross-entropy loss. The goal is to strengthen weak training signals when advantages hover near zero, and the experiments report better results on math reasoning benchmarks than the GRPO baseline. They also release code, which is useful for anyone wanting to try the change directly. This is a practical, incremental adjustment rather than a new framework, and it fits the pattern of recent work that simplifies RL post-training by dropping value models. The results appear consistent with the claim of improved performance, at least on the reported tasks. The citation pattern is standard and properly builds on GRPO without obvious gaps. That said, the margin scheme itself is the soft spot. If the margin is added uniformly across a group, the advantage vector no longer has zero mean, so the policy gradient gains an extra term proportional to the margin times the gradient of the log probability. This term does not depend on which response in the group is better, and nothing in the abstract or stress-test description shows a derivation that this extra term preserves the intended relative weighting or stays small. Experiments use only one margin value with no sweeps or direct comparison to a plain additive baseline, so it is unclear whether the gains come from the margin or from other tuning. The central assumption that the margin reliably improves the signal without new instabilities is not yet backed by the kind of analysis that would make the method easy to trust at scale. This paper is for people already working on RL variants for LLM reasoning who need a quick practical tweak and are willing to run their own checks on the gradient behavior. A reader who cares about clean derivations or extensive ablations will find it thin, but someone implementing post-training pipelines could extract value from the code and benchmark numbers. It is solid enough on the experimental side to deserve a serious referee rather than a desk reject, mainly because the idea is simple to test and the code is public. I would send it for review with a request for explicit gradient analysis and margin sensitivity results.

Referee Report

2 major / 2 minor

Summary. The paper proposes Advantage-Augmented Policy Optimization (AAPO) as an RL post-training method for LLMs. It observes that group-relative advantage estimation (as in GRPO) suffers from training inefficiencies when estimated advantages approach zero, and introduces a margin-based scheme to enhance advantages before optimizing the cross-entropy loss. Experiments on mathematical reasoning benchmarks are reported to show superior performance relative to prior methods, with code released.

Significance. If the margin-based augmentation can be shown to strengthen weak advantage signals without introducing systematic bias or new instabilities, AAPO would provide a lightweight improvement to value-model-free RL methods for LLM reasoning. The open-source code is a positive factor for reproducibility.

major comments (2)

[Margin-based estimation scheme] The central claim that the margin-based estimation scheme mitigates inefficiencies requires an explicit derivation or analysis showing that a uniform additive margin m applied to all group-relative advantages (adv' = adv + m) does not produce a nonzero-mean advantage vector whose extra m * ∇log π term biases the policy update. No such derivation or cancellation argument appears in the description of the margin-based scheme or the objective.
[Experimental results] Experiments report results for only a single margin value with no sensitivity analysis or ablation against a simple additive baseline. This leaves open whether the reported gains are robust or specific to the chosen hyperparameter, directly affecting the claim of consistent mitigation of zero-advantage inefficiencies.

minor comments (2)

[Abstract] The abstract contains a typo: 'exsiting' should be 'existing'.
[Method] Notation for the margin-augmented advantage and its insertion into the CE loss should be defined with an equation rather than prose only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to clarify and strengthen our manuscript. We address each major comment below and indicate the revisions planned for the next version.

read point-by-point responses

Referee: [Margin-based estimation scheme] The central claim that the margin-based estimation scheme mitigates inefficiencies requires an explicit derivation or analysis showing that a uniform additive margin m applied to all group-relative advantages (adv' = adv + m) does not produce a nonzero-mean advantage vector whose extra m * ∇log π term biases the policy update. No such derivation or cancellation argument appears in the description of the margin-based scheme or the objective.

Authors: We acknowledge that the current manuscript does not contain an explicit derivation analyzing the effect of the additive margin on the mean of the advantage vector or the resulting policy gradient term. In the revised version we will add a dedicated paragraph in Section 3 deriving the gradient of the augmented objective. The analysis will show that, because advantages are computed and applied within each sampled group and the group-relative normalization is preserved up to the constant shift, the extra m · ∇log π term does not introduce a systematic directional bias across groups; it primarily scales the magnitude of updates for near-zero-advantage samples. We will also note the empirical observation that training remains stable under the chosen margin, consistent with the absence of harmful bias in practice. revision: yes
Referee: [Experimental results] Experiments report results for only a single margin value with no sensitivity analysis or ablation against a simple additive baseline. This leaves open whether the reported gains are robust or specific to the chosen hyperparameter, directly affecting the claim of consistent mitigation of zero-advantage inefficiencies.

Authors: We agree that reporting only a single margin value limits the strength of the robustness claim. In the revised manuscript we will add a sensitivity study varying the margin over a small grid of values on the primary mathematical reasoning benchmarks and include an ablation that directly compares AAPO against a simple constant-additive baseline (i.e., adv' = adv + c with the same c but without the margin-based selection logic). These new results will be presented in an expanded experimental section to demonstrate that the performance gains are not tied to one specific hyperparameter choice. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is an independent heuristic modification

full rationale

The paper introduces AAPO as a novel algorithm that augments group-relative advantages with a uniform margin before optimizing cross-entropy loss. This modification is presented as an empirical fix for near-zero advantage cases rather than a derivation from first principles. No equations are shown that reduce the claimed performance gain to a fitted parameter or self-referential definition, and the provided text contains no load-bearing self-citations or uniqueness theorems that would force the result by construction. The central support comes from benchmark experiments, which remain independent of the method's internal definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; no free parameters, axioms, or invented entities are specified in the provided text.

pith-pipeline@v0.9.0 · 5727 in / 1125 out tokens · 29935 ms · 2026-05-22T13:32:04.720442+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

AAPO optimizes the cross-entropy loss using advantages enhanced through a margin-based estimation scheme... ˆA∗i,t = rθi − mean(rθ)/std(rθ) + clip(rθi − rrefi, δlow, δhigh)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Holder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token aggregation in GRPO via the Hölder mean with dynamic p annealing, reporting 54.9% average math-benchmark accuracy and 93.8% ALFWorld success.
Holder Policy Optimisation
cs.LG 2026-05 unverdicted novelty 6.0

HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.