AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Kai Ming Ting; Kun Yao; Ming Pang; Yang Xu; Yiming Deng; Zheng Fang

arxiv: 2605.05826 · v1 · submitted 2026-05-07 · 💻 cs.AI

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Yang Xu , Kun Yao , Yiming Deng , Zheng Fang , Kai Ming Ting , Ming Pang This is my paper

Pith reviewed 2026-05-08 11:20 UTC · model grok-4.3

classification 💻 cs.AI

keywords Asymmetric Group Policy OptimizationRLVRLLM reasoningboundary shrinkagemathematical benchmarkssearch ads relevancegroup advantageverifiable rewards

0 comments

The pith

Asymmetric Group Policy Optimization counters reasoning boundary shrinkage in RLVR-trained models while raising accuracy and pass@k coverage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning with verifiable rewards improves how efficiently large language models sample correct reasoning paths, yet it often narrows the range of patterns the model can still discover compared with its untrained base version. The paper introduces Asymmetric Group Policy Optimization to reverse this narrowing by applying stronger negative updates to incorrect paths and scaling positive updates according to the variance inside each group of sampled responses. This design keeps exploration capacity alive while still reinforcing rare correct answers and suppressing trivial ones. On five mathematical benchmarks the approach reaches state-of-the-art accuracy and raises pass@k scores even at large sample sizes. In an industrial search-ads relevance setting it also raises the quality of human annotations, which then produces measurable gains when the labels are used to train smaller student models.

Core claim

The paper claims that AGPO prevents the capability-boundary shrinkage seen in standard RLVR by combining a negative-dominant reinforcement strategy that suppresses wrong reasoning paths with a group-advantage mechanism that scales positive updates by intra-group variance, thereby preserving the base model's ability to surface fundamentally new correct patterns; the resulting models achieve higher accuracy and better large-k coverage on mathematical benchmarks and improve downstream performance in search-ads relevance through higher-quality annotations.

What carries the argument

Asymmetric Group Policy Optimization (AGPO), which pairs negative-dominant reinforcement to penalize incorrect paths with a variance-scaled group advantage for positive updates that emphasizes rare correct responses.

If this is right

The trained models reach state-of-the-art accuracy on five standard mathematical reasoning benchmarks.
Pass@k performance improves consistently as the number of samples grows, unlike prior RLVR methods.
Data-annotation quality rises in a large-scale search-ads relevance task.
Downstream student models trained on the improved annotations show substantial performance gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same asymmetry could be tested on other verifiable domains such as code generation or theorem proving to check whether boundary preservation generalizes.
If the variance-scaling term proves robust, it might be combined with existing exploration bonuses to widen boundaries further without extra negative pressure.
Industrial pipelines that already collect group-level responses could adopt the method with minimal extra labeling cost.

Load-bearing premise

That applying stronger negative updates to wrong answers and scaling positive updates by intra-group variance will suppress errors without eliminating the base model's capacity to discover entirely new correct reasoning patterns.

What would settle it

A controlled experiment in which, after AGPO training, the pass@k curve at large k (hundreds of samples) falls below the base model's curve or the set of distinct correct reasoning traces shrinks rather than stays at least as broad.

Figures

Figures reproduced from arXiv: 2605.05826 by Kai Ming Ting, Kun Yao, Ming Pang, Yang Xu, Yiming Deng, Zheng Fang.

**Figure 1.** Figure 1: Advantage magnitude comparison: AGPO vs. baselines. AGPO employs asymmetric dynamic advantage estimation, with advantages in both PSR and NSR showing nearly linear trends. critical for PSR in RLVR, as it enables the effective improvement of the probability of latent rare correct paths within the base model, thereby enhancing the trained model’s greedy reasoning capability view at source ↗

**Figure 2.** Figure 2: Pass@k performance scaling on MATH, Olympiad and AIME-2024 with Qwen3-4B and Llama-3.1-8B-Instruct view at source ↗

**Figure 3.** Figure 3: Training dynamics of Qwen2.5-Math-7B on MATH under different RL methods across training steps. (a) Correct responses ratio per batch on the training set. (b) Greedy decoding accuracy (Pass@1) on the test set. (c) The model’s entropy on the test set. This is because PPO and GRPO reinforce correct paths with greater magnitude, allowing them to rapidly exploit a subset of high-likelihood paths. However, these… view at source ↗

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AGPO tries to stop RLVR from shrinking the reasoning boundary via negative-dominant updates and variance-scaled positives, but the abstract gives no numbers or ablations to show it works.

read the letter

The core idea is a negative-dominant policy update that pushes down wrong paths hard, paired with a group advantage that scales positive updates by intra-group variance so rare correct trajectories get more weight. This is meant to keep the base model's coverage at large k while still improving accuracy. The abstract positions this as a direct response to the documented RLVR problem where accuracy rises but pass@k falls relative to the base model at scale, and it reports SOTA accuracy plus better pass@k on five math benchmarks plus gains in an industrial search-ads annotation pipeline that flows to student models. That combination of academic benchmarks and a real production use case is the part worth noting; the mechanism itself is a specific tweak on group-based RL rather than a wholly new algorithm family. The main weakness is that none of the claims come with numbers, ablation tables, or statistical tests in the abstract, so it is impossible to judge whether the asymmetry actually preserves low-probability correct paths or whether the variance scaling ends up suppressing them when they appear infrequently in groups. The stress-test concern about gradients on rare correct answers is not addressed in the provided text, and without that derivation or empirical check the central claim stays unverified. This paper is aimed at people training reasoning models with verifiable rewards and at teams doing RL for ranking or annotation pipelines. It is coherent enough on its own terms to deserve a serious referee, mainly because the limitation it targets is real and the proposed fix is concrete enough to test. I would send it out for review with a request for the missing quantitative results and a direct check on coverage of infrequent reasoning patterns.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Asymmetric Group Policy Optimization (AGPO) for Reinforcement Learning with Verifiable Rewards (RLVR) in LLMs. It employs a negative-dominant strategy to suppress incorrect reasoning trajectories while using a group-based advantage scaled by intra-group variance for positive updates, with the goal of focusing on rare correct paths and counteracting the observed narrowing of reasoning boundaries relative to base models. Experiments claim state-of-the-art accuracy and improved pass@k at scale across five mathematical benchmarks, plus gains in an industrial search ads relevance task through better data annotation quality for downstream models.

Significance. If the central mechanism holds, the work addresses a practically important limitation in current RLVR approaches: improved sampling efficiency at the cost of reduced coverage of reasoning patterns at large k. The asymmetric negative-dominant design combined with variance scaling represents a targeted attempt to preserve exploration, and the large-scale industrial application in search ads relevance provides evidence of real-world utility beyond academic benchmarks. Credit is due for grounding the method in verifiable rewards and for reporting both benchmark and production outcomes.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): The central claims of SOTA accuracy and consistent pass@k gains are asserted without any quantitative metrics, baseline comparisons, ablation results, or statistical tests visible in the high-level description. The results must include concrete tables (e.g., pass@1/pass@8/pass@64 values on the five benchmarks) and controls showing that the variance scaling specifically improves coverage of rare correct paths rather than simply reweighting existing ones.
[§3] §3 (Method): The group advantage formulation with intra-group variance scaling is described qualitatively as allowing focus on rare correct paths, but no derivation or gradient analysis is supplied for the case where correct trajectories appear infrequently within sampled groups. This is load-bearing for the claim that the asymmetry counteracts boundary shrinkage; without it, the skeptic concern that variance scaling may still down-weight updates for low-probability positives remains unaddressed.
[§3.2 and §5] §3.2 and §5: The negative-dominant reinforcement strategy is presented as maintaining base-model exploration capacity, yet the manuscript supplies no analysis or empirical check (e.g., entropy or coverage metrics at large k) demonstrating that the combined update rule does not reduce the probability mass on fundamentally new reasoning patterns that appear only in small fractions of groups.

minor comments (2)

[§3] Notation in §3: Define the exact functional form of the asymmetric advantage (positive vs. negative components) with an equation label so readers can trace how variance scaling interacts with the negative-dominant term.
[§4] Reproducibility: Report the group size, number of groups per update, and variance computation details (e.g., whether it is normalized across the batch) to allow independent verification of the industrial and benchmark results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We have revised the manuscript to address all major points raised, including adding quantitative tables, formal derivations, and additional empirical analyses. Our point-by-point responses are as follows.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claims of SOTA accuracy and consistent pass@k gains are asserted without any quantitative metrics, baseline comparisons, ablation results, or statistical tests visible in the high-level description. The results must include concrete tables (e.g., pass@1/pass@8/pass@64 values on the five benchmarks) and controls showing that the variance scaling specifically improves coverage of rare correct paths rather than simply reweighting existing ones.

Authors: We agree with the need for more explicit presentation of results. The revised manuscript includes an updated abstract with key performance numbers and a new table in §4 detailing pass@1, pass@8, and pass@64 accuracies across the five mathematical benchmarks, with direct comparisons to prior SOTA methods. We have also included ablation experiments that control for the variance scaling component, using metrics such as the number of unique correct reasoning trajectories discovered at high k to demonstrate that it promotes coverage of rare paths rather than just reweighting frequent ones. Statistical significance is assessed via multiple runs with reported standard deviations and p-values. revision: yes
Referee: [§3] §3 (Method): The group advantage formulation with intra-group variance scaling is described qualitatively as allowing focus on rare correct paths, but no derivation or gradient analysis is supplied for the case where correct trajectories appear infrequently within sampled groups. This is load-bearing for the claim that the asymmetry counteracts boundary shrinkage; without it, the skeptic concern that variance scaling may still down-weight updates for low-probability positives remains unaddressed.

Authors: We have added a mathematical derivation and gradient analysis to §3 in the revision. Specifically, we derive the expected gradient for positive samples under low frequency in groups, showing that the division by intra-group standard deviation increases the effective learning rate for rare high-reward trajectories. This counters the potential down-weighting issue and provides the formal support for how AGPO mitigates reasoning boundary shrinkage. revision: yes
Referee: [§3.2 and §5] §3.2 and §5: The negative-dominant reinforcement strategy is presented as maintaining base-model exploration capacity, yet the manuscript supplies no analysis or empirical check (e.g., entropy or coverage metrics at large k) demonstrating that the combined update rule does not reduce the probability mass on fundamentally new reasoning patterns that appear only in small fractions of groups.

Authors: In response, we have incorporated new empirical evaluations in the revised §5. These include measurements of output entropy over large sample sets (k=64) and coverage of distinct reasoning patterns (quantified by clustering of solution embeddings or unique answer paths). The data shows that AGPO maintains or increases these metrics relative to the base model and standard RLVR, indicating preservation of exploration for novel patterns. This analysis is now presented with figures to substantiate the claim. revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on proposed method and experiments without self-referential reduction

full rationale

The provided abstract and context describe AGPO as a novel combination of negative-dominant reinforcement and intra-group variance scaling to address observed boundary shrinkage in RLVR. No equations, fitted parameters, or derivations are shown that reduce the claimed improvement to a self-defined quantity or prior self-citation. The central premise is presented as an empirical proposal supported by benchmark results and industrial application, with no load-bearing step that collapses by construction to its inputs. This is the expected non-finding for a methods paper whose contribution is algorithmic rather than a closed-form derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or explicit assumptions; therefore the ledger is empty pending full text.

pith-pipeline@v0.9.0 · 5507 in / 1102 out tokens · 31860 ms · 2026-05-08T11:20:49.163182+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

This means that each subset Ai must contain pairs of numbers that sum up to all integers starting from 15

Understanding the Problem:For each subset Ai, we need to ensure that for any n≥15 , there exist two distinct elements a, b∈A i such that a+b=n . This means that each subset Ai must contain pairs of numbers that sum up to all integers starting from 15

work page
[2]

One way to do this is to consider the parity (odd or even) of the numbers

Constructing Subsets:To achieve this, we need to carefully construct the subsets. One way to do this is to consider the parity (odd or even) of the numbers. For example, we can partition the positive integers into two subsets: •A 1 contains all odd numbers. •A 2 contains all even numbers. However, this simple partition doesn’t work because forn= 15 , we n...

work page
[3]

One possible way is to partition the positive integers into three subsets: •A 1 contains numbers of the form3k+ 1

Further Partitioning:We need to further partition the subsets to ensure that for each n≥15 , there exist two distinct elements in the same subset that sum to n. One possible way is to partition the positive integers into three subsets: •A 1 contains numbers of the form3k+ 1. •A 2 contains numbers of the form3k+ 2. •A 3 contains numbers of the form3k. Let’...

work page
[4]

Generalizing:We can generalize this idea to show that we can partition the positive integers into k= 3 subsets and satisfy the condition. If we try to partition into more than 3 subsets, we would need to ensure that for each n≥15 , there exist two distinct elements in the same subset that sum to n, which becomes increasingly complex and may not be possibl...

work page
[5]

Understand the Problem:We need to partition the set of positive integers into k subsets, such that each subset Ai contains pairs of distinct elements whose sum is any integern≥15

work page
[6]

Clearly, this is possible since we can always find two distinct elements ofA 1 whose sum isnforn≥15

Consider Small Values ofk: • If k= 1 , then we need to partition all positive integers into one subset A1. Clearly, this is possible since we can always find two distinct elements ofA 1 whose sum isnforn≥15. • If k= 2 , we can partition the positive integers into odd and even numbers. This works since for any n≥15 , we can find two odd numbers or two even...

work page
[7]

For k≥5 , we need to check if it’s possible to ensure that for alln≥15 , there exist two distinct elements in each subset whose sum is n

Generalize for k:We can generalize this to k subsets, where each subset contains numbers that are congruent to the same residue modulo k. For k≥5 , we need to check if it’s possible to ensure that for alln≥15 , there exist two distinct elements in each subset whose sum is n. However, for k= 5 , we can’t ensure that for alln≥15 , there exist two distinct e...

work page
[8]

•fevaluates to zero if and only if the points form the vertices of a regular2n-gon in some order or are all equal

Understanding the Polynomial: • The polynomial f is a 4n-variable polynomial, meaning it takes 2n x-coordinates and 2n y-coordinates as inputs. •fevaluates to zero if and only if the points form the vertices of a regular2n-gon in some order or are all equal

work page
[9]

• The polynomial must be able to distinguish between the vertices of a regular 2n-gon and other configurations of points

Degree of the Polynomial: • The polynomialfmust be symmetric under rotations and reflections of the regular2n-gon. • The polynomial must be able to distinguish between the vertices of a regular 2n-gon and other configurations of points

work page
[10]

苹果手机"，sku对应产品为

Smallest Possible Degree: • The smallest degree of f can be determined by considering the properties of a regular 2n-gon. The polynomial must be able to capture the geometric constraints of the regular2n-gon. • The degree of the polynomial is influenced by the fact that it must be able to distinguish between the vertices of a regular2n-gon and other confi...

work page
[11]

First, the core function of RLVR is not the creation of fundamentally new reasoning capabilities, but rather the efficient filtering of reasoning paths

Insights into the RLVR Mechanism.Our research, together with existing research, reveals several important insights into RLVR (Yue et al., 2025; Zhu et al., 2025). First, the core function of RLVR is not the creation of fundamentally new reasoning capabilities, but rather the efficient filtering of reasoning paths. Second, prioritizing Negative Sample Rein...

work page 2025
[12]

The Constraint of Initial Capability (Cold-Start Problem).A significant limitation of RLVR methods is their heavy reliance on the base model’s initial exploration capacity. Our experiments on theAIME-2025 benchmark with the Llama-3.1-8B-Instruct model demonstrate that when the initial Pass@1 accuracy is near zero, the algorithm struggles to encounter suff...

work page 2025
[13]

Instability in Long-Term Training.Although AGPO demonstrates superior entropy maintenance and short-term optimization efficiency, we observe that it faces stability challenges during extended training. Specifically, when the KL divergence penalty is minimized or removed to maximize performance, extensive training over hundreds of gradient steps can lead t...

work page
[14]

Potential Synergy with Scalable RL Framework.An interesting avenue for future research involves exploring the interplay between AGPO and emerging system-level optimizations, such as DAPO (Yu et al., 2025). While DAPO primarily focuses on enhancing training throughput and update dynamics through dynamic sampling and decoupled clipping, AGPO introduces a di...

work page 2025

[1] [1]

This means that each subset Ai must contain pairs of numbers that sum up to all integers starting from 15

Understanding the Problem:For each subset Ai, we need to ensure that for any n≥15 , there exist two distinct elements a, b∈A i such that a+b=n . This means that each subset Ai must contain pairs of numbers that sum up to all integers starting from 15

work page

[2] [2]

One way to do this is to consider the parity (odd or even) of the numbers

Constructing Subsets:To achieve this, we need to carefully construct the subsets. One way to do this is to consider the parity (odd or even) of the numbers. For example, we can partition the positive integers into two subsets: •A 1 contains all odd numbers. •A 2 contains all even numbers. However, this simple partition doesn’t work because forn= 15 , we n...

work page

[3] [3]

One possible way is to partition the positive integers into three subsets: •A 1 contains numbers of the form3k+ 1

Further Partitioning:We need to further partition the subsets to ensure that for each n≥15 , there exist two distinct elements in the same subset that sum to n. One possible way is to partition the positive integers into three subsets: •A 1 contains numbers of the form3k+ 1. •A 2 contains numbers of the form3k+ 2. •A 3 contains numbers of the form3k. Let’...

work page

[4] [4]

Generalizing:We can generalize this idea to show that we can partition the positive integers into k= 3 subsets and satisfy the condition. If we try to partition into more than 3 subsets, we would need to ensure that for each n≥15 , there exist two distinct elements in the same subset that sum to n, which becomes increasingly complex and may not be possibl...

work page

[5] [5]

Understand the Problem:We need to partition the set of positive integers into k subsets, such that each subset Ai contains pairs of distinct elements whose sum is any integern≥15

work page

[6] [6]

Clearly, this is possible since we can always find two distinct elements ofA 1 whose sum isnforn≥15

Consider Small Values ofk: • If k= 1 , then we need to partition all positive integers into one subset A1. Clearly, this is possible since we can always find two distinct elements ofA 1 whose sum isnforn≥15. • If k= 2 , we can partition the positive integers into odd and even numbers. This works since for any n≥15 , we can find two odd numbers or two even...

work page

[7] [7]

For k≥5 , we need to check if it’s possible to ensure that for alln≥15 , there exist two distinct elements in each subset whose sum is n

Generalize for k:We can generalize this to k subsets, where each subset contains numbers that are congruent to the same residue modulo k. For k≥5 , we need to check if it’s possible to ensure that for alln≥15 , there exist two distinct elements in each subset whose sum is n. However, for k= 5 , we can’t ensure that for alln≥15 , there exist two distinct e...

work page

[8] [8]

•fevaluates to zero if and only if the points form the vertices of a regular2n-gon in some order or are all equal

Understanding the Polynomial: • The polynomial f is a 4n-variable polynomial, meaning it takes 2n x-coordinates and 2n y-coordinates as inputs. •fevaluates to zero if and only if the points form the vertices of a regular2n-gon in some order or are all equal

work page

[9] [9]

• The polynomial must be able to distinguish between the vertices of a regular 2n-gon and other configurations of points

Degree of the Polynomial: • The polynomialfmust be symmetric under rotations and reflections of the regular2n-gon. • The polynomial must be able to distinguish between the vertices of a regular 2n-gon and other configurations of points

work page

[10] [10]

苹果手机"，sku对应产品为

Smallest Possible Degree: • The smallest degree of f can be determined by considering the properties of a regular 2n-gon. The polynomial must be able to capture the geometric constraints of the regular2n-gon. • The degree of the polynomial is influenced by the fact that it must be able to distinguish between the vertices of a regular2n-gon and other confi...

work page

[11] [11]

First, the core function of RLVR is not the creation of fundamentally new reasoning capabilities, but rather the efficient filtering of reasoning paths

Insights into the RLVR Mechanism.Our research, together with existing research, reveals several important insights into RLVR (Yue et al., 2025; Zhu et al., 2025). First, the core function of RLVR is not the creation of fundamentally new reasoning capabilities, but rather the efficient filtering of reasoning paths. Second, prioritizing Negative Sample Rein...

work page 2025

[12] [12]

The Constraint of Initial Capability (Cold-Start Problem).A significant limitation of RLVR methods is their heavy reliance on the base model’s initial exploration capacity. Our experiments on theAIME-2025 benchmark with the Llama-3.1-8B-Instruct model demonstrate that when the initial Pass@1 accuracy is near zero, the algorithm struggles to encounter suff...

work page 2025

[13] [13]

Instability in Long-Term Training.Although AGPO demonstrates superior entropy maintenance and short-term optimization efficiency, we observe that it faces stability challenges during extended training. Specifically, when the KL divergence penalty is minimized or removed to maximize performance, extensive training over hundreds of gradient steps can lead t...

work page

[14] [14]

Potential Synergy with Scalable RL Framework.An interesting avenue for future research involves exploring the interplay between AGPO and emerging system-level optimizations, such as DAPO (Yu et al., 2025). While DAPO primarily focuses on enhancing training throughput and update dynamics through dynamic sampling and decoupled clipping, AGPO introduces a di...

work page 2025