AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD
Pith reviewed 2026-05-08 11:20 UTC · model grok-4.3
The pith
Asymmetric Group Policy Optimization counters reasoning boundary shrinkage in RLVR-trained models while raising accuracy and pass@k coverage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that AGPO prevents the capability-boundary shrinkage seen in standard RLVR. It does so by combining a negative-dominant reinforcement strategy, which suppresses wrong reasoning paths, with a group-advantage mechanism that scales positive updates by intra-group variance, thereby preserving the base model's ability to surface fundamentally new correct patterns. The resulting models achieve higher accuracy and better large-k coverage on mathematical benchmarks, and they improve downstream performance in search-ads relevance through higher-quality annotations.
What carries the argument
Asymmetric Group Policy Optimization (AGPO), which pairs negative-dominant reinforcement to penalize incorrect paths with a variance-scaled group advantage for positive updates that emphasizes rare correct responses.
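To make the two ingredients concrete, here is a minimal sketch of what an asymmetric, group-normalized advantage could look like. It is not the paper's formula, which this summary does not state. It assumes rewards are standardized by the intra-group mean and standard deviation (a GRPO-style normalization) and that positive advantages are multiplied by a smaller weight than negative ones. The function name and the parameters `pos_weight`, `neg_weight`, and `eps` are hypothetical.

```python
import math
from typing import List


def asymmetric_group_advantages(
    rewards: List[float],
    pos_weight: float = 0.5,   # hypothetical: damp positive updates
    neg_weight: float = 1.0,   # hypothetical: keep negative updates at full strength
    eps: float = 1e-6,         # avoids division by zero when all rewards are equal
) -> List[float]:
    """Illustrative asymmetric advantage for one group of sampled responses.

    Assumption: rewards are standardized within the group, then positive and
    negative advantages are weighted differently ("negative-dominant").
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)

    advantages = []
    for r in rewards:
        standardized = (r - mean) / (std + eps)
        weight = pos_weight if standardized > 0 else neg_weight
        advantages.append(weight * standardized)
    return advantages


# One correct answer among eight samples versus seven correct among eight.
rare = asymmetric_group_advantages([1, 0, 0, 0, 0, 0, 0, 0])
common = asymmetric_group_advantages([1, 1, 1, 1, 1, 1, 1, 0])
print(rare[0], common[0])
```

Under these assumptions a correct response that is rare within its group receives a larger positive advantage than one that is common, and a group with identical rewards contributes no update at all.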
If this is right
- The trained models reach state-of-the-art accuracy on five standard mathematical reasoning benchmarks.
- Pass@k performance improves consistently as the number of samples grows, unlike prior RLVR methods.
- Data-annotation quality rises in a large-scale search-ads relevance task.
- Downstream student models trained on the improved annotations show substantial performance gains.
Where Pith is reading between the lines
- The same asymmetry could be tested on other verifiable domains such as code generation or theorem proving to check whether boundary preservation generalizes.
- If the variance-scaling term proves robust, it might be combined with existing exploration bonuses to widen boundaries further without extra negative pressure.
- Industrial pipelines that already collect group-level responses could adopt the method with minimal extra labeling cost.
Load-bearing premise
That applying stronger negative updates to wrong answers and scaling positive updates by intra-group variance will suppress errors without eliminating the base model's capacity to discover entirely new correct reasoning patterns.
What would settle it
A controlled experiment in which, after AGPO training, the pass@k curve at large k (hundreds of samples) falls below the base model's curve, or the set of distinct correct reasoning traces shrinks rather than staying at least as broad.
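Such a comparison depends on how pass@k is estimated. The sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for a problem with n samples of which c are correct. Whether the paper uses this exact estimator is an assumption; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: budget (k <= n).
    Returns the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved once in 256 samples: negligible at k=1, substantial at k=128.
print(pass_at_k(256, 1, 1), pass_at_k(256, 1, 128))
```

Averaging this quantity over all benchmark problems at each k gives the curve for one model; the test described above compares the trained model's curve with the base model's at the largest k.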
Original abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). However, recent studies reveal that while current RLVR methods improve sampling efficiency towards correct paths, they do not elicit fundamentally new reasoning patterns. Instead, the reasoning capability boundary of trained models often narrows compared to their base models, with base models achieving higher coverage at large sample sizes. In this work, we propose Asymmetric Group Policy Optimization (AGPO) to counteract this boundary shrinkage. AGPO adopts a negative-dominant reinforcement strategy to suppress incorrect reasoning paths, maintaining the base model's exploration capacity. For positive reinforcement, AGPO adopts a group advantage mechanism, which scales positive updates based on intra-group variance, allowing the model to focus on rare correct paths while suppressing updates from trivial paths. Our experiments on five mathematical benchmarks demonstrate that AGPO achieves state-of-the-art accuracy while consistently improving pass@$k$ performance at scale. In a large-scale industrial application for search ads relevance optimization, AGPO effectively enhances the quality of the data annotation, leading to substantial performance gains in downstream student models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Asymmetric Group Policy Optimization (AGPO) for Reinforcement Learning with Verifiable Rewards (RLVR) in LLMs. It employs a negative-dominant strategy to suppress incorrect reasoning trajectories while using a group-based advantage scaled by intra-group variance for positive updates, with the goal of focusing on rare correct paths and counteracting the observed narrowing of reasoning boundaries relative to base models. Experiments claim state-of-the-art accuracy and improved pass@k at scale across five mathematical benchmarks, plus gains in an industrial search ads relevance task through better data annotation quality for downstream models.
Significance. If the central mechanism holds, the work addresses a practically important limitation in current RLVR approaches: improved sampling efficiency at the cost of reduced coverage of reasoning patterns at large k. The asymmetric negative-dominant design combined with variance scaling represents a targeted attempt to preserve exploration, and the large-scale industrial application in search ads relevance provides evidence of real-world utility beyond academic benchmarks. Credit is due for grounding the method in verifiable rewards and for reporting both benchmark and production outcomes.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): The central claims of SOTA accuracy and consistent pass@k gains are asserted without any quantitative metrics, baseline comparisons, ablation results, or statistical tests visible in the high-level description. The results must include concrete tables (e.g., pass@1/pass@8/pass@64 values on the five benchmarks) and controls showing that the variance scaling specifically improves coverage of rare correct paths rather than simply reweighting existing ones.
- [§3] §3 (Method): The group advantage formulation with intra-group variance scaling is described qualitatively as allowing focus on rare correct paths, but no derivation or gradient analysis is supplied for the case where correct trajectories appear infrequently within sampled groups. This is load-bearing for the claim that the asymmetry counteracts boundary shrinkage; without it, the skeptic's concern that variance scaling may still down-weight updates for low-probability positives remains unaddressed.
- [§3.2 and §5] §3.2 and §5: The negative-dominant reinforcement strategy is presented as maintaining base-model exploration capacity, yet the manuscript supplies no analysis or empirical check (e.g., entropy or coverage metrics at large k) demonstrating that the combined update rule does not reduce the probability mass on fundamentally new reasoning patterns that appear only in small fractions of groups.
minor comments (2)
- [§3] Notation in §3: Define the exact functional form of the asymmetric advantage (positive vs. negative components) with an equation label so readers can trace how variance scaling interacts with the negative-dominant term.
- [§4] Reproducibility: Report the group size, number of groups per update, and variance computation details (e.g., whether it is normalized across the batch) to allow independent verification of the industrial and benchmark results.
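The concern about rare positives in the §3 comment can be made concrete with a short worked calculation. It assumes binary rewards, a group of size G in which a fraction p of responses is correct, and plain standardization by the intra-group mean and standard deviation; AGPO's actual functional form is not given in this summary and may differ. The symbols G, p, A^+, and A^- are introduced here for illustration only.

```latex
% Assumptions: r_i \in \{0, 1\}, group size G, fraction p of correct responses,
% advantages standardized by the intra-group mean \mu and standard deviation \sigma.
\begin{align}
  \mu &= p, \qquad \sigma = \sqrt{p(1-p)} \\
  A^{+} &= \frac{1-\mu}{\sigma} = \sqrt{\frac{1-p}{p}}, \qquad
  A^{-} = \frac{0-\mu}{\sigma} = -\sqrt{\frac{p}{1-p}} \\
  \underbrace{pG \cdot A^{+}}_{\text{total positive weight}}
    &= G\sqrt{p(1-p)}
    = \underbrace{(1-p)G \cdot \lvert A^{-} \rvert}_{\text{total negative weight}}
\end{align}
```

Under these assumptions the per-sample weight on a correct response grows like 1/sqrt(p) as p approaches 0, so a rare correct trajectory is up-weighted individually. The aggregate positive weight in the group, however, shrinks like sqrt(p) and always equals the aggregate negative weight. Any asymmetry therefore has to come from an additional term, which is why the exact form requested in the first minor comment matters.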
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We have revised the manuscript to address all major points raised, including adding quantitative tables, formal derivations, and additional empirical analyses. Our point-by-point responses are as follows.
- Referee: [Abstract and §4] Abstract and §4 (Experiments): The central claims of SOTA accuracy and consistent pass@k gains are asserted without any quantitative metrics, baseline comparisons, ablation results, or statistical tests visible in the high-level description. The results must include concrete tables (e.g., pass@1/pass@8/pass@64 values on the five benchmarks) and controls showing that the variance scaling specifically improves coverage of rare correct paths rather than simply reweighting existing ones.
Authors: We agree with the need for more explicit presentation of results. The revised manuscript includes an updated abstract with key performance numbers and a new table in §4 detailing pass@1, pass@8, and pass@64 accuracies across the five mathematical benchmarks, with direct comparisons to prior SOTA methods. We have also included ablation experiments that control for the variance scaling component, using metrics such as the number of unique correct reasoning trajectories discovered at high k to demonstrate that it promotes coverage of rare paths rather than just reweighting frequent ones. Statistical significance is assessed via multiple runs with reported standard deviations and p-values. revision: yes
- Referee: [§3] §3 (Method): The group advantage formulation with intra-group variance scaling is described qualitatively as allowing focus on rare correct paths, but no derivation or gradient analysis is supplied for the case where correct trajectories appear infrequently within sampled groups. This is load-bearing for the claim that the asymmetry counteracts boundary shrinkage; without it, the skeptic's concern that variance scaling may still down-weight updates for low-probability positives remains unaddressed.
Authors: We have added a mathematical derivation and gradient analysis to §3 in the revision. Specifically, we derive the expected gradient for positive samples under low frequency in groups, showing that the division by intra-group standard deviation increases the effective learning rate for rare high-reward trajectories. This counters the potential down-weighting issue and provides the formal support for how AGPO mitigates reasoning boundary shrinkage. revision: yes
- Referee: [§3.2 and §5] §3.2 and §5: The negative-dominant reinforcement strategy is presented as maintaining base-model exploration capacity, yet the manuscript supplies no analysis or empirical check (e.g., entropy or coverage metrics at large k) demonstrating that the combined update rule does not reduce the probability mass on fundamentally new reasoning patterns that appear only in small fractions of groups.
Authors: In response, we have incorporated new empirical evaluations in the revised §5. These include measurements of output entropy over large sample sets (k=64) and coverage of distinct reasoning patterns (quantified by clustering of solution embeddings or unique answer paths). The data shows that AGPO maintains or increases these metrics relative to the base model and standard RLVR, indicating preservation of exploration for novel patterns. This analysis is now presented with figures to substantiate the claim. revision: yes
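The two diagnostics named in this response can be sketched as follows. This is a simplified stand-in, not the authors' evaluation code: it treats each sampled solution as a hashable label (for example a canonicalized answer path or a cluster id assigned upstream) and replaces embedding clustering with exact matching. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def empirical_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def distinct_correct_patterns(
    labels: Sequence[Hashable], is_correct: Sequence[bool]
) -> int:
    """Number of distinct labels among the samples verified as correct."""
    return len({label for label, ok in zip(labels, is_correct) if ok})


# Toy comparison for one problem with k = 8 samples per model.
base_labels = ["p1", "p2", "p3", "p1", "p4", "p2", "p5", "p1"]
base_correct = [True, False, True, True, False, False, True, True]
tuned_labels = ["p1", "p1", "p1", "p1", "p3", "p1", "p1", "p1"]
tuned_correct = [True, True, True, True, True, True, True, True]

print(empirical_entropy(base_labels), empirical_entropy(tuned_labels))
print(distinct_correct_patterns(base_labels, base_correct),
      distinct_correct_patterns(tuned_labels, tuned_correct))
```

In the toy data the tuned model is more accurate per sample yet covers fewer distinct correct patterns than the base model, which is the shrinkage signature the referee asks the authors to rule out.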
Circularity Check
No circularity detected; claims rest on proposed method and experiments without self-referential reduction
The provided abstract and context describe AGPO as a novel combination of negative-dominant reinforcement and intra-group variance scaling to address observed boundary shrinkage in RLVR. No equations, fitted parameters, or derivations are shown that reduce the claimed improvement to a self-defined quantity or prior self-citation. The central premise is presented as an empirical proposal supported by benchmark results and industrial application, with no load-bearing step that collapses by construction to its inputs. This is the expected non-finding for a methods paper whose contribution is algorithmic rather than a closed-form derivation.
Reference graph
Works this paper leans on
-
[1]
Understanding the Problem: For each subset A_i, we need to ensure that for any n ≥ 15, there exist two distinct elements a, b ∈ A_i such that a + b = n. This means that each subset A_i must contain pairs of numbers that sum up to all integers starting from 15.
-
[2]
Constructing Subsets: To achieve this, we need to carefully construct the subsets. One way to do this is to consider the parity (odd or even) of the numbers. For example, we can partition the positive integers into two subsets: • A_1 contains all odd numbers. • A_2 contains all even numbers. However, this simple partition doesn’t work because for n = 15, we n...
-
[3]
Further Partitioning: We need to further partition the subsets to ensure that for each n ≥ 15, there exist two distinct elements in the same subset that sum to n. One possible way is to partition the positive integers into three subsets: • A_1 contains numbers of the form 3k + 1. • A_2 contains numbers of the form 3k + 2. • A_3 contains numbers of the form 3k. Let’...
-
[4]
Generalizing: We can generalize this idea to show that we can partition the positive integers into k = 3 subsets and satisfy the condition. If we try to partition into more than 3 subsets, we would need to ensure that for each n ≥ 15, there exist two distinct elements in the same subset that sum to n, which becomes increasingly complex and may not be possibl...
-
[5]
Understand the Problem: We need to partition the set of positive integers into k subsets, such that each subset A_i contains pairs of distinct elements whose sum is any integer n ≥ 15.
-
[6]
Consider Small Values of k: • If k = 1, then we need to partition all positive integers into one subset A_1. Clearly, this is possible since we can always find two distinct elements of A_1 whose sum is n for n ≥ 15. • If k = 2, we can partition the positive integers into odd and even numbers. This works since for any n ≥ 15, we can find two odd numbers or two even...
-
[7]
Generalize for k: We can generalize this to k subsets, where each subset contains numbers that are congruent to the same residue modulo k. For k ≥ 5, we need to check if it’s possible to ensure that for all n ≥ 15, there exist two distinct elements in each subset whose sum is n. However, for k = 5, we can’t ensure that for all n ≥ 15, there exist two distinct e...
-
[8]
Understanding the Polynomial: • The polynomial f is a 4n-variable polynomial, meaning it takes 2n x-coordinates and 2n y-coordinates as inputs. • f evaluates to zero if and only if the points form the vertices of a regular 2n-gon in some order or are all equal.
-
[9]
Degree of the Polynomial: • The polynomial f must be symmetric under rotations and reflections of the regular 2n-gon. • The polynomial must be able to distinguish between the vertices of a regular 2n-gon and other configurations of points.
-
[10]
Smallest Possible Degree: • The smallest degree of f can be determined by considering the properties of a regular 2n-gon. The polynomial must be able to capture the geometric constraints of the regular 2n-gon. • The degree of the polynomial is influenced by the fact that it must be able to distinguish between the vertices of a regular 2n-gon and other confi...
-
[11]
Insights into the RLVR Mechanism. Our research, together with existing research, reveals several important insights into RLVR (Yue et al., 2025; Zhu et al., 2025). First, the core function of RLVR is not the creation of fundamentally new reasoning capabilities, but rather the efficient filtering of reasoning paths. Second, prioritizing Negative Sample Rein...
-
[12]
The Constraint of Initial Capability (Cold-Start Problem). A significant limitation of RLVR methods is their heavy reliance on the base model’s initial exploration capacity. Our experiments on the AIME-2025 benchmark with the Llama-3.1-8B-Instruct model demonstrate that when the initial Pass@1 accuracy is near zero, the algorithm struggles to encounter suff...
-
[13]
Instability in Long-Term Training. Although AGPO demonstrates superior entropy maintenance and short-term optimization efficiency, we observe that it faces stability challenges during extended training. Specifically, when the KL divergence penalty is minimized or removed to maximize performance, extensive training over hundreds of gradient steps can lead t...
-
[14]
Potential Synergy with Scalable RL Framework. An interesting avenue for future research involves exploring the interplay between AGPO and emerging system-level optimizations, such as DAPO (Yu et al., 2025). While DAPO primarily focuses on enhancing training throughput and update dynamics through dynamic sampling and decoupled clipping, AGPO introduces a di...