pith. machine review for the scientific record.

arxiv: 2603.14389 · v2 · submitted 2026-03-15 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

From log π to π: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 11:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords Reinforcement Learning · Policy Optimization · Soft Clipping · Gradient Stability · Importance Sampling · Large Language Models · RLVR

The pith

Using raw probability gradients with asymmetric importance-ratio decay stabilizes soft clipping and sustains exploration in LLM reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that soft-clipping methods in RLVR for LLMs become unstable because log-probability gradients diverge when token probabilities approach zero. It replaces those gradients with direct probability gradients and adds a decoupled decay rule that applies continuous, asymmetric shrinkage only to boundary tokens according to importance sampling ratios. This combination keeps gradients bounded while still allowing tokens outside the nominal trust region to contribute, removing the exploration penalty imposed by hard clipping. Experiments on DeepSeek-R1-Distill-Qwen models from 1.5B to 14B parameters show consistent gains on mathematical reasoning benchmarks.
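
A minimal numerical sketch of the failure mode (our illustration, not the authors' released code): since $\nabla_\theta \pi_\theta = \pi_\theta\,\nabla_\theta \log \pi_\theta$, the factor that converts a probability gradient into a log-probability gradient is $1/\pi$, and that factor diverges on exactly the low-probability tokens that long reasoning chains produce.

    # Illustration only, not from the paper's code: the chain-rule factor
    # d(log pi)/d(pi) = 1/pi blows up as pi -> 0, while d(pi)/d(pi) = 1
    # stays bounded on (0, 1). This is the divergence DGPO sidesteps by
    # treating pi itself, rather than log pi, as the optimization primitive.
    for pi in [1e-1, 1e-3, 1e-6, 1e-9]:
        log_space_factor = 1.0 / pi   # weight picked up when working in log space
        prob_space_factor = 1.0       # weight when pi is the primitive
        print(f"pi = {pi:.0e}   1/pi = {log_space_factor:.1e}   d pi/d pi = {prob_space_factor}")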

Core claim

DGPO establishes the probability gradient as the superior optimization primitive and employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration.

What carries the argument

A decoupled decay mechanism that applies asymmetric, continuous shrinkage to boundary tokens according to importance-sampling ratios, operating on raw probability gradients rather than log-probability gradients.
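
A sketch of what such a mechanism could look like in code. The functional forms and the names eps_low, eps_high, n, and m below are our assumptions, loosely modeled on the boundary-coefficient excerpts quoted in the Lean section further down, not the paper's released implementation:

    def decay_weight(ratio, eps_low=0.2, eps_high=0.2, n=2, m=2):
        """Hypothetical asymmetric, continuous decay for boundary tokens.

        ratio is the token-level importance sampling ratio
        pi_theta / pi_theta_old. Inside the trust region
        [1 - eps_low, 1 + eps_high] the gradient passes at full weight;
        outside it, the weight decays continuously toward zero instead
        of being hard-clipped to zero, with a different law per side.
        """
        if ratio < 1.0 - eps_low:
            return (ratio / (1.0 - eps_low)) ** n           # left: power-law decay
        if ratio > 1.0 + eps_high:
            return ((1.0 + eps_high) / ratio) ** (1.0 / m)  # right: slower radical decay
        return 1.0                                          # inside: no decay

The weight equals 1 at both edges of the trust region, so the decay is continuous, and the separate exponents n and m let the two sides shrink at different rates, which is the asymmetry the paper emphasizes.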

If this is right

  • Tokens outside the trust region can still influence the update without causing weight explosion.
  • Exploration remains active throughout training instead of being progressively suppressed by hard clipping.
  • The same decay schedule works across model sizes from 1.5B to 14B without per-model retuning.
  • RLVR training becomes more robust to the low-probability tokens that arise during reasoning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same probability-gradient-plus-decay pattern could be tested in non-LLM policy optimization domains where vanishing probabilities also appear.
  • Removing hard clipping entirely might allow larger effective trust regions or different ratio thresholds in future variants.
  • The importance-sampling-ratio signal could be reused for adaptive learning-rate scaling rather than only for decay.

Load-bearing premise

Switching to raw probability gradients plus asymmetric decay on boundary tokens will reliably prevent divergence without introducing new instabilities or requiring extensive per-model retuning of the decay schedule.

What would settle it

A training run on one of the DeepSeek-R1-Distill-Qwen models in which DGPO produces higher divergence rates or lower benchmark scores than a hard-clipped GRPO baseline.

read the original abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on log-probability gradient ($\nabla_\theta\log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric, continuous decay to boundary tokens, DGPO resolves the conflict between stability and sustained exploration. Extensive experiments across DeepSeek-R1-Distill-Qwen series models (1.5B/7B/14B) demonstrate that DGPO consistently outperforms strong baselines on various mathematical benchmarks, offering a robust and scalable solution for RLVR. Our code and implementation are available at: https://github.com/FlyTune/DGPO-RL.
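
For contrast with the hard clipping the abstract criticizes, a minimal PyTorch rendering of the standard PPO/GRPO-style clipped surrogate (the textbook form, not the paper's code): once the clipped branch is active, the token's gradient is exactly zero, which is the exploration-stifling behavior described above.

    import torch

    def clipped_surrogate(log_pi, log_pi_old, advantage, eps=0.2):
        # Standard hard clip: min(r * A, clip(r, 1 - eps, 1 + eps) * A)
        ratio = torch.exp(log_pi - log_pi_old)
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        return torch.minimum(ratio * advantage, clipped * advantage)

    # A token far outside the trust region: r = exp(-1 + 5), about 54.6 >> 1 + eps.
    log_pi = torch.tensor([-1.0], requires_grad=True)
    loss = -clipped_surrogate(log_pi, torch.tensor([-5.0]), torch.tensor([1.0]))
    loss.backward()
    print(log_pi.grad)  # tensor([0.]): the clipped token contributes no gradient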

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Decoupled Gradient Policy Optimization (DGPO) to stabilize soft clipping in RLVR for LLMs. It replaces the log-probability gradient ∇logπ with the probability gradient ∇π as the optimization primitive and applies a bilateral decoupled decay mechanism based on importance sampling ratios to boundary tokens. This is claimed to resolve the stability-exploration conflict. Experiments on DeepSeek-R1-Distill-Qwen models (1.5B/7B/14B) report consistent outperformance on mathematical benchmarks, with code released.

Significance. If the substitution preserves an unbiased ascent of the expected-reward objective and the decay schedule is shown to be robust, the method could meaningfully improve training stability for RLVR without sacrificing exploration. The multi-scale experiments and open code are strengths. However, the absence of a re-derivation for unbiasedness and limited quantitative stability metrics leave the central claims under-supported.

major comments (2)
  1. [Method] The substitution of ∇_θ π_θ for ∇_θ log π_θ in the policy gradient estimator lacks any re-derivation showing that the resulting estimator (after bilateral decay) remains unbiased for the original expected-reward objective. The standard score-function identity E[∇logπ · r] = ∇E[r] is altered without an alternative proof or bias analysis.
  2. [Experiments] The abstract and experimental section report outperformance across three model sizes but provide no quantitative details on gradient norms, divergence incidents, or ablation results isolating the decoupled decay components. This leaves the stability claim load-bearing yet unsupported by visible evidence.
minor comments (2)
  1. Specify the exact mathematical form of the importance sampling ratios and decay schedule (including any free parameters) to allow reproduction and assessment of whether the schedule is parameter-free.
  2. [Abstract] List the specific mathematical benchmarks used in the experiments rather than referring to them generically as 'various'.
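
For reference, the conventional token-level form these quantities take in RLVR (standard PPO/GRPO notation, not quoted from the paper): $r_t(\theta) = \pi_\theta(a_t \mid s_{<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(a_t \mid s_{<t})$, with trust region $[1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}]$. The open question the first minor comment raises is the exact decay law applied outside that interval and whether its exponents are free parameters.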

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important points regarding theoretical rigor and experimental support for DGPO. We respond to each major comment below and will incorporate revisions to address them.

read point-by-point responses
  1. Referee: [Method] The substitution of ∇_θ π_θ for ∇_θ log π_θ in the policy gradient estimator lacks any re-derivation showing that the resulting estimator (after bilateral decay) remains unbiased for the original expected-reward objective. The standard score-function identity E[∇logπ · r] = ∇E[r] is altered without an alternative proof or bias analysis.

    Authors: We acknowledge that an explicit re-derivation would strengthen the presentation. In the revised manuscript we will add a formal derivation in the appendix showing that the probability-gradient estimator, when combined with the bilateral decoupled decay applied via importance-sampling ratios, remains an unbiased estimator of the expected-reward gradient. The key step is the identity ∇π_θ = π_θ ∇logπ_θ; the asymmetric continuous decay is constructed to preserve the sign and relative magnitude of the original score-function term for tokens inside and outside the trust region, thereby recovering the unbiasedness property under the modified weighting (see the identity sketched after these responses). revision: yes

  2. Referee: [Experiments] The abstract and experimental section report outperformance across three model sizes but provide no quantitative details on gradient norms, divergence incidents, or ablation results isolating the decoupled decay components. This leaves the stability claim load-bearing yet unsupported by visible evidence.

    Authors: We agree that additional quantitative evidence is needed to substantiate the stability claims. In the revision we will include (i) training curves of gradient norms for both the standard log-probability and the proposed probability-gradient estimators, (ii) counts of divergence incidents (NaN or exploding gradients) across runs, and (iii) ablation tables that isolate the contribution of the bilateral decoupled decay schedule on both stability metrics and final benchmark performance. revision: yes
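
Our rendering of the bookkeeping behind this exchange, in the report's notation (a sketch of the standard argument, not the paper's appendix):

    $\nabla_\theta\,\mathbb{E}_{a\sim\pi_\theta}[r(a)] \;=\; \sum_a r(a)\,\nabla_\theta\pi_\theta(a) \;=\; \mathbb{E}_{a\sim\pi_\theta}\!\left[r(a)\,\nabla_\theta\log\pi_\theta(a)\right]$

An estimator over sampled tokens that weights $\nabla_\theta\pi_\theta(a)$ by a coefficient $c(a)$ has expectation $\sum_a \pi_\theta(a)\,c(a)\,\nabla_\theta\pi_\theta(a)$, so exact unbiasedness forces $c(a) = r(a)/\pi_\theta(a)$, which reintroduces the very $1/\pi$ factor the method avoids. Any bounded, decayed choice of $c$ therefore buys stability at the price of bias, and the promised appendix derivation needs to bound or eliminate that bias rather than only preserve signs and relative magnitudes.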

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The provided abstract and context present DGPO as a rethinking that substitutes probability gradient for log-probability gradient and applies asymmetric decoupled decay via standard importance sampling ratios. No equations, self-citations, or fitted parameters are quoted that reduce the central claim (resolution of divergence while preserving exploration) to an input by construction, such as a renamed fit or a self-referential definition. Importance sampling ratios are a conventional RL primitive, and the decay mechanism is described as a proposed solution rather than a statistically forced prediction or ansatz smuggled from prior author work. The derivation chain therefore remains self-contained and independent of the target result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that probability gradients are well-behaved under the proposed decay and that importance sampling ratios provide an unbiased signal for boundary tokens. No explicit free parameters are named in the abstract, but the decay mechanism implicitly introduces at least one schedule parameter. No new physical entities are postulated.

free parameters (1)
  • decay schedule parameter
    The asymmetric continuous decay applied to boundary tokens is controlled by at least one tunable schedule whose exact form is not specified in the abstract.
axioms (1)
  • domain assumption Importance sampling ratios remain valid estimators when probabilities approach zero under the new gradient definition.
    Invoked when the paper claims the switch to probability gradients resolves divergence while preserving the correctness of the policy update.

pith-pipeline@v0.9.0 · 5556 in / 1335 out tokens · 30633 ms · 2026-05-15T11:23:31.425292+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    probability exhibits superior geometric symmetry within its value range... (0,1). This boundedness facilitates... symmetric gradient mechanisms. In contrast, log-probabilities span the asymmetric and unbounded interval (−∞,0)

  • IndisputableMonolith/Foundation/BranchSelection.lean · RCLCombiner_isCoupling_iff · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Left Boundary: A positive integer power function of $\pi_\theta$... Right Boundary: A reciprocal radical power function of $\pi_\theta$... $C_{\mathrm{left}} = (1-\varepsilon_{\mathrm{low}})^{-n}\,\pi_{\theta_{\mathrm{old}}}^{-(n+1)}$ ... $C_{\mathrm{right}} = (1+\varepsilon_{\mathrm{high}})^{1/m}\,\pi_{\theta_{\mathrm{old}}}^{1/m-1}$

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Semantic Relevance: Counterfactual Risk Minimization for Robust Retrieval-Augmented Generation

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    CoRM-RAG uses a cognitive perturbation protocol to simulate biases and trains an Evidence Critic to retrieve documents that support correct decisions even under adversarial query changes.