Rethinking the Divergence Regularization in LLM RL

Jiarui Yao; Liefeng Bo; Penghui Qi; Tianyu Pang; Wee Sun Lee; Xiangxin Zhou

arxiv: 2606.09821 · v1 · pith:XQW5AIMXnew · submitted 2026-06-08 · 💻 cs.LG

Pith reviewed 2026-06-27 17:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords divergence regularization · policy optimization · trust region · LLM RL · off-policy learning · gradient weighting

The pith

DRPO replaces DPPO's hard divergence mask with a smooth advantage-weighted quadratic regularizer while preserving the same trust-region geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current divergence-based trust regions in LLM RL discard gradients abruptly once a token's absolute probability shift crosses a boundary. DRPO replaces this hard mask with an advantage-weighted quadratic regularizer on policy shift. The change keeps the trust region identical yet produces bounded continuous gradient weights. These weights attenuate diverging updates and supply corrective signals for harmful moves that would otherwise be ignored. Experiments across scales and settings show improved stability and efficiency without added tuning parameters.

Core claim

DRPO replaces the hard mask in DPPO with a smooth advantage-weighted quadratic regularizer on policy shift. It preserves the same trust-region geometry defined by the sampled token's absolute probability shift while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary.
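The contrast between a hard mask and a smooth weight can be sketched numerically. This is a minimal illustration under stated assumptions, not the paper's implementation: the summary does not give DRPO's exact objective, so the weight `1 - sign(A) * shift / delta` is an assumed form obtained by transplanting an SPO-style advantage-weighted quadratic onto the absolute probability shift. The function names, the threshold `delta`, and the simplified mask rule are all hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def hard_mask_weight(pi_old, pi_new, advantage, delta):
    """Simplified hard mask (assumption): drop the token's gradient once the
    probability shift exceeds delta in the direction the advantage pushes."""
    shift = pi_new - pi_old
    crossed = (advantage > 0 and shift > delta) or (advantage < 0 and shift < -delta)
    return 0.0 if crossed else 1.0


def quadratic_weight(pi_old, pi_new, advantage, delta):
    """Assumed smooth weight: 1 at zero shift, 0 on the boundary, negative
    (corrective) beyond it. Since |shift| <= 1, it stays within
    [1 - 1/delta, 1 + 1/delta]."""
    shift = pi_new - pi_old
    return 1.0 - sign(advantage) * shift / delta


if __name__ == "__main__":
    delta = 0.1  # illustrative radius, not a value from the paper
    for pi_new in (0.50, 0.55, 0.60, 0.70):
        print(pi_new,
              hard_mask_weight(0.5, pi_new, 1.0, delta),
              round(quadratic_weight(0.5, pi_new, 1.0, delta), 6))
```

Under these assumptions the smooth weight shrinks linearly as the token approaches the boundary and changes sign past it, whereas the mask stays at 1 and then drops to 0.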

What carries the argument

Advantage-weighted quadratic regularizer on absolute probability shift

If this is right

DRPO preserves the exact trust-region geometry of DPPO.
Gradient weights remain bounded and continuous rather than abruptly zeroed.
Corrective gradient signals are supplied for updates that cross the boundary in harmful directions.
Training stability and efficiency improve across model scales, architectures, and precision settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regularizer form might substitute for ratio clipping in non-LLM off-policy methods that face long-tailed action spaces.
Empirical measurement of how often the regularizer activates on real training trajectories would quantify its corrective contribution.
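One way the activation measurement suggested above might be set up is sketched here. It is a hypothetical helper, not something described in the paper: it assumes per-token behavior-policy probabilities, current-policy probabilities, and advantages are already logged, and it counts a token as "crossing" when its shift exceeds an assumed radius `delta` in the direction its advantage pushes.

```python
def crossing_rate(pi_old, pi_new, advantages, delta):
    """Fraction of tokens whose probability shift exceeds delta in the
    direction favored by the advantage (hypothetical diagnostic)."""
    if not pi_old:
        return 0.0
    crossed = 0
    for old, new, adv in zip(pi_old, pi_new, advantages):
        shift = new - old
        if (adv > 0 and shift > delta) or (adv < 0 and shift < -delta):
            crossed += 1
    return crossed / len(pi_old)


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    old = [0.50, 0.50, 0.20, 0.010]
    new = [0.70, 0.52, 0.05, 0.012]
    adv = [1.0, 1.0, -1.0, 1.0]
    print(crossing_rate(old, new, adv, delta=0.1))
```

Tracking this fraction per training step would indicate how often a corrective term could matter at all; a value near zero would suggest the hard mask and a smooth regularizer behave almost identically on that run.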

Load-bearing premise

Replacing the hard mask with an advantage-weighted quadratic regularizer on absolute probability shift will maintain the intended trust region without introducing new instabilities or requiring additional tuning parameters that dominate performance.

What would settle it

A controlled comparison on the same LLM, tasks, and hyperparameters where DRPO training shows higher variance or lower final reward than DPPO.

Figures

Figures reproduced from arXiv: 2606.09821 by Jiarui Yao, Liefeng Bo, Penghui Qi, Tianyu Pang, Wee Sun Lee, Xiangxin Zhou.

**Figure 2.** Histogram, cumulative distribution, and absolute probability shift

**Figure 3.** Average accuracy across all main experiment settings on AIME24 and AIME25.

**Figure 4.** Ablation on |Â_t|. Removing this term degrades performance and destabilizes training. To further evaluate the effectiveness of our proposed method, we conduct a series of ablation studies on the design considerations of the regularizer. Advantage weight. In both SPO and DRPO, the regularization term is weighted by the absolute advantage |Â_t|. This weighting ensures that the per-token optimum lies on a…

**Figure 5.** Ablation on alternative divergence metrics. DRPO achieves the best performance.

**Figure 6.** Training dynamics for DRPO and directly applying a KL penalty term without introducing

**Figure 7.** Comparison among experiments applying a KL penalty term or a TV penalty term, with

**Figure 8.** The hyperparameter tuning for KL with advantage weight

**Figure 9.** Training dynamics for different parameters under DPPO, compared to DRPO.

**Figure 10.** Hyperparameter tuning of the coefficient on DRPO.

**Figure 11.** Ablation on applying the DRPO regularizer only outside DPPO’s trust region.

read the original abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
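The abstract's point that the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies can be illustrated with two made-up tokens. This is a toy sketch: the probabilities, the thresholds `EPS` and `DELTA`, and the function names are assumptions chosen for illustration, not values or APIs from the paper.

```python
def ratio_flags(pi_old, pi_new, eps):
    """True if the importance ratio leaves [1 - eps, 1 + eps]."""
    ratio = pi_new / pi_old
    return abs(ratio - 1.0) > eps


def shift_flags(pi_old, pi_new, delta):
    """True if the sampled token's absolute probability shift exceeds delta."""
    return abs(pi_new - pi_old) > delta


# (pi_old, pi_new) pairs; numbers are invented for illustration only.
CASES = {
    "rare token": (1e-5, 1e-3),
    "common token": (0.50, 0.59),
}
EPS, DELTA = 0.2, 0.05  # illustrative thresholds, not from the paper

if __name__ == "__main__":
    for name, (old, new) in CASES.items():
        print(name,
              "ratio=%.2f" % (new / old),
              "shift=%.5f" % abs(new - old),
              "ratio_flag=%s" % ratio_flags(old, new, EPS),
              "shift_flag=%s" % shift_flags(old, new, DELTA))
```

In this toy case the rare token has a huge ratio but moves almost no probability mass, while the common token stays inside the ratio band yet moves far more mass, so the two criteria flag opposite tokens.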

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DRPO swaps DPPO's hard mask for an advantage-weighted quadratic regularizer on probability shift, but the claim of identical trust-region geometry does not obviously follow from the description.

The paper's main move is to replace DPPO's hard mask with a smooth quadratic penalty on absolute probability shift, weighted by advantage. This is meant to keep gradients from being discarded entirely once a token crosses the boundary and instead supply a continuous corrective term.

The change is a direct, narrow extension of the cited DPPO work. It targets the practical issue that hard masking wastes gradient information in long-tailed vocabularies during off-policy LLM training. That limitation is real, and turning the cutoff into a bounded regularizer is a reasonable way to address it without switching back to ratio clipping.

The soft spot is the geometry claim. The abstract states that DRPO preserves the same trust-region geometry as DPPO while also providing corrective signals beyond the boundary. A quadratic regularizer does not enforce a strict cutoff the way a mask does; if the advantage term favors continued movement, the penalty can allow shifts that the original hard mask would block. Nothing in the given description shows an explicit equivalence or limiting argument that keeps the feasible set identical. The stress-test note on this point holds up against the abstract.
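One reading under which the geometry claim would hold can be written out. The derivation below is a sketch under an assumed objective, not the paper's stated formulation: it supposes an SPO-style per-token surrogate in which the ratio term is penalized by an advantage-weighted quadratic in the sampled token's probability shift. The symbols π (current probability of the sampled token), μ (behavior-policy probability of the same token), and δ (trust-region radius) are introduced here for illustration, and a nonzero advantage is assumed.

```latex
% Assumed per-token surrogate (illustrative; requires amsmath for \operatorname).
\[
J_t(\pi) \;=\; \frac{\pi}{\mu}\,\hat{A}_t \;-\; \frac{|\hat{A}_t|}{2\mu\delta}\,(\pi-\mu)^2
\]
% First and second derivatives with respect to \pi.
\[
\frac{\partial J_t}{\partial \pi}
  \;=\; \frac{\hat{A}_t}{\mu}\left(1 - \operatorname{sign}(\hat{A}_t)\,\frac{\pi-\mu}{\delta}\right),
\qquad
\frac{\partial^2 J_t}{\partial \pi^2} \;=\; -\frac{|\hat{A}_t|}{\mu\delta} \;<\; 0 .
\]
% The unique stationary point sits on the boundary, independent of |\hat{A}_t|.
\[
\frac{\partial J_t}{\partial \pi} = 0
\;\iff\;
\pi - \mu \;=\; \operatorname{sign}(\hat{A}_t)\,\delta .
\]
```

On this reading, "same geometry" would mean that the per-token optimum lies at |π − μ| = δ for any nonzero advantage, not that shifts beyond δ are forbidden. Whether the paper argues the point this way cannot be determined from the abstract alone.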

Experiments are described only at the level of "across scales and settings show improvement." No numbers, ablations, or details on how the regularizer interacts with other hyperparameters appear here, so the strength of the empirical support is not yet visible.

This is for people already working on divergence-based or mask-based variants of PPO for LLM post-training. It is a targeted implementation tweak rather than a new framework.

I would send it for peer review. The targeted fix addresses a documented drawback of DPPO, and referees could check the geometry math and the actual results. It is not broad enough to change the field, but the subfield is active enough that a careful review makes sense.

Referee Report

2 major / 0 minor

Summary. The paper proposes Divergence Regularized Policy Optimization (DRPO) to improve upon DPPO for off-policy LLM RL. It replaces DPPO's hard divergence-based mask (on absolute probability shift) with an advantage-weighted quadratic regularizer on policy shift. The central claim is that DRPO preserves exactly the same trust-region geometry as DPPO while inducing bounded continuous gradient weights that attenuate diverging updates and supply corrective signals beyond the boundary, yielding improved stability and efficiency; this is asserted to hold across model scales, architectures, and precision settings based on experiments.

Significance. If the geometry-preservation claim and the absence of new instabilities can be rigorously established, DRPO would offer a principled softening of hard trust-region constraints that retains the distributional-shift focus of DPPO while avoiding abrupt gradient discarding. This could meaningfully advance stable off-policy optimization for long-tailed LLM vocabularies beyond ratio-clipping methods. The manuscript's current presentation, however, provides no quantitative results, ablations, or derivations, leaving the practical impact and correctness of the central claim difficult to evaluate.

major comments (2)

[Abstract] The claim that DRPO 'preserves the same trust-region geometry as DPPO' while also claiming to 'provide corrective signals beyond the boundary' is load-bearing for the contribution, yet appears inconsistent with the given description. DPPO's hard mask sets the gradient contribution to zero once the absolute probability shift crosses the boundary in a harmful direction; an advantage-weighted quadratic regularizer is a soft penalty whose gradient remains non-zero (and can be positive in the diverging direction when the advantage term favors it). No limiting argument, equivalent constrained formulation, or explicit proof that the effective feasible set remains identical is supplied, so the geometry-preservation assertion rests on an unshown equivalence.
[Abstract] The soundness of the empirical claim ('experiments across model scales... show that DRPO improves the stability and efficiency') cannot be assessed because the abstract supplies no quantitative metrics, baseline comparisons, or ablation results on the quadratic coefficient, and no details on how the trust-region boundary is enforced or measured. Without these, the central claim that the new regularizer improves upon DPPO without introducing instabilities cannot be verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to strengthen the presentation of the geometry claim and to include quantitative support in the abstract.

Referee: [Abstract] The claim that DRPO 'preserves the same trust-region geometry as DPPO' while also claiming to 'provide corrective signals beyond the boundary' is load-bearing for the contribution, yet appears inconsistent with the given description. DPPO's hard mask sets the gradient contribution to zero once the absolute probability shift crosses the boundary in a harmful direction; an advantage-weighted quadratic regularizer is a soft penalty whose gradient remains non-zero (and can be positive in the diverging direction when the advantage term favors it). No limiting argument, equivalent constrained formulation, or explicit proof that the effective feasible set remains identical is supplied, so the geometry-preservation assertion rests on an unshown equivalence.

Authors: We thank the referee for this observation. The DRPO formulation uses an advantage-weighted quadratic penalty that is identically zero inside the DPPO boundary (defined by absolute probability shift) and grows outside it, with the weighting ensuring that corrective gradients oppose harmful divergence. While the manuscript states the geometry is preserved, it does not supply the requested limiting argument or constrained equivalence. We will add a short derivation in the revised version (new subsection in Section 3) showing that, in the limit of increasing regularization strength, the soft penalty recovers the hard-mask behavior and identical feasible set. revision: yes
Referee: [Abstract] The soundness of the empirical claim ('experiments across model scales... show that DRPO improves the stability and efficiency') cannot be assessed because the abstract supplies no quantitative metrics, baseline comparisons, or ablation results on the quadratic coefficient, and no details on how the trust-region boundary is enforced or measured. Without these, the central claim that the new regularizer improves upon DPPO without introducing instabilities cannot be verified.

Authors: We agree that the abstract would be more informative with supporting numbers. The full manuscript reports experiments across scales with stability and efficiency metrics versus DPPO, plus ablations on the quadratic coefficient, and defines the boundary via absolute probability shift in Section 3. We will revise the abstract to include concise quantitative results (e.g., stability gains and convergence speedups) and a brief reference to the boundary definition and ablations. revision: yes

Circularity Check

0 steps flagged

No circularity: DRPO is a novel regularizer proposal whose geometry claim is a design assertion, not a reduction by construction.

The paper defines DRPO by substituting DPPO's hard mask with an advantage-weighted quadratic penalty on the same absolute probability shift divergence. This substitution is presented as an explicit design choice that yields bounded continuous weights and corrective signals. No equation reduces the new objective to a prior fitted quantity or renames an input as output. The trust-region preservation statement is a claim about the regularizer's effect rather than an algebraic identity forced by self-definition or self-citation. The derivation therefore stands as an independent proposal whose validity rests on the stated equations and experiments, not on circular reuse of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The proposal rests on the assumption that the quadratic regularizer induces the same trust region as the hard mask without new side effects; no free parameters are explicitly named in the abstract, but the regularization strength is implicitly required. No new entities are postulated.

axioms (1)

Domain assumption: The divergence measure used in DPPO correctly captures distributional shift for long-tailed vocabularies.
The abstract states that the importance ratio is a poor proxy and that DPPO addresses this; DRPO inherits the same divergence definition.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation
cs.LG 2026-06 unverdicted novelty 7.0

TAPO constructs learnable micro-reflective trajectories from contrastive model rollouts during RL training to provide explicit error diagnoses and corrections, reporting consistent gains over GRPO on AIME and HMMT mat...

Reference graph

Works this paper leans on

25 extracted references · 13 linked inside Pith · cited by 1 Pith paper

[1] Philipp Becker, Niklas Freymuth, Serge Thilges, Fabian Otto, and Gerhard Neumann. Troll: Trust regions improve reinforcement learning for large language models. arXiv preprint arXiv:2510.03817.

[2] Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585.

[3] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.

[4] Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556, 2025a. Jiacai Liu, Yingru Li, Yuqian Fu, Jiawei Wang, Qian Liu, and Yu Shen. When speed kills stability: Demystifying rl collaps...

[5] Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via fp16. arXiv preprint arXiv:2510.26788.

[6] Penghui Qi, Xiangxin Zhou, Zichen Liu, Tianyu Pang, Chao Du, Min Lin, and Wee Sun Lee. Rethinking the trust region in llm reinforcement learning. arXiv preprint arXiv:2602.04879.

[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

[8] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

[9] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256.

[10] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.

[11] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025a. Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: ...

[12] Ling Team, Anqi Shen, Baihui Li, Bin Hu, Bin Jing, Cai Chen, Chao Huang, Chao Zhang, Chaokun Yang, Cheng Lin, et al. Every step evolves: Scaling reinforcement learning for trillion-scale thinking model. arXiv preprint arXiv:2510.18855, 2025b. Manan Tomar, Lior Shani, Yonathan Efroni, and Mohammad Ghavamzadeh. Mirror descent policy optimization. In Internati...

[13] URL https://openreview.net/forum?id=aBO5SvgSt1. Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, et al. Beyond the 80/20 rule: High-entropy minority tokens drive effective reinforcement learning for llm reasoning. arXiv preprint arXiv:2506.01939.

[14] Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, and Renjing Xu. Simple policy optimization. arXiv preprint arXiv:2401.16025.

[15] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 technical report. arXiv preprint arXiv:2505.09388.

[16] https://fengyao.notion.site/off- policy-rl. Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

[17] Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. Simplerl-zoo: Investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892.

[18] Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, Hao Lin, Chencan Wu, Feng Hu, et al. Stabilizing reinforcement learning with llms: Formulation and practices. arXiv preprint arXiv:2512.01374, 2025a. Haizhong Zheng, Jiawei Zhao, and Beidi Chen. Prosperity before collapse: How far can off-policy rl reach with stale data ...
[19] replaces the explicit KL constraint with a ratio-clipping heuristic, enabling first-order optimization. Despite its success, the clipping mechanism neither strictly bounds the likelihood ratio nor enforces a well-defined divergence constraint (Wang et al., 2020). Truly PPO (Wang et al.,

[20] connects trust-region policy optimization with mirror descent (Beck and Teboulle, 2003), approximately solving the trust-region subproblem via multiple gradient steps on a Bregman divergence objective rather than enforcing a hard constraint. Most relevant to our work, SPO (Xie et al.,

[21] replaces PPO’s hard clipping with a smooth quadratic regularizer on the importance ratio. The per-token optimum of the resulting concave quadratic exactly matches PPO’s clipping boundary, while providing non-zero corrective gradients outside the trust region. Our method adopts SPO’s smooth regularization principle but changes the trust-region geometry fro...

[22] and mini-batch policy staleness (Liu et al., 2025a), making trust-region optimization essential for stable training. The dominant approach uses PPO-style hard clipping to impose ratio-based trust regions. GRPO (Shao et al., 2024; Liu et al., 2025c) retains this objective while replacing critic-based advantages with group-relative advantages (Liu et al., 2025c; Zeng e...

[23] removes clipping through truncated importance sampling, and M2PO (Zheng et al., 2025b) constrains the second moment of importance weights. To reduce variance under off-policy data, prior work has also proposed truncated (Yao et al., 2025; Zheng et al., 2025a) and masked (Liu et al., 2025b; Team et al., 2025b) importance sampling. Another line of work uses...

[24] as the

Table 2: Hyperparameters.

| Hyperparameters | Qwen3-4B-Base | Qwen3-30B-A3B-Base | Qwen3.5-35B-A3B-Base | R1D |
| --- | --- | --- | --- | --- |
| Learning Rate | 1e-6 | 1e-6 | 1e-6 | 1e-6 |
| PPO Epochs | 1 | 1 | 1 | 1 |
| Max Prompt Length | 2048 | 2048 | 2048 | 2048 |
| Max Response Length | 8192 | 8192 | 8192 | 8192 |
| Train Batch Size | 64 | 256 | 256 | 64 |
| PPO Mini Batch Size | 32 | 32 | 32 | 16 |
| Rollout Temperature | 1.0 | 1.0 | 1.0 | 1.0 |
| Group Size | 8 | 16 | 1... | |

[25] as the training backend, and at the time we conducted experiments, it did not have sufficient support for efficiently training Qwen3.5, we chose to train fewer steps compared to Qwen3-30B-A3B-Base. D.1 Comparing with KL Regularization [Figure: accuracy versus training step; panel Qwen3.5-35B-A3B-Base ...]
