
arxiv: 2605.26784 · v1 · pith:PSYHHRSUnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI

Ratio-Variance Regularized Policy Optimization

Pith reviewed 2026-06-29 19:41 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords policy optimization · trust region · variance regularization · primal-dual optimization · reinforcement learning · LLM reasoning · robotic control · sample efficiency

The pith

Constraining policy ratio variance approximates trust-region constraints without hard clipping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard clipping in on-policy reinforcement learning cuts off promising but divergent updates indiscriminately. Explicitly penalizing the variance of policy ratios instead supplies a soft, distributional brake that locally approximates trust regions. This preserves gradient information from high-return discoveries while allowing reuse of older data. The resulting R²VPO algorithm is realized through a primal-dual solver and is tested on language-model reasoning and robotic control. The approach aims to deliver more stable and sample-efficient policy optimization than clipping-based baselines.

Core claim

Explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. Implemented via the R²VPO method in a primal-dual optimization framework, the variance penalty acts as a distributional soft brake that preserves critical gradient signals from novel updates while down-weighting stale off-policy data.
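To make the contrast with clipping concrete, here is a minimal NumPy sketch of a variance-penalized surrogate next to the standard clipped surrogate. It is an illustration under stated assumptions, not the paper's implementation: the function names, the penalty weight `lam`, and the choice to measure variance as the second moment of the ratio around 1 (which presumes samples come from the old policy, so the ratio has mean 1) are all hypothetical.

```python
import numpy as np


def ratio_variance_surrogate(logp_new, logp_old, adv, lam):
    """Penalized surrogate: mean(rho * A) - lam * Var(rho).

    Assumption: Var(rho) is estimated as mean((rho - 1)^2), which is the
    variance when samples are drawn from the old policy (E_old[rho] = 1).
    """
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    ratio_var = np.mean((rho - 1.0) ** 2)
    objective = np.mean(rho * np.asarray(adv)) - lam * ratio_var
    return objective, ratio_var


def clipped_surrogate(logp_new, logp_old, adv, eps=0.2):
    """Standard PPO-style clipped surrogate, shown for comparison."""
    rho = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    adv = np.asarray(adv)
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(rho * adv, clipped * adv))


# One sample has doubled its probability and carries a positive advantage.
logp_old = np.zeros(2)
logp_new = np.log(np.array([2.0, 1.0]))
adv = np.array([1.0, 0.0])

penalized, var = ratio_variance_surrogate(logp_new, logp_old, adv, lam=0.5)
clipped = clipped_surrogate(logp_new, logp_old, adv)
print(penalized, var, clipped)
```

In the clipped version, the sample with ratio 2 and positive advantage sits in the flat region of the objective and contributes no gradient. In the penalized version it still contributes, with the quadratic penalty pulling the ratio back toward 1.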

What carries the argument

Ratio-variance regularization term enforced by a primal-dual optimization framework that serves as a soft distributional brake on policy updates.
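The dual side of such a framework can be sketched as projected gradient ascent on a multiplier λ. This is a generic primal-dual pattern, not the update rule from the paper: the step size `lr`, the cap `lam_max`, the variance `target`, and the variance measurements in the toy trace are all hypothetical.

```python
def dual_update(lam, ratio_var, target, lr=0.1, lam_max=100.0):
    """One projected dual-ascent step on the multiplier lam.

    lam grows when the measured ratio variance exceeds the target
    (tightening the brake) and shrinks toward 0 when it falls below.
    """
    lam = lam + lr * (ratio_var - target)
    return min(max(lam, 0.0), lam_max)


# Toy trace with made-up per-iteration variance measurements.
lam = 1.0
for measured_var in (0.30, 0.20, 0.08, 0.02):
    lam = dual_update(lam, measured_var, target=0.05)
    print(f"Var(rho)={measured_var:.2f} -> lam={lam:.3f}")
```

The projection onto `[0, lam_max]` keeps the penalty weight non-negative and bounded. Bounding the multiplier is a common safeguard against runaway dual variables, at the cost of one more hyperparameter.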

Load-bearing premise

Penalizing variance of the policy ratio will preserve high-return updates while maintaining stability without new instabilities arising from the primal-dual solver or the chosen variance target.

What would settle it

If the method produces training instability or lower returns than clipped PPO specifically in settings with frequent high-divergence high-return updates, the claim that variance constraint reliably approximates trust regions would be falsified.

Figures

Figures reproduced from arXiv: 2605.26784 by Dong Li, Fuchun Sun, Huaping Liu, Jianye Hao, Lei Lv, Shuo Han, Yihan Hu, Yu Luo.

Figure 1
Consistent Average Gains. Results are aggregated over 5 mathematical reasoning benchmarks across 7 LLM scales (spanning both Fast and Slow thinking paradigms) and 10 continuous robotic control tasks. R²VPO consistently achieves the highest average scores, demonstrating its robustness and superiority in both discrete (LLM) and continuous (Robotics) action spaces.
Figure 2
Ratio-variance as a unified proxy for f-divergence trust regions. Theoretical quadratic approximations (dashed lines) align tightly with exact numerical values (solid lines) across Reverse KL, Forward KL, and JS metrics. Boxplots and shading represent 80% confidence intervals from sampled Gaussian policies, confirming that ratio variance provides a stable, computationally tractable alternative to complex d…
Figure 3
Training Curves on Continuous Control Tasks (DeepMind Control Suite). We compare R²VPO-ON (blue) against PPO (orange) across locomotion and manipulation tasks, including those with sparse rewards. Solid lines denote the mean performance over 5 independent training runs with different random seeds, and shaded regions denote standard deviation. R²VPO demonstrates superior exploration in sparse settings (e.g.…
Figure 4
Mechanism Analysis. (a) Hard clipping indiscriminately truncates high-value exploration. (b) R²VPO exhibits superior robustness to data staleness compared to GRPO. (c) Bounded ratio distributions empirically validate the reliability of the second-order approximation. (d) The adaptive dual-update strategy (λ_adaptive) outperforms fixed constraints.
Figure 5
Empirical Analysis of Variance as a Proxy for Divergence. We visualize the relationship between the variance of policy ratios, Var(ρθ), and six common divergence metrics across sampled policy updates. The dashed black line represents the theoretical second-order approximation.
Figure 6
Comparison with Exact KL Penalty. We compare R²VPO with a GRPO-KL variant on DeepSeek-Distill-Qwen2.5-1.5B. R²VPO exhibits similar reward and AIME-score dynamics to the exact KL-penalty baseline, supporting the effectiveness of ratio variance as a lightweight local trust-region surrogate.
Figure 7
Training Reward Dynamics across Seven Model Scales. We compare the average episode reward curves of R²VPO (orange/brown) against baselines throughout the training process. R²VPO-OFF consistently demonstrates faster reward convergence and higher asymptotic performance, particularly on smaller models and distilled models, validating the efficiency of variance-regularized off-policy learning.
Figure 8
Evolution of Response Length during Training. The dynamics exhibit two distinct patterns: (1) For Fast Thinking models (top row), R²VPO drives a substantial increase in response length, unlocking extended reasoning capabilities. (2) For Slow Thinking models (bottom row), while all methods reduce initial redundancy, R²VPO maintains significantly longer reasoning chains than GRPO (grey line), which often suf…
Figure 9
Evolution of Policy Ratio Distributions. The grid visualizes the density of ρt(θ) across training iterations (columns) and data staleness levels (rows). The consistent concentration around the center (red line at ρ = 1) confirms that R²VPO effectively constrains divergence and prevents distributional collapse, even when reusing stale data from a large replay buffer (rb = 8).
Figure 10
Visualizations of the continuous control benchmarks. We evaluate our method on 10 diverse tasks from the DeepMind Control Suite, ranging from simple balancing tasks (e.g., Cartpole) to complex locomotion tasks (e.g., Humanoid, Cheetah).
Figure 11
Learning curves on DeepMind Control Suite tasks. The x-axis denotes the number of environment steps, and the y-axis represents the average episode reward. Solid lines denote the mean performance over 5 independent training runs with different random seeds, and shaded regions denote standard deviation. R²VPO (ours) consistently achieves higher returns and faster convergence compared to PPO.
Figure 12
High-Dimensional Control with Stabilized Training. We compare R²VPO and PPO on Dog-Run and Humanoid-Run under a WPO-inspired stabilization setup. R²VPO achieves stronger performance than PPO on both tasks, while the later-stage decrease on Dog-Run suggests that this environment is sensitive to task-specific stabilization and regularization.
Original abstract

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional "soft brake", this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce R²VPO (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across 7 LLM scales, spanning both fast and slow reasoning paradigms, and 10 robotic control tasks demonstrate the generality of the proposed approach. R²VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes R²VPO, a policy optimization algorithm that replaces PPO-style hard clipping with an explicit variance penalty on policy ratios (Var(π_new/π_old)), enforced through a primal-dual optimization framework. This is presented as a soft, distributional approximation to trust-region constraints that preserves high-return gradients while enabling reuse of stale data. The method is evaluated on mathematical reasoning tasks across 7 LLM scales and 10 robotic control tasks, reporting gains in performance and sample efficiency over PPO baselines, especially in sparse-reward settings.

Significance. If the variance penalty can be shown to bound policy divergence comparably to KL or total-variation trust regions without introducing primal-dual instabilities, the approach would supply a more flexible, gradient-preserving alternative to clipping. This could improve data efficiency in on-policy RL for both language and continuous-control domains, particularly where binary clipping discards useful updates.
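A quick way to see when a variance penalty can and cannot stand in for a KL trust region is a closed-form toy case. The sketch below is an illustration, not an experiment from the paper. It assumes an old policy N(0, 1) and a new policy N(mu, 1), for which the KL divergence is mu²/2 and the ratio variance under the old policy is exp(mu²) − 1. The function name is hypothetical.

```python
import math


def gaussian_shift_check(mu):
    """Old policy N(0, 1), new policy N(mu, 1).

    Returns the exact KL divergence and half the ratio variance,
    both in closed form for this unit-variance Gaussian pair.
    """
    kl = 0.5 * mu ** 2                    # KL between unit-variance Gaussians
    ratio_var = math.exp(mu ** 2) - 1.0   # Var_old[pi_new / pi_old]
    return kl, 0.5 * ratio_var


for mu in (0.05, 0.2, 1.0):
    kl, approx = gaussian_shift_check(mu)
    gap = (approx - kl) / kl
    print(f"mu={mu}: KL={kl:.5f}  Var/2={approx:.5f}  relative gap={gap:.3f}")
```

For small shifts the two quantities nearly coincide, which is the local regime the claim depends on. For large shifts half the variance overestimates the KL divergence in this toy case, so a variance cap is the more conservative of the two constraints there.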

major comments (2)
  1. [Abstract] The central claim that 'explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints' is load-bearing yet unsupported by any derivation, inequality, or limiting argument linking Var(ratio) to KL divergence, total variation, or other standard trust metrics; without this link the elimination of hard clipping cannot be justified as principled.
  2. [Abstract] The primal-dual solver that enforces a fixed variance target is presented without analysis of convergence, oscillation risk, or sensitivity to the chosen target value; the weakest assumption (that this solver will not create new instabilities in sparse-reward or LLM reasoning regimes) therefore remains unexamined.
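For context on the first comment, the kind of limiting argument being asked for is a standard second-order expansion. The sketch below is textbook material rather than the paper's own derivation. It assumes the ratio ρ = π_new/π_old stays close to 1 and that expectations are taken under π_old, so E[ρ] = 1.

```latex
% Write \rho = 1 + \delta with \mathbb{E}_{\pi_{\text{old}}}[\delta] = 0,
% so that \operatorname{Var}(\rho) = \mathbb{E}[\delta^2].
\begin{align*}
D_{\mathrm{KL}}(\pi_{\text{old}} \,\|\, \pi_{\text{new}})
  &= \mathbb{E}_{\pi_{\text{old}}}\!\left[-\log \rho\right]
   = \mathbb{E}\!\left[-\delta + \tfrac{1}{2}\delta^{2} - \tfrac{1}{3}\delta^{3} + \cdots\right]
   = \tfrac{1}{2}\operatorname{Var}(\rho) + O\!\left(\mathbb{E}|\delta|^{3}\right), \\
D_{\mathrm{KL}}(\pi_{\text{new}} \,\|\, \pi_{\text{old}})
  &= \mathbb{E}_{\pi_{\text{old}}}\!\left[\rho \log \rho\right]
   = \mathbb{E}\!\left[\delta + \tfrac{1}{2}\delta^{2} - \tfrac{1}{6}\delta^{3} + \cdots\right]
   = \tfrac{1}{2}\operatorname{Var}(\rho) + O\!\left(\mathbb{E}|\delta|^{3}\right), \\
D_{f}(\pi_{\text{new}} \,\|\, \pi_{\text{old}})
  &= \mathbb{E}_{\pi_{\text{old}}}\!\left[f(\rho)\right]
   \approx \tfrac{f''(1)}{2}\operatorname{Var}(\rho)
   \quad \text{for smooth convex } f \text{ with } f(1) = 0 .
\end{align*}
```

The first-order terms vanish because E[δ] = 0, which is why the variance appears as the leading term. The approximation is only local: it gives no bound once the ratio distribution has heavy tails.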

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review of our manuscript. We address each major comment below and indicate planned revisions where appropriate.

  1. Referee: [Abstract] The central claim that 'explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints' is load-bearing yet unsupported by any derivation, inequality, or limiting argument linking Var(ratio) to KL divergence, total variation, or other standard trust metrics; without this link the elimination of hard clipping cannot be justified as principled.

    Authors: The manuscript presents the variance penalty primarily through its empirical behavior as a distributional soft constraint that avoids indiscriminate truncation of high-return updates. We acknowledge that no explicit inequality or limiting argument connecting Var(ratio) to KL or total variation is derived in the current text. In revision we will add a short theoretical subsection providing a local approximation argument under the assumption of small policy steps, relating the second moment of the ratio to a first-order bound on expected divergence.

    Revision: yes

  2. Referee: [Abstract] The primal-dual solver that enforces a fixed variance target is presented without analysis of convergence, oscillation risk, or sensitivity to the chosen target value; the weakest assumption (that this solver will not create new instabilities in sparse-reward or LLM reasoning regimes) therefore remains unexamined.

    Authors: The paper focuses on the practical performance of the primal-dual formulation rather than its theoretical convergence properties. We agree that sensitivity and stability analysis would strengthen the presentation. The revised version will include a dedicated paragraph discussing the observed behavior of the dual variable across the reported domains, together with a brief note on target-value selection and any empirical indications of oscillation.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained


The paper presents R²VPO as a primal-dual method that enforces a variance constraint on policy ratios as a soft surrogate for trust regions. The abstract and description frame this as a demonstrated approximation with empirical validation across LLM and robotics tasks. No equations or steps are shown that reduce the core claim (variance penalty as local trust-region proxy) to a fitted parameter renamed as prediction, a self-citation chain, or a definitional equivalence. The variance target and dual variables are part of the proposed algorithm rather than retrofitted to performance metrics. External evaluations on held-out benchmarks provide independent content, satisfying the criteria for a non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on standard RL assumptions about policy gradients and the existence of a variance that can be constrained without breaking optimization.

pith-pipeline@v0.9.1-grok · 5747 in / 1097 out tokens · 30698 ms · 2026-06-29T19:41:07.472547+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 23 canonical work pages · 10 internal anchors

  1. [1]

    Constitutional AI: Harmlessness from AI Feedback

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.

  2. [2]

    Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition

    Chen, H., Wang, Y., Han, K., Li, D., Li, L., Bi, Z., Li, J., Wang, H., Mi, F., Zhu, M., et al. Pangu embedded: An efficient dual-system llm reasoner with metacognition. arXiv preprint arXiv:2505.22375.

  3. [3]

    Aime problems and solutions

    Committees, M. Aime problems and solutions. https://artofproblemsolving.com/wiki/index.php/AIME_Problems_and_Solutions, 2024, 2025.

    Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., and Madry, A. Implementation matters in deep policy gradients: A case study on ppo and trpo. arXiv...

  4. [4]

    Emergence of Locomotion Behaviours in Rich Environments

    Heess, N., Tb, D., Sriram, S., Lemmon, J., Merel, J., Wayne, G., Tassa, Y., Erez, T., Wang, Z., Eslami, S., et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286.

  5. [5]

    SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

    Kanwar, A., Wagner, D., and Ong, L. Safety-biased policy optimisation: Towards hard-constrained reinforcement learning via trust regions. arXiv preprint arXiv:2512.23770.

  6. [6]

    What Can RL Bring to VLA Generalization? An Empirical Study

    Liu, J., Gao, F., Wei, B., Chen, X., Liao, Q., Wu, Y., Yu, C., and Wang, Y. What can rl bring to vla generalization? an empirical study. arXiv preprint arXiv:2505.19789.

  7. [7]

    VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning

    Lu, G., Guo, W., Zhang, C., Zhou, Y., Jiang, H., Gao, Z., Tang, Y., and Wang, Z. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning. arXiv preprint arXiv:2505.18719.

  8. [8]

    Differentiable Trust Region Layers for Deep Reinforcement Learning

    Otto, F., Becker, P., Vien, N. A., Ziesche, H. C., and Neumann, G. Differentiable trust region layers for deep reinforcement learning. arXiv preprint arXiv:2101.09207.

  9. [9]

    Wasserstein Policy Optimization

    Pfau, D., Davies, I., Borsa, D., Araujo, J. G., Tracey, B., and Van Hasselt, H. Wasserstein policy optimization. arXiv preprint arXiv:2505.00663.

  10. [10]

    Tapered Off-Policy REINFORCE: Stable and Efficient Reinforcement Learning for LLMs

    Roux, N. L., Bellemare, M. G., Lebensold, J., Bergeron, A., Greaves, J., Fréchette, A., Pelletier, C., Thibodeau-Laufer, E., Toth, S., and Work, S. Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms. arXiv preprint arXiv:2503.14286.

  11. [11]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

  12. [12]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.

  13. [13]

    V-MPO: On-Policy Maximum a Posteriori Policy Optimization for Discrete and Continuous Control

    Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., Noury, S., Ahuja, A., Liu, S., Tirumala, D., et al. V-mpo: On-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238.

  14. [14]

    Provably Convergent Policy Optimization via Metric-aware Trust Region Methods

    Song, J., He, N., Ding, L., and Zhao, C. Provably convergent policy optimization via metric-aware trust region methods. arXiv preprint arXiv:2306.14133.

  15. [15]

    Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization

    Su, Z., Pan, L., Bai, X., Liu, D., Dong, G., Huang, J., Hu, W., Zhang, F., Gai, K., and Zhou, G. Klear-reasoner: Advancing reasoning capability via gradient-preserving clipping policy optimization. arXiv preprint arXiv:2508.07629.

  16. [16]

    Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models

    Sun, H., Min, Y., Chen, Z., Zhao, W. X., Fang, L., Liu, Z., Wang, Z., and Wen, J.-R. Challenging the boundaries of reasoning: An olympiad-level math benchmark for large language models. arXiv preprint arXiv:2503.21380.

  17. [17]

    Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., Casas, D. d. L., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., et al. Deepmind control suite. arXiv preprint arXiv:1801.00690.

  18. [18]

    Simple Policy Optimization

    Xie, Z., Zhang, Q., Yang, F., Hutter, M., and Xu, R. Simple policy optimization. arXiv preprint arXiv:2401.16025.

  19. [19]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  20. [20]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.

  21. [21]

    Zentner, K., Puri, U., Huang, Z., and Sukhatme, G. S. Guaranteed trust region optimization via two-phase kl penalization. arXiv preprint arXiv:2312.05405.

  22. [22]

    A Stochastic Trust-Region Framework for Policy Optimization

    Zhao, M., Li, Y., and Wen, Z. A stochastic trust-region framework for policy optimization. arXiv preprint arXiv:1911.11640.

  23. [23]

    R²VPO utilizes adaptive dual updates for LLMs and fixed dual factors for robotics tasks

    Table 3. Unified Hyperparameters. Detailed configuration for LLM mathematical reasoning tasks (Part I) and MuJoCo Playground continuous control tasks (Part II). R²VPO utilizes adaptive dual updates for LLMs and fixed dual factors for robotics tasks. Parameter Value Part I: LLM Mathematical Reasoning Common Training Configuration Optimizer AdamW Learning Rat...

  24. [24]

    Table 4. Comprehensive Benchmarking Results. We compare R²VPO against multiple strong baselines (GRPO, GRPO-CH, GPPO, TOPR) across seven model scales. Avg reports the average accuracy, with the relative improvement over Base shown in parentheses. Method AIME 24 AIME 25 AMC 23 HMMT OlymMath Avg (Gain vs Base) openPangu-Embedded-1B Base 20.83 21.67 60.00 9.59 4.0...

  25. [25]

    Crucially, the distributions remain tightly concentrated around ρt = 1 (indicated by the red vertical line) even at maximum staleness, providing strong empirical evidence that the optimization trajectory stays within the valid regime of our second-order variance approximation throughout the training process.

  26. [26]

    These results demonstrate that R²VPO significantly outperforms PPO across the majority of tasks. In particular, R²VPO exhibits superior sample efficiency and asymptotic performance in complex environments such as WalkerRun and CheetahRun, while maintaining robust learning in sparse-reward settings like CartpoleSwingupSparse.