pith. sign in

arxiv: 2605.25604 · v1 · pith:EMBLF33Jnew · submitted 2026-05-25 · 💻 cs.CL · cs.LG

DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

Pith reviewed 2026-06-29 21:27 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords multi-reward reinforcement learningadvantage optimizationlarge language modelsmathematical reasoningtool usevariance adaptive weightinggroup relative policy optimizationpolicy optimization
0
0 comments X

The pith

DVAO adjusts multi-reward advantage weights by empirical variance per rollout group to bound magnitudes and add cross-objective regularization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that standard ways of combining multiple rewards in RL for LLMs create either oversized advantages or static weights that ignore signal strength. DVAO replaces these with weights derived from the observed variance of each reward inside the current group of rollouts. This up-weights objectives that deliver consistent learning signals and down-weights noisy ones. The method is shown to keep advantage sizes bounded and to produce a self-adjusting regularization effect across objectives. On mathematical reasoning and tool-use tasks the resulting policies reach better trade-offs than the baselines while remaining stable.

Core claim

DVAO dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. The approach is proven to maintain bounded advantage magnitudes for stable training and to introduce a self-adaptive cross-objective regularization mechanism. Experiments with Qwen3 and Qwen2.5 models on mathematical reasoning and tool-use benchmarks show that DVAO outperforms baseline scalarization methods and achieves a superior multi-objective Pareto frontier.

What carries the argument

Dynamic Variance-adaptive Advantage Optimization (DVAO), which recomputes per-objective weights from empirical reward variances observed inside each rollout group.

If this is right

  • Advantage magnitudes remain bounded regardless of the number or scale of reward objectives.
  • A self-adaptive regularization effect emerges automatically across objectives without extra hyperparameters.
  • The policy reaches a better multi-objective Pareto frontier than either Reward Combination or Advantage Combination.
  • Training stability is preserved on mathematical reasoning and tool-use tasks with current open models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same variance-driven weighting could be tested in non-LLM multi-reward RL domains where rollout groups are already collected.
  • If variance estimates become unreliable for very small groups, the method may need an explicit smoothing term not derived in the paper.
  • The bounded-magnitude guarantee may allow larger learning rates or longer training runs than static scalarization permits.

Load-bearing premise

The empirical reward variance measured inside each rollout group is a reliable, unbiased proxy for the true strength of the learning signal carried by that objective.

What would settle it

A controlled run on the same benchmarks in which variance-derived weights produce either larger advantage magnitudes or lower final performance than the static Advantage Combination baseline.

Figures

Figures reproduced from arXiv: 2605.25604 by Chuzhan Hao, Guochao Jiang, Guofeng Quan, Guohua Liu, Jingyi Song, Yuewei Zhang.

Figure 1
Figure 1. Figure 1: Training dynamics on Qwen3-4B-Base. Left: accuracy reward (top=mean, bottom=std). Middle: length reward (top=mean, bottom=std). Right: average response length. Accuracy reward. Across both model scales, DVAO consistently achieves the highest accuracy reward while suppressing its variance most effectively. All methods start from a similar low baseline and rise steadily throughout training. DVAO’s accuracy r… view at source ↗
Figure 2
Figure 2. Figure 2: Training dynamics on Qwen3-8B-Base. Left: accuracy reward (top=mean, bottom=std). Middle: length reward (top=mean, bottom=std). Right: average response length. 36 38 40 42 44 46 Acc. 90 92 94 96 98 100 Len. Qwen3-4B-Base RC AC GDPO DVAO (a) Mathematical Reasoning Task (Qwen3-4B-Base) 50 52 54 56 58 60 Acc. 60.0 62.5 65.0 67.5 70.0 72.5 75.0 77.5 80.0 Format. Qwen2.5-3B-Instruct RC AC GDPO DVAO (b) Tool-Use… view at source ↗
Figure 3
Figure 3. Figure 3: Pareto frontier of accuracy vs. length/format compliance across methods. DVAO consistently [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Dynamic Variance-adaptive Advantage Optimization (DVAO) as an improvement over standard scalarization methods (Reward Combination and Advantage Combination) in Group Relative Policy Optimization for multi-reward RL alignment of LLMs. DVAO dynamically sets combination weights from the empirical per-objective reward variance computed inside each rollout group, claims a mathematical proof that this keeps advantage magnitudes bounded, introduces a self-adaptive cross-objective regularization term, and reports superior performance on mathematical-reasoning and tool-use benchmarks with Qwen3/Qwen2.5 models.

Significance. If the bounded-magnitude proof is correct and the variance-based weighting is shown to be unbiased, the method would address a practical instability problem in multi-objective RL for LLMs and could improve Pareto efficiency over static scalarization baselines.

major comments (3)
  1. [Abstract] Abstract: the claim of a 'mathematical proof' that DVAO maintains bounded advantage magnitudes is unsupported by any equation, definition of the dynamic weights, or proof sketch. Without these, it is impossible to verify whether the variance-based re-weighting actually produces the claimed bound or whether the weights reduce, by construction, to quantities estimated from the same rollout data used for the advantages.
  2. [Abstract] Abstract / Method description: the core assumption that empirical reward variance inside a rollout group is an unbiased, low-noise proxy for learning-signal strength is load-bearing for both the stability guarantee and the reported gains, yet no analysis of finite-sample bias, group-size effects, objective correlations, or zero-variance edge cases is supplied.
  3. [Abstract] Abstract: the experimental claims of 'significant outperformance' and 'superior multi-objective Pareto frontier' are presented without any dataset sizes, number of runs, statistical tests, or baseline implementation details, rendering the empirical support unverifiable.
minor comments (1)
  1. [Abstract] The phrase 'self-adaptive cross-objective regularization mechanism' is introduced without a definition or equation showing how it differs from standard regularization or how it emerges from the variance weighting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful reading and specific comments on the abstract and method. We respond to each major comment below and will revise the manuscript accordingly where indicated.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of a 'mathematical proof' that DVAO maintains bounded advantage magnitudes is unsupported by any equation, definition of the dynamic weights, or proof sketch. Without these, it is impossible to verify whether the variance-based re-weighting actually produces the claimed bound or whether the weights reduce, by construction, to quantities estimated from the same rollout data used for the advantages.

    Authors: We agree the abstract does not contain the supporting details. Section 3.2 of the manuscript defines the dynamic weights explicitly as w_k = 1 / (σ_k² + ε) where σ_k² is the empirical variance of objective k within the rollout group, and proves that the resulting combined advantage vector satisfies ||A||₂ ≤ C for a constant C independent of the reward scales. We will revise the abstract to include a one-sentence reference to this weight definition and the bounded-norm result. revision: yes

  2. Referee: [Abstract] Abstract / Method description: the core assumption that empirical reward variance inside a rollout group is an unbiased, low-noise proxy for learning-signal strength is load-bearing for both the stability guarantee and the reported gains, yet no analysis of finite-sample bias, group-size effects, objective correlations, or zero-variance edge cases is supplied.

    Authors: The referee correctly identifies that the paper relies on this proxy without accompanying analysis. We will add a dedicated paragraph in Section 3.3 and a short appendix subsection that (i) derives the finite-sample bias of the variance estimator under Gaussian reward noise, (ii) reports empirical sensitivity to group size (G=4,8,16), (iii) measures objective correlations on the training rollouts, and (iv) specifies the ε-floor used for zero-variance objectives. revision: yes

  3. Referee: [Abstract] Abstract: the experimental claims of 'significant outperformance' and 'superior multi-objective Pareto frontier' are presented without any dataset sizes, number of runs, statistical tests, or baseline implementation details, rendering the empirical support unverifiable.

    Authors: The full experimental section (Section 4) already specifies the datasets (MATH, GSM8K, ToolBench), five random seeds, paired t-tests with p<0.05, and exact baseline re-implementations. To make the abstract self-contained we will append a concise clause: 'across three benchmarks with five seeds, yielding statistically significant gains (p<0.05) over static scalarization baselines.' revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained based on provided text

full rationale

The abstract proposes DVAO using empirical per-objective reward variance within rollout groups to set dynamic weights, claims a mathematical proof of bounded advantage magnitudes, and mentions self-adaptive regularization, but supplies no equations, derivations, or self-citations. No load-bearing step can be quoted that reduces by construction to fitted inputs or prior self-work. The variance-based weighting is presented as a novel mechanism with an independent stability proof, making the central claim self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities. The dynamic weights are described as derived from empirical variance, but whether this introduces fitted constants or relies on unstated domain assumptions cannot be determined from the given text.

pith-pipeline@v0.9.1-grok · 5740 in / 1166 out tokens · 33651 ms · 2026-06-29T21:27:20.481755+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 22 canonical work pages · 10 internal anchors

  1. [1]

    L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning.arXiv preprint arXiv:2503.04697,

  2. [2]

    Le, Sergey Levine, and Yi Ma

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V . Le, Sergey Levine, and Yi Ma. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. InForty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19,

  3. [3]

    net/forum?id=dYur3yabMj

    URLhttps://openreview. net/forum?id=dYur3yabMj. Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey.Trans. Mach. Learn. Res., 2025, 2025a. URL https://openreview.net/forum?id= sySqlxj8EB. Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Guochao Jiang, and Jingyi Song. Airrag: Au- tonomous strategic planning and reasoning ...

  4. [4]

    2025.00098

    doi: 10.1109/ICSME64153. 2025.00098. URLhttps://doi.org/10.1109/ICSME64153.2025.00098. Ruofan Gao, Amjed Tahir, Peng Liang, Teo Susnjak, and Foutse Khomh. A survey of bugs in ai-generated code.arXiv preprint arXiv:2512.05239,

  5. [5]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning

    doi: 10.1038/S41586-025-09422-Z. URLhttps://doi.org/10.1038/s41586-025-09422-z. 10 Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Leng Thai, Junhao Shen, Jinyi Hu, Xu Han, Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. Olympiad- bench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal sc...

  6. [6]

    URL https://doi.org/10.18653/v1/2024.acl-long.211

    doi: 10.18653/V1/2024.ACL-LONG.211. URL https://doi.org/10.18653/v1/2024.acl-long.211. Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Trans. Inf. S...

  7. [7]

    2025 , publisher =

    doi: 10.1145/3703155. URLhttps://doi.org/10.1145/3703155. Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, and Prithviraj Ammanabrolu. Personalized soups: Personalized large language model alignment via post-hoc parameter merging.arXiv preprint arXiv:2310.11564,

  8. [8]

    In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 31122–31130

    Guochao Jiang, Wenfeng Feng, Guofeng Quan, Chuzhan Hao, Yuewei Zhang, Guohua Liu, and Hao Wang. Vcrl: Variance-based curriculum reinforcement learning for large language models.arXiv preprint arXiv:2509.19803, 2025a. Guochao Jiang, Guofeng Quan, Zepeng Ding, Ziqin Luo, Dixuan Wang, and Zheng Hu. Flashthink: An early exit method for efficient reasoning.arX...

  9. [9]

    Alarm: Align language models via hierarchical rewards modeling

    Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing Huang, and Zhongyu Wei. Alarm: Align language models via hierarchical rewards modeling. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, Findings of ACL, pages 7817–7831. Ass...

  10. [10]

    URL https: //doi.org/10.18653/v1/2024.findings-acl.465

    doi: 10.18653/V1/2024.FINDINGS-ACL.465. URL https: //doi.org/10.18653/v1/2024.findings-acl.465. Xuefeng Li, Haoyang Zou, and Pengfei Liu. Torl: Scaling tool-integrated rl.arXiv preprint arXiv:2503.23383,

  11. [11]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11,

  12. [12]

    Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, et al

    URLhttps://openreview.net/forum?id=v8L0pN6EOi. Qiqiang Lin, Muning Wen, Qiuying Peng, Guanyu Nie, Junwei Liao, Jun Wang, Xiaoyun Mo, Jiamu Zhou, Cheng Cheng, Yin Zhao, et al. Hammer: Robust function-calling for on-device language models via function masking.arXiv preprint arXiv:2410.04587,

  13. [13]

    Dler: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110, 2025a

    Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, et al. Dler: Doing length penalty right-incentivizing more intelligence per token via reinforcement learning.arXiv preprint arXiv:2510.15110, 2025a. Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Ming...

  14. [14]

    Learn to reason efficiently with adaptive length-based reward shaping.arXiv preprint arXiv:2505.15612, 2025b

    Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, and Junxian He. Learn to reason efficiently with adaptive length-based reward shaping.arXiv preprint arXiv:2505.15612, 2025b. 11 Weiwen Liu, Xu Huang, Xingshan Zeng, Xinlong Hao, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Zhengying Liu, Yuanqing Yu, Zezhong Wang, Yux...

  15. [15]

    URL https: //openreview.net/forum?id=8EB8k6DdCU

    OpenReview.net, 2025c. URL https: //openreview.net/forum?id=8EB8k6DdCU. Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, et al. Part i: Tricks or traps? a deep dive into rl for llm reasoning.arXiv preprint arXiv:2508.08221, 2025d. Ilya Loshchilov and Frank Hutter. Decoupled weight decay re...

  16. [16]

    O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning.CoRR, abs/2501.12570, 2025

    URLhttps://openreview.net/forum?id=Bkg6RiCqY7. Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, and Dacheng Tao. O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning. arXiv preprint arXiv:2501.12570,

  17. [17]

    Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E

    Shishir G. Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. The berkeley function calling leaderboard (BFCL): from tool use to agentic evaluation of large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste- Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, e...

  18. [18]

    ToolRL: Reward is All Tool Learning Needs

    doi: 10.1613/JAIR.1.18675. URLhttps://doi.org/10.1613/jair.1.18675. Cheng Qian, Emre Can Acikgoz, Qi He, Hongru Wang, Xiusi Chen, Dilek Hakkani-Tür, Gokhan Tur, and Heng Ji. Toolrl: Reward is all tool learning needs.arXiv preprint arXiv:2504.13958,

  19. [19]

    A comprehensive survey of hallucination in large language, image, video and audio foundation models

    Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, and Aman Chadha. A comprehensive survey of hallucination in large language, image, video and audio foundation models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-...

  20. [20]

    Proximal Policy Optimization Algorithms

    doi: 10.18653/V1/2024.FINDINGS-EMNLP.685. URL https: //doi.org/10.18653/v1/2024.findings-emnlp.685. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

  21. [21]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathemat- ical reasoning in open language models.arXiv preprint arXiv:2402.03300,

  22. [22]

    Hybridflow: A flexible and efficient RLHF framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient RLHF framework. InProceedings of the Twentieth European Conference on Computer Systems, EuroSys 2025, Rotterdam, The Netherlands, 30 March 2025 - 3 April 2025, pages 1279–1297. ACM,

  23. [23]

    doi: 10.1145/3689031. 3696075. URLhttps://doi.org/10.1145/3689031.3696075. Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning.arXiv preprint arXiv:2508.09726,

  24. [24]

    Stop overthinking: A survey on efficient reasoning for large language models.Trans

    Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, and Xia Hu. Stop overthinking: A survey on efficient reasoning for large language models.Trans. Mach. Learn. Res., 2025,

  25. [25]

    Kimi K2.5: Visual Agentic Intelligence

    doi: 10.1007/S10664-025-10614-4. URL https://doi.org/10. 1007/s10664-025-10614-4. Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

  26. [26]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning.arXiv preprint arXiv:2502.14768,

  27. [27]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report.arXiv preprint arXiv:2412.15115,

  28. [28]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  29. [29]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476,

  30. [30]

    A Survey of Reinforcement Learning for Large Reasoning Models

    Jianguo Zhang, Tian Lan, Ming Zhu, Zuxin Liu, Thai Hoang, Shirley Kokane, Weiran Yao, Juntao Tan, Akshara Prabhakar, Haolin Chen, Zhiwei Liu, Yihao Feng, Tulika Manoj Awalgaonkar, Rithesh R. N., Zeyuan Chen, Ran Xu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, and Caiming Xiong. xlam: A family of large action models to empower AI agen...

  31. [31]

    For the advantage combination: 1 G GX j=1 A(i,j) 2 = 1 G GX j=1 X k wkA(i,j) k !2 = X k w2 k · 1 G GX j=1 A(i,j) k 2 + 2 X k<l wkwl · 1 G GX j=1 A(i,j) k A(i,j) l = X k w2 k + 2 X k<l wkwl ˆρi kl = X k wk !2 −2 X k<l wkwl 1−ˆρi kl = 1−2 X k<l wkwl 1−ˆρi kl ≤1, which completes the proof. B Proof of Proposition 2 Proposition 2.For a fixed query xi and rollo...