pith. machine review for the scientific record.

arxiv: 2604.19485 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords explained variance · policy optimization · adaptive critic · PPO · GRPO · LLM post-training · variance reduction · reinforcement learning

The pith

EVPO monitors batch explained variance to switch between critic and batch-mean baselines, ensuring no higher variance than the better of the two at every step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in sparse-reward LLM post-training a learned critic can add more noise than it removes, raising advantage variance instead of lowering it. It unifies PPO and GRPO as opposite extremes of a Kalman filter and shows that explained variance calculated from one training batch marks exactly when the critic helps versus hurts. EVPO uses this per-step EV reading to pick the lower-variance estimate automatically. A reader should care because the method removes any need to choose between the two baselines in advance and delivers better results on control, agent, and math tasks regardless of which fixed baseline was originally stronger.
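
A minimal sketch of this gating rule, assuming the standard explained-variance diagnostic EV = 1 − Var(R − V)/Var(R) over per-trajectory returns; the function and variable names are illustrative, and the paper's exact advantage estimator (e.g., any GAE or normalization details) may differ:

    import numpy as np

    def explained_variance(returns, values):
        # Standard EV diagnostic: 1 - Var(R - V) / Var(R).
        # EV > 0: the critic explains part of the return variance.
        # EV <= 0: the critic adds more noise than it removes.
        var_r = np.var(returns)
        if var_r == 0.0:
            return 0.0
        return 1.0 - np.var(returns - values) / var_r

    def evpo_advantages(returns, values):
        # Gate between the critic baseline (PPO-style) and the batch
        # mean (GRPO-style) on the sign of batch-level EV.
        if explained_variance(returns, values) > 0.0:
            return returns - values            # critic reduces variance
        return returns - returns.mean()        # batch mean is safer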

Core claim

Casting baseline selection as a Kalman filtering problem makes PPO and GRPO the two limiting cases of the Kalman gain. Explained variance, computable directly from a single batch, identifies the precise boundary: positive EV means the critic reduces variance while zero or negative EV means it inflates it. EVPO therefore tracks batch-level EV and switches between critic-based and batch-mean advantage estimates, provably achieving variance no greater than the minimum of the two fixed methods at every training step. Experiments across classical control, agentic interaction, and mathematical reasoning confirm consistent outperformance over both PPO and GRPO, with the EV signal also tracking critic maturation over training.

What carries the argument

Batch-level explained variance as the real-time indicator that decides whether to use the learned critic or the batch mean for advantage estimation inside a Kalman-unified policy update.
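
One plausible formalization of this Kalman view, reconstructed from the abstract rather than taken from the paper's own notation: treat the batch mean \bar{R} as the prior on the state value and the critic as a noisy, independent observation of it,

    V_\phi(s) = v(s) + \varepsilon, \qquad \varepsilon \perp v(s), \qquad
    \hat{b}(s) = \bar{R} + K \bigl( V_\phi(s) - \bar{R} \bigr), \qquad
    K = \frac{P}{P + \sigma_\varepsilon^2},

where P is the prior variance and \sigma_\varepsilon^2 the critic's observation noise. K → 1 recovers the pure critic baseline (PPO) and K → 0 the batch mean (GRPO); the independence of \varepsilon is exactly the assumption the referee's first major comment probes below.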

If this is right

  • EVPO provably never exceeds the variance of the better fixed baseline at any training step.
  • The adaptive switch outperforms both PPO and GRPO on tasks where either one was originally stronger.
  • Batch EV tracks critic maturation and signals when the critic becomes reliable during training.
  • The zero-EV threshold derived from the Kalman model is empirically optimal for the switch decision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar per-batch variance monitoring could decide when to trust other learned value functions in sparse-reward RL beyond the PPO-GRPO pair.
  • The Kalman unification offers a way to analyze and combine additional baseline estimators without manual selection.
  • The approach may reduce hyperparameter search effort when scaling RL post-training to new tasks with unknown reward sparsity.

Load-bearing premise

That explained variance measured on a single training batch reliably signals whether the critic reduces or increases advantage variance in sparse-reward LLM post-training.
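
A toy check of this premise under the idealized independent-noise model (all numbers hypothetical): a critic with small observation noise yields positive EV and lower advantage variance than the batch mean, and a noisy critic flips both at once.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 256                                       # batch size (illustrative)
    true_v = rng.normal(0.0, 1.0, n)              # per-state true values
    returns = true_v + rng.normal(0.0, 1.0, n)    # returns = value + reward noise

    for critic_noise in (0.3, 3.0):               # accurate vs. noisy critic
        values = true_v + rng.normal(0.0, critic_noise, n)
        ev = 1.0 - np.var(returns - values) / np.var(returns)
        var_critic = np.var(returns - values)           # PPO-style advantages
        var_mean = np.var(returns - returns.mean())     # GRPO-style advantages
        print(f"noise {critic_noise}: EV {ev:+.2f}, "
              f"critic var {var_critic:.2f}, batch-mean var {var_mean:.2f}")

Whether the same sign agreement survives when the critic's error is correlated with the returns, as in real training, is the open question the referee report below presses on.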

What would settle it

An experiment on any of the four tasks in which EVPO's measured advantage variance exceeds the variance of both PPO and GRPO at one or more steps, or in which a non-zero EV threshold produces lower final variance than the zero threshold.
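
The second condition amounts to sweeping a threshold τ in a generalized gate and checking whether τ = 0 wins, which is what Figure 6 reports. A sketch of that generalization, reusing the illustrative explained_variance helper from the pith above:

    def evpo_advantages_tau(returns, values, tau=0.0):
        # Generalized gate: tau = 0.0 is the theoretically derived rule;
        # sweeping tau over e.g. [-0.1, 0.1] tests its optimality.
        if explained_variance(returns, values) > tau:
            return returns - values
        return returns - returns.mean()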

Figures

Figures reproduced from arXiv: 2604.19485 by Binghai Wang, Chengjun Pan, Dingwei Zhu, Jiahang Lin, Jiazheng Zhang, Rui Zheng, Shichun Liu, Shihan Dou, Songyang Gao, Tao Gui, Xuanjing Huang, Yansong Feng, Zhenhua Han.

Figure 1. Overview of EVPO. The EV gating module selects between the critic baseline (EV…

Figure 2. Training dynamics of PPO on Sokoban. (a) Batch-level EV over training steps. (b) Policy gradient norms. (c) Training success rate. The shaded region in (a–c) marks the early phase where EV is low. (d) Effect of injecting Gaussian noise into a converged critic on FrozenLake.

Figure 3. Reducing critic variance helps the actor.

Figure 4. Training and validation curves across four tasks.

Figure 5. Gating trigger probability (EV_b > 0, i.e., selection of critic over batch mean) across training. Finding 3: in most tasks, gating frequency decreases monotonically over training, reflecting critic maturation from P_A > P_B to P_A < P_B.

Figure 6. Threshold sensitivity analysis. Top row: training metrics (smoothed); bottom row: validation metrics (unsmoothed). (a, c, e, g) Best success rate as a function of the EV switching threshold; stars mark τ = 0. (b, d, f, h) Full curves under each threshold setting. Performance peaks at τ = 0 on both tasks. Finding 4: performance peaks at EV = 0 and degrades in either direction, with positive and negative deviati…
Original abstract

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in sparse-reward LLM post-training, a learned critic can increase rather than reduce advantage variance due to estimation noise. Casting baseline selection as a Kalman filtering problem unifies PPO (critic-based) and GRPO (batch-mean) as extremes of the Kalman gain. It proves that batch-level explained variance (EV) identifies the exact boundary: positive EV means the critic reduces variance while zero or negative EV means it inflates variance. EVPO adaptively switches between the two estimators at each step based on this threshold, provably achieving variance no greater than the better of the two at every step. Experiments across four tasks (classical control, agentic interaction, mathematical reasoning) show EVPO outperforming both fixed baselines regardless of which is stronger on a given task, with further analysis confirming the adaptive rule tracks critic maturation.

Significance. If the theoretical guarantee holds under the paper's modeling assumptions, this provides a principled, parameter-free adaptive mechanism for critic utilization in RL post-training of LLMs, addressing a practical tension between classical variance-reduction theory and the adoption of critic-free methods like GRPO. The Kalman unification is elegant, the derivation of the zero-EV threshold from first principles is a strength, and the consistent empirical outperformance across diverse tasks (without task-specific tuning) adds practical value. Credit is given for the explicit analysis of how the gating evolves with critic quality over training.

major comments (2)
  1. [Section 3 (Kalman model and variance proof)] The claim that EV computed from a single batch 'identifies the exact boundary' and yields a provable variance guarantee assumes additive independent noise in the critic observation model. In the sparse-reward LLM setting, the same trajectories are used both to train the critic and to compute batch EV/advantages, inducing correlation between critic errors and returns. This correlation can reverse the sign of the true variance difference, breaking the guarantee that the chosen estimator has variance no larger than the better of the two. The paper must either relax the independence assumption in the derivation or provide a separate robustness argument/empirical check that the sign of EV still correctly identifies the variance-minimizing choice under realistic correlation.
  2. [Section 5 (empirical evaluation)] The central empirical claim is that EVPO 'consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task.' This requires showing that the adaptive switches actually select the lower-variance estimator in practice. The results should report per-task variance across random seeds, the fraction of steps where EVPO matches the better fixed method, and whether the zero threshold was fixed a priori or validated post-hoc; without these, it is unclear whether the gains stem from the theoretical rule or from incidental benefits of switching.
minor comments (2)
  1. [Section 2 or 3] The exact formula for batch explained variance (EV) should be stated explicitly (including the precise definition of 'explained' versus total variance) in the main text rather than deferred to an appendix, as it is load-bearing for reproducibility of the adaptive rule.
  2. [Figures in Section 5] Figure captions for the EV-over-training plots should annotate the points at which the method switches baselines to make the adaptive behavior visually verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below, clarifying our theoretical assumptions and committing to expanded empirical reporting. Revisions will be incorporated in the next version of the paper.

Point-by-point responses
  1. Referee: The claim that EV computed from a single batch 'identifies the exact boundary' and yields a provable variance guarantee assumes additive independent noise in the critic observation model. In the sparse-reward LLM setting, the same trajectories are used both to train the critic and to compute batch EV/advantages, inducing correlation between critic errors and returns. This correlation can reverse the sign of the true variance difference, breaking the guarantee that the chosen estimator has variance no larger than the better of the two. The paper must either relax the independence assumption in the derivation or provide a separate robustness argument/empirical check that the sign of EV still correctly identifies the variance-minimizing choice under realistic correlation.

    Authors: We agree that the exact zero-EV threshold in the variance proof relies on the additive independent noise assumption in the Kalman observation model. The correlation induced by reusing trajectories for both critic training and EV/advantage computation is a valid concern that could affect the guarantee in principle. To address this, we will revise Section 3 to explicitly state the independence assumption and add a dedicated robustness subsection. This subsection will (i) derive approximate bounds showing that the sign of batch EV remains indicative of variance reduction under moderate positive correlations typical in RL batches, and (ii) include an empirical verification on held-out trajectory subsets confirming that positive EV still selects the lower-variance estimator in our experimental regimes. These additions preserve the core Kalman unification while directly responding to the correlation issue. revision: partial
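
A minimal sketch of the held-out verification proposed in (ii), assuming per-trajectory returns and critic values are logged for each batch (names illustrative): compute EV on one half of the batch and test, on the disjoint other half, whether its sign picks the lower-variance estimator, so the EV estimate is decoupled from the data it is judged on.

    import numpy as np

    def heldout_ev_check(returns, values, rng, frac=0.5):
        idx = rng.permutation(len(returns))
        cut = int(frac * len(returns))
        a, b = idx[:cut], idx[cut:]
        # EV measured on half a only.
        ev = 1.0 - np.var(returns[a] - values[a]) / np.var(returns[a])
        # Realized advantage variances on the disjoint half b.
        var_critic = np.var(returns[b] - values[b])
        var_mean = np.var(returns[b] - returns[b].mean())
        agrees = (ev > 0.0) == (var_critic < var_mean)
        return ev, agrees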

  2. Referee: The central empirical claim is that EVPO 'consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task.' This requires showing that the adaptive switches actually select the lower-variance estimator in practice. The results should report per-task variance across random seeds, the fraction of steps where EVPO matches the better fixed method, and whether the zero threshold was fixed a priori or validated post-hoc; without these, it is unclear whether the gains stem from the theoretical rule or from incidental benefits of switching.

    Authors: We concur that the current empirical section would benefit from these additional diagnostics to more directly link performance gains to the adaptive rule. In the revised manuscript we will augment Section 5 with: (1) per-task means and standard deviations of key metrics (e.g., return variance and final performance) across five random seeds for EVPO, PPO, and GRPO; (2) the fraction of training steps at which EVPO selects the critic-based estimator versus the batch-mean estimator, together with the percentage of steps where this choice coincides with the empirically lower-variance fixed method on that task; and (3) explicit documentation that the zero-EV threshold was fixed a priori from the theoretical derivation, accompanied by a sensitivity plot varying the threshold over [-0.1, 0.1] to demonstrate that zero remains near-optimal. These additions will substantiate that the observed improvements arise from the principled switching mechanism. revision: yes
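
A sketch of diagnostic (2), assuming per-step logs of the gate decision and of the advantage variance each fixed baseline would have produced on that step's batch (all names illustrative):

    import numpy as np

    def gating_diagnostics(chose_critic, var_ppo, var_grpo):
        # chose_critic: bool per training step (True = critic baseline).
        # var_ppo / var_grpo: realized advantage variance of each fixed
        # method on that step's batch.
        chose_critic = np.asarray(chose_critic)
        better_is_critic = np.asarray(var_ppo) < np.asarray(var_grpo)
        return {
            "frac_steps_critic": float(chose_critic.mean()),
            "frac_matches_lower_variance":
                float((chose_critic == better_is_critic).mean()),
        }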

Circularity Check

0 steps flagged

No significant circularity; derivation is model-based and self-contained

Full rationale

The paper casts baseline selection as a Kalman filtering problem and derives the EV-based switching rule mathematically from the properties of the Kalman gain (positive EV selects critic-based advantages; non-positive selects batch-mean). Explained variance is computed directly as a batch statistic from the same trajectories, without any fitted parameters being renamed as predictions. The unification of PPO and GRPO is an application of standard filtering theory rather than an ansatz or result imported via self-citation. No step reduces the claimed guarantee to a tautology or self-referential quantity by construction; the proof holds inside the stated model assumptions even if those assumptions are later questioned on empirical grounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on modeling advantage estimation as a Kalman filtering problem and treating batch-level explained variance as a reliable indicator of critic utility. No free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Advantage estimation in sparse-reward settings can be modeled as a Kalman filtering problem in which the learned critic acts as a noisy observation of the true state value.
    This modeling choice is used to unify PPO and GRPO as extremes of the Kalman gain and to derive the role of explained variance.

pith-pipeline@v0.9.0 · 5589 in / 1582 out tokens · 48324 ms · 2026-05-10T02:56:04.270939+00:00 · methodology

