pith. machine review for the scientific record.

arxiv: 2604.19485 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.AI · cs.CL

Recognition: unknown

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 02:56 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords explained variance · policy optimization · adaptive critic · PPO · GRPO · LLM post-training · variance reduction · reinforcement learning

The pith

EVPO monitors batch explained variance to switch between critic and batch-mean baselines, ensuring no higher variance than the better of the two at every step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in sparse-reward LLM post-training a learned critic can add more noise than it removes, raising advantage variance instead of lowering it. It unifies PPO and GRPO as opposite extremes of a Kalman filter and shows that explained variance calculated from one training batch marks exactly when the critic helps versus hurts. EVPO uses this per-step EV reading to pick the lower-variance estimate automatically. A reader should care because the method removes any need to choose between the two baselines in advance and delivers better results on control, agent, and math tasks regardless of which fixed baseline was originally stronger.
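
A minimal sketch of this gating rule, assuming the standard explained-variance diagnostic EV = 1 − Var(R − V)/Var(R) over per-trajectory returns; the function and variable names are illustrative, and the paper's exact advantage estimator (e.g., any GAE or normalization details) may differ:

    import numpy as np

    def explained_variance(returns, values):
        # Standard EV diagnostic: 1 - Var(R - V) / Var(R).
        # EV > 0: the critic explains part of the return variance.
        # EV <= 0: the critic adds more noise than it removes.
        var_r = np.var(returns)
        if var_r == 0.0:
            return 0.0
        return 1.0 - np.var(returns - values) / var_r

    def evpo_advantages(returns, values):
        # Gate between the critic baseline (PPO-style) and the batch
        # mean (GRPO-style) on the sign of batch-level EV.
        if explained_variance(returns, values) > 0.0:
            return returns - values            # critic reduces variance
        return returns - returns.mean()        # batch mean is safer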

Core claim

Casting baseline selection as a Kalman filtering problem makes PPO and GRPO the two limiting cases of the Kalman gain. Explained variance, computable directly from a single batch, identifies the precise boundary: positive EV means the critic reduces variance while zero or negative EV means it inflates it. EVPO therefore tracks batch-level EV and switches between critic-based and batch-mean advantage estimates, provably achieving variance no greater than the minimum of the two fixed methods at every training step. Experiments across classical control, agentic interaction, and mathematical reasoning confirm consistent outperformance over both PPO and GRPO, with the EV signal also tracking critic maturation over training.

What carries the argument

Batch-level explained variance as the real-time indicator that decides whether to use the learned critic or the batch mean for advantage estimation inside a Kalman-unified policy update.
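
One plausible formalization of this Kalman view, reconstructed from the abstract rather than taken from the paper's own notation: treat the batch mean \bar{R} as the prior on the state value and the critic as a noisy, independent observation of it,

    V_\phi(s) = v(s) + \varepsilon, \qquad \varepsilon \perp v(s), \qquad
    \hat{b}(s) = \bar{R} + K \bigl( V_\phi(s) - \bar{R} \bigr), \qquad
    K = \frac{P}{P + \sigma_\varepsilon^2},

where P is the prior variance and \sigma_\varepsilon^2 the critic's observation noise. K → 1 recovers the pure critic baseline (PPO) and K → 0 the batch mean (GRPO); the independence of \varepsilon is exactly the assumption the referee's first major comment probes below.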

If this is right

  • EVPO provably never exceeds the variance of the better fixed baseline at any training step.
  • The adaptive switch outperforms both PPO and GRPO on tasks where either one was originally stronger.
  • Batch EV tracks critic maturation and signals when the critic becomes reliable during training.
  • The zero-EV threshold derived from the Kalman model is empirically optimal for the switch decision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar per-batch variance monitoring could decide when to trust other learned value functions in sparse-reward RL beyond the PPO-GRPO pair.
  • The Kalman unification offers a way to analyze and combine additional baseline estimators without manual selection.
  • The approach may reduce hyperparameter search effort when scaling RL post-training to new tasks with unknown reward sparsity.

Load-bearing premise

That explained variance measured on a single training batch reliably signals whether the critic reduces or increases advantage variance in sparse-reward LLM post-training.
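
A toy check of this premise under the idealized independent-noise model (all numbers hypothetical): a critic with small observation noise yields positive EV and lower advantage variance than the batch mean, and a noisy critic flips both at once.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 256                                       # batch size (illustrative)
    true_v = rng.normal(0.0, 1.0, n)              # per-state true values
    returns = true_v + rng.normal(0.0, 1.0, n)    # returns = value + reward noise

    for critic_noise in (0.3, 3.0):               # accurate vs. noisy critic
        values = true_v + rng.normal(0.0, critic_noise, n)
        ev = 1.0 - np.var(returns - values) / np.var(returns)
        var_critic = np.var(returns - values)           # PPO-style advantages
        var_mean = np.var(returns - returns.mean())     # GRPO-style advantages
        print(f"noise {critic_noise}: EV {ev:+.2f}, "
              f"critic var {var_critic:.2f}, batch-mean var {var_mean:.2f}")

Whether the same sign agreement survives when the critic's error is correlated with the returns, as in real training, is the open question the referee report below presses on.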

What would settle it

An experiment on any of the four tasks in which EVPO's measured advantage variance exceeds the variance of both PPO and GRPO at one or more steps, or in which a non-zero EV threshold produces lower final variance than the zero threshold.
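
The second condition amounts to sweeping a threshold τ in a generalized gate and checking whether τ = 0 wins, which is what Figure 6 reports. A sketch of that generalization, reusing the illustrative explained_variance helper from the pith above:

    def evpo_advantages_tau(returns, values, tau=0.0):
        # Generalized gate: tau = 0.0 is the theoretically derived rule;
        # sweeping tau over e.g. [-0.1, 0.1] tests its optimality.
        if explained_variance(returns, values) > tau:
            return returns - values
        return returns - returns.mean()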

Figures

Figures reproduced from arXiv: 2604.19485 by Binghai Wang, Chengjun Pan, Dingwei Zhu, Jiahang Lin, Jiazheng Zhang, Rui Zheng, Shichun Liu, Shihan Dou, Songyang Gao, Tao Gui, Xuanjing Huang, Yansong Feng, Zhenhua Han.

Figure 1. Overview of EVPO. The EV gating module selects between the critic baseline (EV…

Figure 2. Training dynamics of PPO on Sokoban. (a) Batch-level EV over training steps. (b) Policy gradient norms. (c) Training success rate. The shaded region in (a–c) marks the early phase where EV is low. (d) Effect of injecting Gaussian noise into a converged critic on FrozenLake.

Figure 3. Reducing critic variance helps the actor.

Figure 4. Training and validation curves across four tasks.

Figure 5. Gating trigger probability (EV_b > 0, i.e., selection of critic over batch mean) across training. Finding 3: in most tasks, gating frequency decreases monotonically over training, reflecting critic maturation from P_A > P_B to P_A < P_B.

Figure 6. Threshold sensitivity analysis. Top row: training metrics (smoothed); bottom row: validation metrics (unsmoothed). (a, c, e, g) Best success rate as a function of the EV switching threshold; stars mark τ = 0. (b, d, f, h) Full curves under each threshold setting. Performance peaks at τ = 0 on both tasks. Finding 4: performance peaks at EV = 0 and degrades in either direction, with positive and negative deviati…
Original abstract

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in sparse-reward LLM post-training, a learned critic can increase rather than reduce advantage variance due to estimation noise. Casting baseline selection as a Kalman filtering problem unifies PPO (critic-based) and GRPO (batch-mean) as extremes of the Kalman gain. It proves that batch-level explained variance (EV) identifies the exact boundary: positive EV means the critic reduces variance while zero or negative EV means it inflates variance. EVPO adaptively switches between the two estimators at each step based on this threshold, provably achieving variance no greater than the better of the two at every step. Experiments across four tasks (classical control, agentic interaction, mathematical reasoning) show EVPO outperforming both fixed baselines regardless of which is stronger on a given task, with further analysis confirming the adaptive rule tracks critic maturation.

Significance. If the theoretical guarantee holds under the paper's modeling assumptions, this provides a principled, parameter-free adaptive mechanism for critic utilization in RL post-training of LLMs, addressing a practical tension between classical variance-reduction theory and the adoption of critic-free methods like GRPO. The Kalman unification is elegant, the derivation of the zero-EV threshold from first principles is a strength, and the consistent empirical outperformance across diverse tasks (without task-specific tuning) adds practical value. Credit is given for the explicit analysis of how the gating evolves with critic quality over training.

major comments (2)
  1. [Section 3 (Kalman model and variance proof)] The claim that EV computed from a single batch 'identifies the exact boundary' and yields a provable variance guarantee assumes additive independent noise in the critic observation model. In the sparse-reward LLM setting, the same trajectories are used both to train the critic and to compute batch EV/advantages, inducing correlation between critic errors and returns. This correlation can reverse the sign of the true variance difference, breaking the guarantee that the chosen estimator has variance no larger than the better of the two. The paper must either relax the independence assumption in the derivation or provide a separate robustness argument/empirical check that the sign of EV still correctly identifies the variance-minimizing choice under realistic correlation.
  2. [Section 5 (empirical evaluation)] The central empirical claim is that EVPO 'consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task.' This requires showing that the adaptive switches actually select the lower-variance estimator in practice. The results should report per-task variance across random seeds, the fraction of steps where EVPO matches the better fixed method, and whether the zero threshold was fixed a priori or validated post-hoc; without these, it is unclear whether the gains stem from the theoretical rule or from incidental benefits of switching.
minor comments (2)
  1. [Section 2 or 3] The exact formula for batch explained variance (EV) should be stated explicitly (including the precise definition of 'explained' versus total variance) in the main text rather than deferred to an appendix, as it is load-bearing for reproducibility of the adaptive rule.
  2. [Figures in Section 5] Figure captions for the EV-over-training plots should annotate the points at which the method switches baselines to make the adaptive behavior visually verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below, clarifying our theoretical assumptions and committing to expanded empirical reporting. Revisions will be incorporated in the next version of the paper.

Point-by-point responses
  1. Referee: The claim that EV computed from a single batch 'identifies the exact boundary' and yields a provable variance guarantee assumes additive independent noise in the critic observation model. In the sparse-reward LLM setting, the same trajectories are used both to train the critic and to compute batch EV/advantages, inducing correlation between critic errors and returns. This correlation can reverse the sign of the true variance difference, breaking the guarantee that the chosen estimator has variance no larger than the better of the two. The paper must either relax the independence assumption in the derivation or provide a separate robustness argument/empirical check that the sign of EV still correctly identifies the variance-minimizing choice under realistic correlation.

    Authors: We agree that the exact zero-EV threshold in the variance proof relies on the additive independent noise assumption in the Kalman observation model. The correlation induced by reusing trajectories for both critic training and EV/advantage computation is a valid concern that could affect the guarantee in principle. To address this, we will revise Section 3 to explicitly state the independence assumption and add a dedicated robustness subsection. This subsection will (i) derive approximate bounds showing that the sign of batch EV remains indicative of variance reduction under moderate positive correlations typical in RL batches, and (ii) include an empirical verification on held-out trajectory subsets confirming that positive EV still selects the lower-variance estimator in our experimental regimes. These additions preserve the core Kalman unification while directly responding to the correlation issue. revision: partial
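
A minimal sketch of the held-out verification proposed in (ii), assuming per-trajectory returns and critic values are logged for each batch (names illustrative): compute EV on one half of the batch and test, on the disjoint other half, whether its sign picks the lower-variance estimator, so the EV estimate is decoupled from the data it is judged on.

    import numpy as np

    def heldout_ev_check(returns, values, rng, frac=0.5):
        idx = rng.permutation(len(returns))
        cut = int(frac * len(returns))
        a, b = idx[:cut], idx[cut:]
        # EV measured on half a only.
        ev = 1.0 - np.var(returns[a] - values[a]) / np.var(returns[a])
        # Realized advantage variances on the disjoint half b.
        var_critic = np.var(returns[b] - values[b])
        var_mean = np.var(returns[b] - returns[b].mean())
        agrees = (ev > 0.0) == (var_critic < var_mean)
        return ev, agrees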

  2. Referee: The central empirical claim is that EVPO 'consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task.' This requires showing that the adaptive switches actually select the lower-variance estimator in practice. The results should report per-task variance across random seeds, the fraction of steps where EVPO matches the better fixed method, and whether the zero threshold was fixed a priori or validated post-hoc; without these, it is unclear whether the gains stem from the theoretical rule or from incidental benefits of switching.

    Authors: We concur that the current empirical section would benefit from these additional diagnostics to more directly link performance gains to the adaptive rule. In the revised manuscript we will augment Section 5 with: (1) per-task means and standard deviations of key metrics (e.g., return variance and final performance) across five random seeds for EVPO, PPO, and GRPO; (2) the fraction of training steps at which EVPO selects the critic-based estimator versus the batch-mean estimator, together with the percentage of steps where this choice coincides with the empirically lower-variance fixed method on that task; and (3) explicit documentation that the zero-EV threshold was fixed a priori from the theoretical derivation, accompanied by a sensitivity plot varying the threshold over [-0.1, 0.1] to demonstrate that zero remains near-optimal. These additions will substantiate that the observed improvements arise from the principled switching mechanism. revision: yes
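
A sketch of diagnostic (2), assuming per-step logs of the gate decision and of the advantage variance each fixed baseline would have produced on that step's batch (all names illustrative):

    import numpy as np

    def gating_diagnostics(chose_critic, var_ppo, var_grpo):
        # chose_critic: bool per training step (True = critic baseline).
        # var_ppo / var_grpo: realized advantage variance of each fixed
        # method on that step's batch.
        chose_critic = np.asarray(chose_critic)
        better_is_critic = np.asarray(var_ppo) < np.asarray(var_grpo)
        return {
            "frac_steps_critic": float(chose_critic.mean()),
            "frac_matches_lower_variance":
                float((chose_critic == better_is_critic).mean()),
        }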

Circularity Check

0 steps flagged

No significant circularity; derivation is model-based and self-contained

Full rationale

The paper casts baseline selection as a Kalman filtering problem and derives the EV-based switching rule mathematically from the properties of the Kalman gain (positive EV selects critic-based advantages; non-positive selects batch-mean). Explained variance is computed directly as a batch statistic from the same trajectories, without any fitted parameters being renamed as predictions. The unification of PPO and GRPO is an application of standard filtering theory rather than an ansatz or result imported via self-citation. No step reduces the claimed guarantee to a tautology or self-referential quantity by construction; the proof holds inside the stated model assumptions even if those assumptions are later questioned on empirical grounds.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on modeling advantage estimation as a Kalman filtering problem and treating batch-level explained variance as a reliable indicator of critic utility. No free parameters or new entities are introduced in the abstract description.

axioms (1)
  • domain assumption: Advantage estimation in sparse-reward settings can be modeled as a Kalman filtering problem in which the learned critic acts as a noisy observation of the true state value.
    This modeling choice is used to unify PPO and GRPO as extremes of the Kalman gain and to derive the role of explained variance.

pith-pipeline@v0.9.0 · 5589 in / 1582 out tokens · 48324 ms · 2026-05-10T02:56:04.270939+00:00 · methodology

