pith. machine review for the scientific record.

arxiv: 2509.25424 · v6 · submitted 2025-09-29 · 💻 cs.LG · cs.AI

Polychromic Objectives for Reinforcement Learning

Pith reviewed 2026-05-18 11:51 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · policy gradient · diversity · exploration · fine-tuning · PPO · vine sampling

The pith

A polychromic objective for policy gradients prevents pretrained RL policies from losing behavioral diversity during fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an objective that explicitly rewards refinement across many distinct generations rather than convergence to a few high-reward outputs. Standard RL fine-tuning often collapses diversity, which blocks further exploration and wastes the potential of test-time compute. The method adapts PPO by collecting on-policy data through vine sampling and redefining the advantage to reflect gains under the new objective. Experiments across BabyAI, Minigrid, and algorithmic tasks show the resulting policies solve more environment configurations, generalize under large changes, and cover more strategies when allowed multiple attempts.
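The vine-sampling step mentioned above can be illustrated with a toy example. The sketch below assumes an environment whose state can be snapshotted and restored; `ChainEnv`, `rollout`, `vine_rollouts`, and `random_policy` are hypothetical names introduced for illustration and are not taken from the paper.

```python
import copy
import random


class ChainEnv:
    """Hypothetical toy environment: walk on a line, reward 1 on reaching `goal`."""

    def __init__(self, goal=3, horizon=6):
        self.goal, self.horizon = goal, horizon
        self.pos, self.t = 0, 0

    def step(self, action):
        # action is -1 (left) or +1 (right)
        self.pos += action
        self.t += 1
        done = self.pos == self.goal or self.t >= self.horizon
        reward = 1.0 if self.pos == self.goal else 0.0
        return reward, done


def rollout(env, policy, rng):
    """Run `policy` from the env's current state; return (actions, total reward)."""
    actions, total, done = [], 0.0, False
    while not done:
        action = policy(env, rng)
        reward, done = env.step(action)
        actions.append(action)
        total += reward
    return tuple(actions), total


def vine_rollouts(env, policy, n_branches, rng):
    """Branch several rollouts from a snapshot of the same intermediate state."""
    return [rollout(copy.deepcopy(env), policy, rng) for _ in range(n_branches)]


def random_policy(env, rng):
    return rng.choice((-1, 1))


rng = random.Random(0)
env = ChainEnv()
env.step(1)  # advance the trunk trajectory to an intermediate state
branches = vine_rollouts(env, random_policy, n_branches=8, rng=rng)
```

Each branch starts from a deep copy, so the trunk environment is left untouched and every rollout is drawn from the current policy at the same state.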

Core claim

Optimizing a polychromic objective, which requires the policy to explore and refine a diverse set of generations, allows proximal policy optimization to avoid collapse into narrow behaviors and instead produce agents that reliably cover a larger fraction of solvable tasks while preserving a broad repertoire of strategies.

What carries the argument

The polychromic objective, realized by vine sampling to gather on-policy rollouts and a modified advantage function inside PPO that scores actions according to their contribution to diversity-preserving improvement.
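One way to picture a set-level advantage is sketched below. This page does not state the paper's exact objective, so the scoring rule used here (mean reward of a set multiplied by the fraction of distinct trajectories in it) is an assumption chosen for illustration, and `set_score` and `set_advantages` are hypothetical helper names.

```python
from itertools import combinations
from statistics import mean


def set_score(rollouts):
    """Assumed set score: mean reward times fraction of distinct trajectories."""
    rewards = [reward for _, reward in rollouts]
    distinct = len({traj for traj, _ in rollouts})
    return mean(rewards) * distinct / len(rollouts)


def set_advantages(rollouts, set_size):
    """Score every subset of size `set_size`, subtract the mean score as a
    baseline, and give each rollout the average advantage of its subsets."""
    subsets = list(combinations(range(len(rollouts)), set_size))
    scores = [set_score([rollouts[i] for i in idx]) for idx in subsets]
    baseline = mean(scores)
    per_rollout = [[] for _ in rollouts]
    for idx, score in zip(subsets, scores):
        for i in idx:
            per_rollout[i].append(score - baseline)
    return [mean(values) for values in per_rollout]


# Hypothetical rollouts from one state: (trajectory, reward).
rollouts = [
    (("L", "L", "U"), 1.0),
    (("L", "L", "U"), 1.0),  # duplicate of the first trajectory
    (("U", "L", "L"), 1.0),  # successful and distinct
    (("R",), 0.0),           # distinct but unsuccessful
]
adv = set_advantages(rollouts, set_size=2)
```

Under this assumed score, the successful trajectory that differs from the others receives the largest advantage, the duplicated successes sit at the baseline, and the failed trajectory falls below it.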

If this is right

  • Policies continue to explore new behaviors instead of converging to a handful of repeatable outputs.
  • Higher success rates emerge because the agent solves a wider range of environment configurations.
  • Generalization improves under large perturbations because multiple distinct strategies remain available.
  • Pass@k performance rises because the policy retains and can deploy a broader set of successful trajectories.
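For readers who want to compute pass@k from logged attempts, a minimal sketch follows. It uses the widely used combinatorial estimator (n recorded attempts, c of them successful); this page does not say which estimator the authors use, so treat the choice as an assumption. `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k attempts, drawn without replacement
    from n recorded attempts of which c succeeded, is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 10 recorded attempts and 3 successes, `pass_at_k(10, 3, 1)` equals the plain success rate, and the value rises toward 1 as k grows.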

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same objective could be applied to fine-tuning large language models where output diversity also collapses.
  • Test-time compute scaling may yield larger gains when the base policy already maintains many distinct solution paths.
  • Combining the polychromic term with other explicit exploration bonuses could further enlarge the set of reachable behaviors.

Load-bearing premise

Vine sampling plus the modified advantage produces on-policy data whose diversity statistics stay stable and representative without adding bias to the policy gradient estimate.
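As a reference point for this premise, the standard argument for why subtracting an action-independent baseline b(s) adds no bias to a policy gradient is worked out below for a discrete action space. Whether the paper's modified advantage has this form is not established on this page, so b(s) is a generic placeholder rather than the paper's quantity.

```latex
\begin{align*}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right]
&= b(s) \sum_{a} \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
&= b(s) \sum_{a} \nabla_\theta \pi_\theta(a \mid s) \\
&= b(s)\, \nabla_\theta \sum_{a} \pi_\theta(a \mid s)
 = b(s)\, \nabla_\theta 1 = 0 .
\end{align*}
```

The first step fails if the subtracted term depends on the sampled action a, which is the kind of dependence a check of the estimator would need to rule out.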

What would settle it

Run standard PPO and the polychromic variant side-by-side on the same BabyAI or Minigrid suite; if the polychromic version shows no increase in the number of solved configurations or in pass@k coverage after the same number of updates, the central claim does not hold.
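The comparison described above can be tallied with a few lines of code. The sketch below assumes evaluation logs keyed by configuration id with 0/1 episode outcomes; the threshold `min_rate`, the helper `solved_configs`, and the example data are all hypothetical and are not results from the paper.

```python
def solved_configs(successes, min_rate=0.5):
    """Return ids of configurations whose empirical success rate reaches `min_rate`."""
    return {
        cfg
        for cfg, outcomes in successes.items()
        if outcomes and sum(outcomes) / len(outcomes) >= min_rate
    }


# Made-up logs for illustration only: configuration id -> episode outcomes.
ppo_log = {"cfg0": [1, 1, 1, 1], "cfg1": [0, 0, 0, 0], "cfg2": [0, 0, 1, 0]}
poly_log = {"cfg0": [1, 1, 0, 1], "cfg1": [1, 0, 1, 1], "cfg2": [0, 0, 0, 0]}

ppo_solved = solved_configs(ppo_log)
poly_solved = solved_configs(poly_log)
gained = poly_solved - ppo_solved  # solved only by the polychromic variant
lost = ppo_solved - poly_solved    # solved only by standard PPO
```

Reporting both `gained` and `lost` guards against a net count that hides configurations the new method no longer solves.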

Figures

Figures reproduced from arXiv: 2509.25424 by Chelsea Finn, Dorsa Sadigh, Ellen Xu, Ifdita Hasan Orney, Jubayer Ibn Hamid.

Figure 1
Figure 1: The set value of a state (circled) is the expected discounted return of the subtree (highlighted) rooted in this state. We use the notation V♯ and Q♯ to distinguish it from the value function and Q-function in standard RL. Here, we assume that the discount factor γ ∈ (0, 1/n) to ensure values remain bounded; the range is smaller since we add the expected sum of rewards from n actions stemming out of eac…
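The restriction on γ in the Figure 1 caption can be motivated with a short calculation. The sketch rests on assumptions, since the caption is truncated: rewards are bounded by R_max, every state in the subtree branches into n actions, and rewards at depth t are discounted by γ^t. There are then n^t states at depth t, each contributing at most n · R_max.

```latex
\[
\bigl|V^{\sharp}(s)\bigr|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t}\, n^{t} \cdot n\, R_{\max}
\;=\; n\, R_{\max} \sum_{t=0}^{\infty} (n\gamma)^{t}
\;=\; \frac{n\, R_{\max}}{1 - n\gamma},
\qquad \text{finite when } n\gamma < 1 .
\]
```

With n = 1 this reduces to the familiar R_max / (1 − γ) bound of standard RL.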
Figure 2
Figure 2: Results on Algorithmic Creativity. Bars show normalized values for each metric, with raw values above each bar. We compare polychromic PPO (Poly-PPO) with REINFORCE with baseline [39] and standard PPO [29]. Furthermore, we compare with a UCB-style regularization [1] where we add λ_UCB · min{1, N(s, a)^(−1/2)} to every advantage Â(s, a). Here, N(s, a) is the number of times that action a was sampled from s…
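The UCB-style bonus in the Figure 2 caption can be sketched as follows, reading the caption's term as λ_UCB · min{1, N(s, a)^(−1/2)}. The helper name `ucb_bonus`, the count table, and the treatment of unseen pairs (bonus capped at λ_UCB) are assumptions made for illustration.

```python
from collections import Counter


def ucb_bonus(counts, state, action, lam):
    """Count-based bonus lam * min(1, N(s, a) ** -0.5).

    `counts` maps (state, action) to the number of times the action was
    sampled from that state. Unseen pairs get the capped value `lam`
    (an assumption, since N = 0 is undefined in the formula).
    """
    n = counts.get((state, action), 0)
    if n == 0:
        return lam
    return lam * min(1.0, n ** -0.5)


# Hypothetical visit counts and advantage estimate.
counts = Counter({("s0", "a"): 4, ("s0", "b"): 1})
advantage = 0.5
shaped = advantage + ucb_bonus(counts, "s0", "a", lam=0.2)
```

The bonus shrinks as a state-action pair is sampled more often, so rarely tried actions receive a larger boost to their advantage.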
Figure 3
Figure 3: Pass@k on BabyAI tasks. Top: methods without UCB. Bottom: methods with UCB. Columns show Goto, Pickup, Synthseq, and Bosslevel. Each curve is pass rate vs. number of attempts. valid triangles that were not seen in the pretraining data [22]. We also report validity (number of valid triangles constructed), and diversity (unique number of valid triangles). PPO substantially increase validity compared to the p…
Figure 4
Figure 4: Pass@k results on Algorithmic Creativity. For validity pass@k and creativity pass@k, the agent gets a pass if at least one of the k attempts was a valid and creative triangle, respectively. In diff@k evaluation, we evaluate the number of generations that were unique given k attempts. comparison, Poly-PPO achieves substantially higher pass rate than all baselines. It also achieves equal or higher pass rate …
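The per-prompt metrics named in the Figure 4 caption can be computed as in the sketch below. The validity and novelty checks are stand-ins: here a generation is "valid" if it belongs to a hypothetical set `valid`, and "creative" if it is valid and absent from a hypothetical set `seen` of pretraining examples. The helper names are not from the paper.

```python
def passes_any(attempts, predicate):
    """True if at least one of the k attempts satisfies `predicate`."""
    return any(predicate(g) for g in attempts)


def diff_at_k(attempts):
    """Number of unique generations among the k attempts."""
    return len(set(attempts))


# Hypothetical data for one prompt.
valid = {"abc", "abd", "bcd"}  # generations that count as valid
seen = {"abc"}                 # generations present in pretraining data
attempts = ["abc", "abc", "abd", "xyz"]

validity_pass = passes_any(attempts, lambda g: g in valid)
creativity_pass = passes_any(attempts, lambda g: g in valid and g not in seen)
unique_count = diff_at_k(attempts)
```

Averaging these per-prompt values over an evaluation set gives the corresponding pass rates and mean diff@k.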
Figure 5
Figure 5: Example BabyAI environments and their missions.
Figure 6
Figure 6: Generalization under state perturbations in BabyAI BossLevel environment. The mission…
Original abstract

Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks. These pretrained policies, trained on large datasets, produce generations with a broad range of promising but unrefined behaviors. Often, a critical failure mode of RLFT arises when policies lose this diversity and collapse into a handful of easily exploitable outputs. This convergence hinders exploration, which is essential for expanding the capabilities of the pretrained policy and for amplifying the benefits of test-time compute scaling. To address this, we introduce an objective for policy gradient methods that explicitly enforces the exploration and refinement of diverse generations, which we call a polychromic objective. We then show how proximal policy optimization (PPO) can be adapted to optimize this objective. Our method (1) employs vine sampling to collect on-policy rollouts and (2) modifies the advantage function to reflect the advantage under our new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity show that our method improves success rates by reliably solving a larger set of environment configurations and generalizes better under large perturbations. Moreover, when given multiple attempts in pass@$k$ experiments, the policy achieves substantially higher coverage, demonstrating its ability to maintain and exploit a diverse repertoire of strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes polychromic objectives for policy-gradient methods in reinforcement learning. These objectives explicitly encourage exploration and refinement of diverse generations during RL fine-tuning of pretrained policies, countering the common collapse into a small set of exploitable behaviors. The authors adapt PPO via vine sampling for on-policy rollouts and a modified advantage function that reflects the new objective. Experiments on BabyAI, Minigrid, and Algorithmic Creativity report higher success rates across more environment configurations, improved generalization under large perturbations, and substantially higher coverage in pass@k evaluations.

Significance. If the PPO adaptation is shown to optimize the polychromic objective without bias and the empirical gains are substantiated, the work would provide a practical mechanism for preserving behavioral diversity in RL fine-tuning. This directly targets a key limitation that reduces exploration and diminishes returns from test-time compute scaling. The approach could be relevant for domains where maintaining a repertoire of strategies is beneficial.

major comments (2)
  1. [§3] §3 (PPO adaptation): the manuscript does not verify that vine sampling combined with the modified advantage function yields an unbiased policy-gradient estimator for the polychromic objective. If the advantage modification does not exactly correspond to the gradient of the stated objective, or if vine sampling fails to keep the data distribution on-policy across iterations, then observed improvements cannot be attributed to the polychromic objective itself.
  2. [§4] §4 (Experiments): success-rate and pass@k results are presented without reported variance, statistical significance tests, ablation of the two proposed components (vine sampling and advantage modification), or comparison against strong diversity-preserving baselines. This leaves the central empirical claim without the quantitative support needed to assess reliability or generality.
minor comments (1)
  1. [Abstract] Notation for pass@k is written inconsistently as pass@$k$ in the abstract; adopt a uniform mathematical notation throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline the revisions we plan to make in the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (PPO adaptation): the manuscript does not verify that vine sampling combined with the modified advantage function yields an unbiased policy-gradient estimator for the polychromic objective. If the advantage modification does not exactly correspond to the gradient of the stated objective, or if vine sampling fails to keep the data distribution on-policy across iterations, then observed improvements cannot be attributed to the polychromic objective itself.

    Authors: We appreciate the referee pointing out this gap. The current manuscript describes the PPO adaptation using vine sampling and the modified advantage but does not include an explicit derivation or proof of unbiasedness. Vine sampling collects multiple on-policy trajectories from the same starting state under the current policy to estimate the polychromic objective, and the advantage is redefined to reflect the expected improvement under that objective rather than the scalar reward. To address the concern directly, we will add a dedicated subsection to §3 that derives the policy gradient for the polychromic objective and shows that the combination of vine sampling and the modified advantage produces an unbiased estimator (under standard on-policy assumptions). This addition will make clear that the reported gains can be attributed to optimization of the proposed objective. revision: yes

  2. Referee: [§4] §4 (Experiments): success-rate and pass@k results are presented without reported variance, statistical significance tests, ablation of the two proposed components (vine sampling and advantage modification), or comparison against strong diversity-preserving baselines. This leaves the central empirical claim without the quantitative support needed to assess reliability or generality.

    Authors: We agree that the experimental presentation would be strengthened by additional statistical detail and controls. In the revised manuscript we will report means and standard deviations across multiple random seeds for all success-rate and pass@k metrics, along with appropriate statistical significance tests. We will also add ablation studies that isolate the contribution of vine sampling versus the modified advantage function. Finally, we will include comparisons against established diversity-preserving baselines such as entropy-regularized PPO and mutual-information-based exploration methods. These additions will be placed in an expanded §4 and will provide stronger quantitative support for the reliability and generality of the results. revision: yes
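The per-seed reporting promised in the second response could look like the sketch below, which computes a mean and standard deviation per method and a Welch t statistic with its approximate degrees of freedom. It uses only the standard library; the helper names and the per-seed numbers are hypothetical and are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev, variance


def summarize(values):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(values), stdev(values)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up per-seed success rates, for illustration only.
poly_ppo = [0.80, 0.84, 0.78, 0.82, 0.81]
ppo = [0.70, 0.74, 0.69, 0.73, 0.71]

poly_mean, poly_std = summarize(poly_ppo)
t_stat, dof = welch_t(poly_ppo, ppo)
```

Welch's test is used here because it does not assume equal variances across methods; converting `t_stat` and `dof` into a p-value requires a t-distribution CDF from a statistics library.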

Circularity Check

0 steps flagged

Derivation chain is self-contained; no reduction to inputs by construction

full rationale

The paper introduces a new polychromic objective explicitly defined to enforce diversity across generations, distinct from the base reward signal, then describes a standard PPO adaptation via vine sampling for on-policy rollouts and an altered advantage function. These modifications are presented as technical steps to optimize the stated objective rather than as a re-derivation or fit that collapses back to the inputs. Experimental results on BabyAI, Minigrid, and Algorithmic Creativity are offered as external validation of improved success rates, generalization, and pass@k coverage, with diversity statistics measured separately. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided derivation; the central claims therefore retain independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the unstated premise that diversity of generations can be measured and optimized independently of the task reward without creating optimization conflicts that degrade final performance.

axioms (1)
  • domain assumption: Vine sampling produces unbiased on-policy estimates under the modified advantage function.
    Invoked when adapting PPO to the polychromic objective.

pith-pipeline@v0.9.0 · 5756 in / 1233 out tokens · 27646 ms · 2026-05-18T11:51:27.097052+00:00 · methodology

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 18 internal anchors

  1. [1]

    Minimax Regret Bounds for Reinforcement Learning

    Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning, 2017. URL https://arxiv.org/abs/1703.05449

  2. [2]

    Natural actor–critic algorithms

    Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2009.07.008. URL https://www.sciencedirect.com/science/article/pii/S0005109809003549

  3. [3]

    Babyai: A platform to study the sample efficiency of grounded language learning

    Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. Babyai: A platform to study the sample efficiency of grounded language learning, 2019. URL https://arxiv.org/abs/1810.08272

  4. [4]

    Minigrid and miniworld: Modular and customizable reinforcement learning environments for goal-oriented tasks

    Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo de Lazcano, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry. Minigrid and miniworld: Modular and customizable reinforcement learning environments for goal-oriented tasks, 2023. URL https://arxiv.org/abs/2306.13831

  5. [5]

    Inference-aware fine-tuning for best-of-n sampling in large language models

    Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, and Aleksandra Faust. Inference-aware fine-tuning for best-of-n sampling in large language models, 2024. URL https://arxiv.org/abs/2412.15287

  6. [6]

    The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

    Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, and Ning Ding. The entropy mechanism of reinforcement learning for reasoning language models, 2025. URL https://arxiv.org/abs/2505.22617

  7. [7]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  8. [8]

    Thomas Degris, Martha White, and Richard S. Sutton. Off-policy actor-critic, 2013. URL https://arxiv.org/abs/1205.4839

  9. [9]

    The vendi score: A diversity evaluation metric for machine learning

    Dan Friedman and Adji Bousso Dieng. The vendi score: A diversity evaluation metric for machine learning, 2023. URL https://arxiv.org/abs/2210.02410

  10. [10]

    Reinforcement Learning with Deep Energy-Based Policies

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies, 2017. URL https://arxiv.org/abs/1702.08165

  11. [11]

    Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018. URL https://arxiv.org/abs/1801.01290

  12. [12]

    Rewarding the unlikely: Lifting grpo beyond distribution sharpening

    Andre He, Daniel Fried, and Sean Welleck. Rewarding the unlikely: Lifting grpo beyond distribution sharpening, 2025. URL https://arxiv.org/abs/2506.02355

  13. [13]

    Marginalized state distribution entropy regularization in policy optimization

    Riashat Islam, Zafarali Ahmed, and Doina Precup. Marginalized state distribution entropy regularization in policy optimization, 2019. URL https://arxiv.org/abs/1912.05128

  14. [14]

    A natural policy gradient

    Sham M Kakade. A natural policy gradient. In T. Dietterich, S. Becker, and Z. Ghahramani (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press. URL https://proceedings.neurips.cc/paper_files/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf

  16. [16]

    Approximately optimal approximate reinforcement learning

    Sham M. Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, 2002. URL https://api.semanticscholar.org/CorpusID:31442909

  17. [17]

    Vineppo: Refining credit assignment in rl training of llms

    Amirhossein Kazemnejad, Milad Aghajohari, Eva Portelance, Alessandro Sordoni, Siva Reddy, Aaron Courville, and Nicolas Le Roux. Vineppo: Refining credit assignment in rl training of llms, 2025. URL https://arxiv.org/abs/2410.01679

  18. [18]

    One solution is not all you need: Few-shot extrapolation via structured maxent rl

    Saurabh Kumar, Aviral Kumar, Sergey Levine, and Chelsea Finn. One solution is not all you need: Few-shot extrapolation via structured maxent rl, 2020. URL https://arxiv.org/abs/2010.14484

  19. [19]

    Diverse Preference Optimization

    Jack Lanchantin, Angelica Chen, Shehzaad Dhuliawala, Ping Yu, Jason Weston, Sainbayar Sukhbaatar, and Ilia Kulikov. Diverse preference optimization, 2025. URL https://arxiv.org/abs/2501.18101

  20. [20]

    Jointly reinforcing diversity and quality in language model generations

    Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, and Tianlu Wang. Jointly reinforcing diversity and quality in language model generations, 2025. URL https://arxiv.org/abs/2509.02534

  21. [21]

    Continuous control with deep reinforcement learning

    Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Yoshua Bengio and Yann LeCun (eds.), 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. URL http://arxiv.org/abs/1509.02971

  23. [23]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783

  24. [24]

    Roll the dice and look before you leap: Going beyond the creative limits of next-token prediction

    Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, and Aditi Raghunathan. Roll the dice and look before you leap: Going beyond the creative limits of next-token prediction, 2025. URL https://arxiv.org/abs/2504.15266

  25. [25]

    OpenAI, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, Ally Bennett, Ananya Kumar, Andre Saraiva, Andrea Vallone, Andrew Duberstein, Andrew Kond...

  26. [26]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback,...

  27. [27]

    Curiosity-driven exploration by self-supervised prediction

    Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 2778–2787. PMLR, 06–11 Aug 2017. URL https://proceeding...

  28. [28]

    Qwq-32b: Embracing the power of reinforcement learning

    Qwen. Qwq-32b: Embracing the power of reinforcement learning, March 2025. URL https://qwenlm.github.io/blog/qwq-32b/

  29. [29]

    Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms

    Nicolas Le Roux, Marc G. Bellemare, Jonathan Lebensold, Arnaud Bergeron, Joshua Greaves, Alex Fréchette, Carolyne Pelletier, Eric Thibodeau-Laufer, Sándor Toth, and Sam Work. Tapered off-policy reinforce: Stable and efficient reinforcement learning for llms, 2025. URL https://arxiv.org/abs/2503.14286

  30. [30]

    Trust Region Policy Optimization

    John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization, 2017. URL https://arxiv.org/abs/1502.05477

  31. [31]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

  32. [32]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation, 2018. URL https://arxiv.org/abs/1506.02438

  33. [33]

    State entropy maximization with random encoders for efficient exploration

    Younggyo Seo, Lili Chen, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. State entropy maximization with random encoders for efficient exploration, 2021. URL https://arxiv.org/abs/2102.09430

  34. [34]

    Deterministic policy gradient algorithms

    David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Eric P. Xing and Tony Jebara (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 387–395, Bejing, China, 22–24 Jun 2014. PMLR. URL h...

  35. [35]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters, 2024. URL https://arxiv.org/abs/2408.03314

  36. [36]

    Outcome-based exploration for llm reasoning

    Yuda Song, Julia Kempe, and Remi Munos. Outcome-based exploration for llm reasoning. URL https://arxiv.org/abs/2509.06941

  38. [38]

    Reinforcement Learning: An Introduction

    Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018. ISBN 0262039249

  39. [39]

    Policy gradient methods for reinforcement learning with function approximation

    Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 13th International Conference on Neural Information Processing Systems, NIPS’99, pp. 1057–1063, Cambridge, MA, USA, 1999. MIT Press

  40. [40]

    Optimizing language models for inference time objectives using reinforcement learning

    Yunhao Tang, Kunhao Zheng, Gabriel Synnaeve, and Rémi Munos. Optimizing language models for inference time objectives using reinforcement learning, 2025. URL https://arxiv.org/abs/2503.19595

  41. [41]

    Sample Efficient Actor-Critic with Experience Replay

    Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay, 2017. URL https://arxiv.org/abs/1611.01224

  42. [42]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696

  43. [43]

    The invisible leash: Why RLVR may not escape its origin

    Fang Wu, Weihao Xuan, Ximing Lu, Zaid Harchaoui, and Yejin Choi. The invisible leash: Why rlvr may not escape its origin, 2025. URL https://arxiv.org/abs/2507.14843

  44. [44]

    Minari

    Omar G. Younis, Rodrigo Perez-Vicente, John U. Balis, Will Dudley, Alex Davey, and Jordan K Terry. Minari, September 2024. URL https://github.com/Farama-Foundation/Minari

  45. [45]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W...

  46. [46]

    Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, 2025. URL https://arxiv.org/abs/2504.13837

  47. [47]

    Noveltybench: Evaluating language models for humanlike diversity

    Yiming Zhang, Harshita Diddee, Susan Holm, Hanchen Liu, Xinyue Liu, Vinay Samuel, Barry Wang, and Daphne Ippolito. Noveltybench: Evaluating language models for humanlike diversity, 2025. URL https://arxiv.org/abs/2504.05228

  48. [48]

    Echo chamber: Rl post-training amplifies behaviors learned in pretraining

    Rosie Zhao, Alexandru Meterez, Sham Kakade, Cengiz Pehlevan, Samy Jelassi, and Eran Malach. Echo chamber: Rl post-training amplifies behaviors learned in pretraining, 2025. URL https://arxiv.org/abs/2504.07912

  49. [49]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization, 2025. URL https://arxiv.org/abs/2507.18071.

A IMPLEMENTATION DETAILS

BabyAI and MiniGrid. For BabyAI tasks, the policy conditions on the grid image, the agent’s direction emb...