arxiv: 2006.05990 · v1 · pith:ZYXQIT4Pnew · submitted 2020-06-10 · 💻 cs.LG · stat.ML

What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study

classification 💻 cs.LG stat.ML
keywords on-policy, agents, algorithms, choices, continuous, control, different, empirical

In recent years, on-policy reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations involve numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to a discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [Engstrom'20]. As a step towards filling that gap, we implement >50 such "choices" in a unified on-policy RL framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250,000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for on-policy training of RL agents.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

    cs.LG 2026-06 conditional novelty 8.0

    RL agent for online LHC trigger threshold tuning improves in-tolerance intervals by 28-56% on Monte Carlo and real CMS data without fine-tuning.

  2. Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

    cs.LG 2026-06 unverdicted novelty 7.0

    Reinforcement learning improves online trigger threshold tuning at the LHC, boosting in-tolerance performance by 28-56% on simulated and real data.

  3. PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

    cs.LG 2026-06 conditional novelty 7.0

    PowerOPD applies the Box-Cox power transformation to create natively bounded, sign-consistent rewards for on-policy distillation, delivering up to +6.37 Avg@8 gains over vanilla OPD on math reasoning benchmarks while ...

  4. When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

    cs.LG 2026-05 unverdicted novelty 7.0

    PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.

  5. When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?

    cs.LG 2026-05 unverdicted novelty 7.0

    PromptPO shows LLMs can act as black-box policy optimizers for sequential RL, matching or exceeding standard baselines with fewer interactions in exploration and robotics tasks when leveraging prior knowledge, but und...

  6. Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

    cs.LG 2026-05 unverdicted novelty 7.0

    RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.

  7. TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency

    quant-ph 2026-05 unverdicted novelty 7.0

    TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.

  8. Mastering Diverse Domains through World Models

    cs.AI 2023-01 unverdicted novelty 7.0

    DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.

  9. Hint Tuning: Less Data Makes Better Reasoners

    cs.CL 2026-05 unverdicted novelty 6.0

    Hint Tuning reduces token usage 24-66% (31.5% avg) in reasoning models via 1K self-annotated samples aligned to an instruct model's capabilities while keeping benchmark accuracy.

  10. Hint Tuning: Less Data Makes Better Reasoners

    cs.CL 2026-05 unverdicted novelty 6.0

    Hint Tuning uses an instruct model as a difficulty probe to create 1K multi-level hint examples that train reasoning models to calibrate chain-of-thought length, cutting tokens by 31.5% on average across 4B-32B models...

  11. Application of Deep Reinforcement Learning to Event-Triggered Control for Networked Artificial Pancreas Systems

    eess.SY 2026-04 unverdicted novelty 6.0

    A DRL-based event-triggered controller for networked artificial pancreas systems uses blood glucose change rules to formulate control as a semi-Markov decision process, improving communication efficiency.

  12. From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

    cs.LG 2026-04 unverdicted novelty 6.0

    PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baseli...

  13. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

    cs.LG 2022-01 conditional novelty 6.0

    More capable RL agents exploit reward misspecifications more often, with phase transitions in behavior, and anomaly detectors can identify misaligned policies.

  14. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

    cs.RO 2021-08 accept novelty 6.0

    A comprehensive benchmark study of offline imitation learning methods on multi-stage robot manipulation tasks identifies key sensitivities to algorithm design, data quality, and stopping criteria while releasing all d...

  15. TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

    cs.AI 2026-05 unverdicted novelty 5.0

    TOPPO reformulates PPO with critic balancing to address gradient ill-conditioning in multi-task RL and reports stronger mean and tail performance than SAC baselines on Meta-World+ using fewer parameters and steps.

  16. Application of Deep Reinforcement Learning to Event-Triggered Control for Networked Artificial Pancreas Systems

    eess.SY 2026-04 unverdicted novelty 5.0

    A DRL-based event-triggered controller for artificial pancreas systems uses blood glucose change rules to reduce communication frequency while maintaining control performance via an SMDP formulation.

  17. Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    LoRA applied to critics in SAC and FastTD3 reduces critic loss and yields best or competitive policy performance on most evaluated tasks.