arxiv: 2407.17032 · v4 · submitted 2024-07-24 · 💻 cs.LG · cs.DL


Gymnasium: A Standard Interface for Reinforcement Learning Environments

Mark Towers, Ariel Kwiatkowski, Jordan Terry, John U. Balis, Gianluca De Cola, Tristan Deleu, Manuel Goulão, Andreas Kallinteris, Markus Krimmel, Arjun KG, Rodrigo Perez-Vicente, Andrea Pierré, Sander Schulhoff, Jun Jet Tai, Hannah Tan, Omar G. Younis


Pith reviewed 2026-05-11 17:24 UTC · model grok-4.3


keywords reinforcement learning · environment interface · standard API · interoperability · reproducibility · open source library · RL environments


The pith

Gymnasium supplies a standard API for reinforcement learning environments to enable interoperability across implementations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gymnasium as an open-source library that defines a common interface for reinforcement learning environments. It targets the fragmentation where incompatible environment and algorithm code makes it hard to compare results or extend prior work. The library supplies abstractions for compatibility, ready environments, customization tools, and reproducibility features. If the approach works, researchers spend less time on setup and more on testing new ideas. A reader would care because this kind of standardization has sped progress in other fields by letting people combine components without rewriting interfaces.

Core claim

Gymnasium is an open-source library that provides a standard API for RL environments. Its main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.

What carries the argument

The standard API: a set of abstractions that support interoperability between any compatible environment and any training algorithm.
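To make the interoperability claim concrete, here is a minimal pure-Python sketch of the contract. It follows the `reset(seed=..., options=...) -> (observation, info)` and `step(action) -> (observation, reward, terminated, truncated, info)` shapes that Gymnasium documents, but it does not import the library. `RandomWalkEnv`, `run_episode`, and `always_right` are hypothetical names invented for illustration, not part of Gymnasium or the paper.

```python
import random


class RandomWalkEnv:
    """Hypothetical toy environment: walk on the integers until |position| == goal."""

    def __init__(self, goal=3, max_steps=50):
        self.goal = goal
        self.max_steps = max_steps
        self._rng = random.Random()
        self._pos = 0
        self._t = 0

    def reset(self, *, seed=None, options=None):
        # Reseed only when a seed is supplied, so reset(seed=s) replays the same episode.
        if seed is not None:
            self._rng = random.Random(seed)
        self._pos, self._t = 0, 0
        return self._pos, {}

    def step(self, action):
        # action 1 moves right, anything else moves left; 20% of moves slip the other way.
        move = 1 if action == 1 else -1
        if self._rng.random() < 0.2:
            move = -move
        self._pos += move
        self._t += 1
        terminated = abs(self._pos) >= self.goal                   # reached a terminal state
        truncated = not terminated and self._t >= self.max_steps   # ran out of time
        reward = 1.0 if self._pos >= self.goal else 0.0
        return self._pos, reward, terminated, truncated, {}


def run_episode(env, policy, seed=None):
    """Algorithm-side loop that relies only on reset() and step()."""
    obs, info = env.reset(seed=seed)
    total, trajectory = 0.0, [obs]
    while True:
        obs, reward, terminated, truncated, info = env.step(policy(obs))
        total += reward
        trajectory.append(obs)
        if terminated or truncated:
            return total, trajectory


def always_right(obs):
    return 1


# Two fresh environments with the same seed should produce the same trajectory.
ret_a, traj_a = run_episode(RandomWalkEnv(), always_right, seed=7)
ret_b, traj_b = run_episode(RandomWalkEnv(), always_right, seed=7)
```

Because `run_episode` touches nothing but `reset` and `step`, any object exposing those two methods could be substituted for `RandomWalkEnv`. That substitution is the kind of interoperability the summary describes.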

If this is right

Different research groups can run each other's environments and algorithms without rewriting code.
Reproducibility checks become easier because the same environment definition works across labs.
New algorithms can be tested on a wider set of environments with minimal additional effort.
Customization tools allow rapid creation of variants while keeping the core interface intact.
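The last point, creating variants while keeping the core interface intact, can be sketched with a wrapper pattern. This is an illustrative pure-Python version under the same `reset`/`step` contract, not Gymnasium's own wrapper classes. `CountdownEnv`, `Wrapper`, `ScaleReward`, `StepLimit`, and `rollout` are hypothetical names.

```python
class CountdownEnv:
    """Hypothetical toy environment: terminates after n steps, reward 1.0 per step."""

    def __init__(self, n=5):
        self.n = n
        self._left = n

    def reset(self, *, seed=None, options=None):
        self._left = self.n
        return self._left, {}

    def step(self, action):
        self._left -= 1
        return self._left, 1.0, self._left == 0, False, {}


class Wrapper:
    """Forwards reset/step unchanged; subclasses override only what they modify."""

    def __init__(self, env):
        self.env = env

    def reset(self, *, seed=None, options=None):
        return self.env.reset(seed=seed, options=options)

    def step(self, action):
        return self.env.step(action)


class ScaleReward(Wrapper):
    """Multiplies every reward by a constant factor."""

    def __init__(self, env, scale):
        super().__init__(env)
        self.scale = scale

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, reward * self.scale, terminated, truncated, info


class StepLimit(Wrapper):
    """Sets truncated=True once max_steps have elapsed without termination."""

    def __init__(self, env, max_steps):
        super().__init__(env)
        self.max_steps = max_steps
        self._t = 0

    def reset(self, *, seed=None, options=None):
        self._t = 0
        return self.env.reset(seed=seed, options=options)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._t += 1
        if self._t >= self.max_steps and not terminated:
            truncated = True
        return obs, reward, terminated, truncated, info


def rollout(env):
    """Runs one episode with a fixed action and reports how it ended."""
    obs, info = env.reset()
    total, steps = 0.0, 0
    while True:
        obs, reward, terminated, truncated, info = env.step(0)
        total += reward
        steps += 1
        if terminated or truncated:
            return total, steps, terminated, truncated


base = rollout(CountdownEnv(n=5))
variant = rollout(StepLimit(ScaleReward(CountdownEnv(n=5), scale=0.5), max_steps=3))
```

The same `rollout` function runs both the base environment and the stacked variant. The base episode ends by termination and the variant by truncation, which is why the two flags are reported separately.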

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use could turn the current collection of environments into a shared benchmark suite that new methods must beat.
The same abstraction pattern might be copied for simulation interfaces in robotics or game AI beyond reinforcement learning.
If adoption grows, older non-standard code bases may be wrapped once and then reused indefinitely under the new interface.
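The "wrapped once" idea in the last item can be illustrated with a small adapter. The sketch assumes a legacy environment whose `reset()` returns only an observation and whose `step()` returns a 4-tuple `(obs, reward, done, info)`, with time-limit endings flagged under the `"TimeLimit.truncated"` info key. That is one common convention in older Gym-style code, not a universal one. `LegacyCounter` and `LegacyAdapter` are hypothetical names, and the adapter ignores `seed` because this legacy class offers no way to pass it through.

```python
class LegacyCounter:
    """Hypothetical old-style env: reset() -> obs, step() -> (obs, reward, done, info)."""

    def __init__(self, limit=4):
        self.limit = limit
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.limit
        # Assumed convention: a time-limit ending is flagged through this info key.
        info = {"TimeLimit.truncated": True} if done else {}
        return self.t, float(action), done, info


class LegacyAdapter:
    """Exposes a legacy env through the 5-tuple reset/step contract."""

    def __init__(self, env):
        self.env = env

    def reset(self, *, seed=None, options=None):
        # seed and options are accepted for interface compatibility but ignored here.
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Split the single legacy `done` flag into the two modern flags.
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = bool(done) and not truncated
        return obs, reward, terminated, truncated, info


env = LegacyAdapter(LegacyCounter(limit=4))
first_obs, first_info = env.reset(seed=0)
results = [env.step(1) for _ in range(4)]
last = results[-1]
```

In this run the final step reports `terminated=False` and `truncated=True`, since the toy environment only ever stops on its step limit. A real adapter would also need to handle seeding and rendering, which this sketch leaves out.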

Load-bearing premise

That the reinforcement learning community will adopt this API broadly enough for the interoperability and reproducibility benefits to appear.

What would settle it

A count of recent RL papers that shows most new experiments still use custom or non-Gymnasium environment wrappers instead of the standard interface.

Original abstract

Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces Gymnasium, an open-source library that provides a standard API for reinforcement learning environments. Its core features are abstractions enabling interoperability between environments and training algorithms, a collection of environments, tools for environment customization, and tools to support reproducibility and robustness in RL research. The library aims to reduce implementation overhead and allow researchers to focus on innovation.

Significance. If the described API and tools see adoption, the paper makes a significant practical contribution to the RL community: it extends the standardization efforts of its predecessor (OpenAI Gym) with updated abstractions and reproducibility features. The open-source release, with accompanying code and robustness tools, is a concrete strength that can make experiments more comparable and reliable across environments and algorithms.

minor comments (2)

The abstract repeats the theme of streamlining research and reducing implementation details in consecutive sentences; condensing this would improve conciseness and readability.
The GitHub link is provided at the end of the abstract, but the manuscript would benefit from a brief dedicated paragraph or subsection early in the text describing installation, basic usage, and where to find documentation or example scripts.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our manuscript and their recommendation to accept. We appreciate the recognition that Gymnasium extends the standardization efforts of OpenAI Gym with updated abstractions and reproducibility features, providing a practical contribution to the RL community.

Circularity Check

0 steps flagged

No circularity: purely descriptive software release with no derivations or predictions


The manuscript is a software library announcement describing Gymnasium's API, environments, and reproducibility tools. It contains no equations, no fitted parameters, no predictions, and no derivation chain. The central claim is that the provided abstractions enable interoperability; this is presented as a design choice and implementation fact, not derived from prior results or self-citations. No load-bearing steps reduce to inputs by construction, and the paper does not invoke uniqueness theorems or ansatzes from prior work. The contribution is self-contained as documentation of an open-source interface.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software library announcement paper. It introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5566 in / 1006 out tokens · 48714 ms · 2026-05-11T17:24:24.223074+00:00 · methodology


Forward citations

Cited by 45 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training Non-Differentiable Networks via Optimal Transport
cs.LG 2026-05 unverdicted novelty 8.0

PolyStep optimizes non-differentiable networks via forward-only polytope evaluations and optimal-transport barycentric updates, reaching 93.4% accuracy on hard-LIF spiking networks while outperforming gradient-free baselines.
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
cs.CR 2026-04 conditional novelty 8.0

A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
quant-ph 2026-05 unverdicted novelty 7.0

TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance
cs.LG 2026-05 unverdicted novelty 7.0

Learned policies with runtime Lyapunov shields achieve substantially higher communication intervals than baselines while maintaining stability on inverted pendulum, cart-pole, and quadrotor systems.
Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
cs.LG 2026-05 unverdicted novelty 7.0

SeqRejectron builds a stopping rule from a small set of validator policies to achieve horizon-free sample-complexity guarantees for selective imitation learning under arbitrary train-test dynamics shifts.
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
cs.AI 2026-05 unverdicted novelty 7.0

Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing
stat.ML 2026-05 unverdicted novelty 7.0

Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
cs.LG 2026-05 unverdicted novelty 7.0

SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
cs.LG 2026-05 unverdicted novelty 7.0

Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
Stable GFlowNets with Probabilistic Guarantees
cs.LG 2026-05 unverdicted novelty 7.0

Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
PACE: Parameter Change for Unsupervised Environment Design
cs.LG 2026-05 unverdicted novelty 7.0

PACE uses the squared L2 norm of policy parameter changes from a first-order approximation as an efficient proxy for environment value in UED, outperforming baselines with higher IQM and lower optimality gap on MiniGr...
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
cs.CL 2026-05 unverdicted novelty 7.0

Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
Your Loss is My Gain: Low Stake Attacks on Liquid Staking Pools
cs.GT 2026-05 unverdicted novelty 7.0

A low-stake adversary can degrade a liquid staking pool's performance via consensus manipulation and profit from the resulting drop in its LST value through application-layer financial positions.
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
cs.RO 2026-04 unverdicted novelty 7.0

KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

SpecRLBench is a new benchmark evaluating generalization of LTL-guided RL methods across navigation and manipulation domains with static/dynamic environments and varied robot dynamics.
Replay-buffer engineering for noise-robust quantum circuit optimization
quant-ph 2026-04 unverdicted novelty 7.0

Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compila...
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
stat.ML 2026-04 unverdicted novelty 7.0

High-order generator regression from multi-step trajectories yields a second-order accurate estimator for finite-horizon continuous-time policy evaluation that outperforms the Bellman baseline in calibration studies a...
GPU-Accelerated Continuous-Time Successive Convexification for Contact-Implicit Legged Locomotion
cs.RO 2026-04 unverdicted novelty 7.0

ci-SCvx introduces integral cross-complementarity constraints to continuous-time SCP for contact-implicit legged locomotion, achieving faster solves and lower energy in MuJoCo validation than MPC baselines.
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
cs.LG 2026-05 unverdicted novelty 6.0

An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.
Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias
cs.LG 2026-05 unverdicted novelty 6.0

Reflective Prompted Policy Optimization uses a Critic-LLM to inspect full trajectories and propose grounded revisions, yielding higher mean best rewards, faster near-optimal performance, and greater stability than sca...
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
cs.AI 2026-05 unverdicted novelty 6.0

Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.
Extending Differential Temporal Difference Methods for Episodic Problems
cs.LG 2026-05 unverdicted novelty 6.0

A generalization of differential TD extends it to episodic settings while preserving policy ordering, inheriting linear TD guarantees, and improving sample efficiency.
ANO: A Principled Approach to Robust Policy Optimization
cs.AI 2026-05 unverdicted novelty 6.0

ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...
Bridging the Gap Between Average and Discounted TD Learning
cs.LG 2026-05 unverdicted novelty 6.0

A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy
cs.LG 2026-05 unverdicted novelty 6.0

Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.
Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs
cs.LG 2026-05 unverdicted novelty 6.0

An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
stat.ML 2026-04 unverdicted novelty 6.0

High-order moment-matching estimation of the time-dependent generator improves continuous-time policy evaluation accuracy over first-order Bellman recursion by canceling lower-order truncation terms, with supporting e...
Efficient Federated RLHF via Zeroth-Order Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

Par-S²ZPO matches centralized RLHF sample complexity while converging faster in policy updates and outperforming FedAvg on MuJoCo tasks.
Gym-Anything: Turn any Software into an Agent Environment
cs.LG 2026-04 unverdicted novelty 6.0

Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improv...
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
cs.LG 2026-04 unverdicted novelty 6.0

FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
Temporal Logic Control of Nonlinear Stochastic Systems with Online Performance Optimization
eess.SY 2026-04 unverdicted novelty 6.0

A new interval MDP abstraction method generates a set of verified policies for temporal logic control of stochastic systems, allowing online performance optimization without losing probabilistic guarantees.
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
cs.LG 2026-03 unverdicted novelty 6.0

LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

Shielding the policy improvement process in offline RL yields policies that are safe with high probability while outperforming unshielded baselines in both average and worst-case performance, especially under limited data.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 5.0

The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

AdaGamma stabilizes state-dependent discounting in deep actor-critic RL by adding a return-consistency regularizer, delivering gains on continuous-control benchmarks and a real-world logistics A/B test.
Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark
cs.LG 2026-05 unverdicted novelty 5.0

MTG-Causal-RL is a new benchmark for causal RL using Magic: The Gathering with an explicit SCM, five archetypes, and CGFA-PPO agent showing competitive win rates plus diagnostic metrics.
Zero-Shot, Safe and Time-Efficient UAV Navigation via Potential-Based Reward Shaping, Control Lyapunov and Barrier Functions
eess.SY 2026-05 unverdicted novelty 5.0

PBRS-augmented RL trained in simple settings transfers zero-shot to complex UAV environments when wrapped with a CLF-CBF-QP safety filter, yielding shorter missions and formal safety guarantees.
A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations
cs.MA 2026-04 unverdicted novelty 5.0

A C++ Dec-POMDP simulator using data-oriented design and zero-copy PyTorch integration achieves up to 33 million steps per second on a 16-core CPU, enabling multi-agent policy training in minutes with PPO, DQN, and SAC.
RL-ABC: Reinforcement Learning for Accelerator Beamline Control
cs.LG 2026-04 unverdicted novelty 5.0

RL-ABC is a framework that formulates accelerator beamline tuning as a Markov decision process with a 57-dimensional state and configurable reward, enabling a DDPG agent to reach 70.3% particle transmission on a VEPP-...
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
cs.AI 2026-04 unverdicted novelty 5.0

RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters
cs.LG 2026-05 unverdicted novelty 4.0

SHAP analysis of RL algorithms and hyperparameters reveals consistent impact patterns that enable guided configuration selection for improved generalization in robotic environments.
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
cs.CV 2026-05 unverdicted novelty 3.0

This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
cs.RO 2026-04 unverdicted novelty 3.0

Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...
[COMP25] The Automated Negotiating Agents Competition (ANAC) 2025 Challenges and Results
cs.MA 2026-04 unverdicted novelty 3.0

ANAC 2025 evaluated negotiating agents in multi-deal and supply chain scenarios, analyzed performance, and suggested future research directions.

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · cited by 42 Pith papers · 5 internal anchors
