Gymnasium: A Standard Interface for Reinforcement Learning Environments
Pith reviewed 2026-05-11 17:24 UTC · model grok-4.3
The pith
Gymnasium supplies a standard API for reinforcement learning environments to enable interoperability across implementations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gymnasium is an open-source library that provides a standard API for RL environments. Its main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research.
What carries the argument
The standard API consisting of abstractions that support interoperability between any compatible environment and any training algorithm.
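The interoperability argument can be illustrated with a minimal sketch. It does not import Gymnasium; instead it assumes an interface of the same shape, where `reset` returns an `(observation, info)` pair and `step` returns `(observation, reward, terminated, truncated, info)`. The names `Env`, `CoinFlip`, `Corridor`, and `run_episode` are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

import random
from typing import Any, Protocol


class Env(Protocol):
    """Hypothetical stand-in for the shared interface (not Gymnasium's own class)."""

    def reset(self, seed: int | None = None) -> tuple[Any, dict]: ...
    def step(self, action: int) -> tuple[Any, float, bool, bool, dict]: ...


class CoinFlip:
    """Toy environment: guess a hidden coin; each episode lasts one step."""

    def __init__(self) -> None:
        self._rng = random.Random()
        self._coin = 0

    def reset(self, seed=None):
        if seed is not None:
            self._rng = random.Random(seed)
        self._coin = self._rng.randrange(2)
        return 0, {}  # the observation carries no information here

    def step(self, action):
        reward = 1.0 if action == self._coin else 0.0
        return 0, reward, True, False, {}  # terminated after a single guess


class Corridor:
    """Toy environment: walk from cell 0 to cell 3, capped at 10 steps."""

    def __init__(self) -> None:
        self._pos = 0
        self._steps = 0

    def reset(self, seed=None):
        self._pos, self._steps = 0, 0
        return self._pos, {}

    def step(self, action):
        self._steps += 1
        self._pos = max(0, self._pos + (1 if action == 1 else -1))
        terminated = self._pos == 3
        truncated = not terminated and self._steps >= 10
        return self._pos, (1.0 if terminated else 0.0), terminated, truncated, {}


def run_episode(env: Env, policy, seed: int | None = None) -> float:
    """Roll out one episode against any object exposing reset/step."""
    obs, _info = env.reset(seed=seed)
    total, done = 0.0, False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        done = terminated or truncated
    return total


def always_right(obs):
    """Trivial policy: always choose action 1."""
    return 1


# The same rollout code runs unchanged on two unrelated environments.
returns = [run_episode(env, always_right, seed=0) for env in (CoinFlip(), Corridor())]
```

The point of the sketch is that `run_episode` never inspects which environment it was given; any object honoring the two-method contract can be swapped in.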
If this is right
- Different research groups can run each other's environments and algorithms without rewriting code.
- Reproducibility checks become easier because the same environment definition works across labs.
- New algorithms can be tested on a wider set of environments with minimal additional effort.
- Customization tools allow rapid creation of variants while keeping the core interface intact.
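Two of the consequences listed above, seeded reproducibility and customization that leaves the core interface intact, can be sketched in plain Python. This is an illustration under assumptions, not the library's own wrapper or seeding machinery: `NoisyWalk`, `ScaleReward`, and `rollout` are hypothetical names, and the interface shape (`reset(seed=...)`, five-value `step`) is assumed to match the standard API described here.

```python
from __future__ import annotations

import random


class NoisyWalk:
    """Toy environment (hypothetical): a random walk driven by a seedable RNG."""

    def __init__(self) -> None:
        self._rng = random.Random()
        self._pos = 0
        self._steps = 0

    def reset(self, seed=None):
        if seed is not None:
            self._rng = random.Random(seed)  # reseed so the episode can be replayed
        self._pos, self._steps = 0, 0
        return self._pos, {}

    def step(self, action):
        self._steps += 1
        self._pos += action + self._rng.choice((-1, 0, 1))
        truncated = self._steps >= 5  # time limit only; no terminal state
        return self._pos, float(self._pos), False, truncated, {}


class ScaleReward:
    """Wrapper sketch: rescales rewards and forwards everything else unchanged."""

    def __init__(self, env, scale: float) -> None:
        self.env = env
        self.scale = scale

    def reset(self, seed=None):
        return self.env.reset(seed=seed)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return obs, reward * self.scale, terminated, truncated, info


def rollout(env, seed):
    """Return the (observation, reward) pairs of one seeded episode."""
    env.reset(seed=seed)
    trace, done = [], False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(1)
        trace.append((obs, reward))
        done = terminated or truncated
    return trace


base_a = rollout(NoisyWalk(), seed=42)
base_b = rollout(NoisyWalk(), seed=42)  # same seed, fresh instance
scaled = rollout(ScaleReward(NoisyWalk(), scale=0.5), seed=42)
```

Under these assumptions, `base_a` and `base_b` are identical because the seed fully determines the environment's randomness, and `scaled` visits the same observations with halved rewards because the wrapper exposes the same `reset`/`step` contract as the environment it wraps.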
Where Pith is reading between the lines
- Widespread use could turn the current collection of environments into a shared benchmark suite that new methods must beat.
- The same abstraction pattern might be copied for simulation interfaces in robotics or game AI beyond reinforcement learning.
- If adoption grows, older non-standard code bases may be wrapped once and then reused indefinitely under the new interface.
Load-bearing premise
That the reinforcement learning community will adopt this API broadly enough for the interoperability and reproducibility benefits to appear.
What would settle it
A count of recent RL papers that shows most new experiments still use custom or non-Gymnasium environment wrappers instead of the standard interface.
Original abstract
Reinforcement Learning (RL) is a continuously growing field that has the potential to revolutionize many areas of artificial intelligence. However, despite its promise, RL research is often hindered by the lack of standardization in environment and algorithm implementations. This makes it difficult for researchers to compare and build upon each other's work, slowing down progress in the field. Gymnasium is an open-source library that provides a standard API for RL environments, aiming to tackle this issue. Gymnasium's main feature is a set of abstractions that allow for wide interoperability between environments and training algorithms, making it easier for researchers to develop and test RL algorithms. In addition, Gymnasium provides a collection of easy-to-use environments, tools for easily customizing environments, and tools to ensure the reproducibility and robustness of RL research. Through this unified framework, Gymnasium significantly streamlines the process of developing and testing RL algorithms, enabling researchers to focus more on innovation and less on implementation details. By providing a standardized platform for RL research, Gymnasium helps to drive forward the field of reinforcement learning and unlock its full potential. Gymnasium is available online at https://github.com/Farama-Foundation/Gymnasium
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gymnasium, an open-source library that provides a standard API for reinforcement learning environments. Its core features are abstractions enabling interoperability between environments and training algorithms, a collection of environments, tools for environment customization, and tools to support reproducibility and robustness in RL research. The library aims to reduce implementation overhead and allow researchers to focus on innovation.
Significance. If the described API and tools see adoption, this is a significant practical contribution to the RL community by extending the standardization efforts of its predecessor (OpenAI Gym) with updated abstractions and reproducibility features. The open-source release with accompanying code and tools for robustness represents a concrete strength that can facilitate more comparable and reliable experiments across different environments and algorithms.
minor comments (2)
- The abstract repeats the theme of streamlining research and reducing implementation details in consecutive sentences; condensing this would improve conciseness and readability.
- The GitHub link is provided at the end of the abstract, but the manuscript would benefit from a brief dedicated paragraph or subsection early in the text describing installation, basic usage, and where to find documentation or example scripts.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our manuscript and their recommendation to accept. We appreciate the recognition that Gymnasium extends the standardization efforts of OpenAI Gym with updated abstractions and reproducibility features, providing a practical contribution to the RL community.
Circularity Check
No circularity: purely descriptive software release with no derivations or predictions
Full rationale
The manuscript is a software library announcement describing Gymnasium's API, environments, and reproducibility tools. It contains no equations, no fitted parameters, no predictions, and no derivation chain. The central claim is that the provided abstractions enable interoperability; this is presented as a design choice and implementation fact, not derived from prior results or self-citations. No load-bearing steps reduce to inputs by construction, and the paper does not invoke uniqueness theorems or ansatzes from prior work. The contribution is self-contained as documentation of an open-source interface.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 45 Pith papers
-
Training Non-Differentiable Networks via Optimal Transport
PolyStep optimizes non-differentiable networks via forward-only polytope evaluations and optimal-transport barycentric updates, reaching 93.4% accuracy on hard-LIF spiking networks while outperforming gradient-free baselines.
-
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
A new benchmark shows frontier LLMs achieve only 3.8% average recall identifying malicious events from raw logs and fail to meet 50% recall thresholds on most tactics.
-
TuniQ: Autotuning Compilation Passes for Quantum Workloads at Scale for Effectiveness and Efficiency
TuniQ uses RL with a dual-encoder, shaped rewards, and action masking to autotune quantum compilation passes, improving fidelity and speed over Qiskit while generalizing across backends and scaling to large circuits.
-
Learning When to Act: Communication-Efficient Reinforcement Learning via Run-Time Assurance
Learned policies with runtime Lyapunov shields achieve substantially higher communication intervals than baselines while maintaining stability on inverted pendulum, cart-pole, and quadrotor systems.
-
Learning When to Stop: Selective Imitation Learning Under Arbitrary Dynamics Shift
SeqRejectron builds a stopping rule from a small set of validator policies to achieve horizon-free sample-complexity guarantees for selective imitation learning under arbitrary train-test dynamics shifts.
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human approaches across 37 tasks and finds no single method dominates.
-
Beyond the Independence Assumption: Finite-Sample Guarantees for Deep Q-Learning under $\tau$-Mixing
Finite-sample risk bounds for DQN with ReLU networks are extended to τ-mixing data, showing an extra dimensionality penalty in the convergence rate due to dependence.
-
SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data
SOPE uses an actor-aligned OPE signal on a held-out validation split to dynamically stop offline stabilization phases in online RL, improving performance up to 45.6% and cutting TFLOPs up to 22x on 25 Minari tasks.
-
Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
-
Stable GFlowNets with Probabilistic Guarantees
Derives loss-to-TV bounds providing probabilistic guarantees for GFlowNets and introduces Stable GFlowNets algorithm for improved training stability and distributional fidelity.
-
PACE: Parameter Change for Unsupervised Environment Design
PACE uses the squared L2 norm of policy parameter changes from a first-order approximation as an efficient proxy for environment value in UED, outperforming baselines with higher IQM and lower optimality gap on MiniGr...
-
A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis
Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.
-
Your Loss is My Gain: Low Stake Attacks on Liquid Staking Pools
A low-stake adversary can degrade a liquid staking pool's performance via consensus manipulation and profit from the resulting drop in its LST value through application-layer financial positions.
-
KinDER: A Physical Reasoning Benchmark for Robot Learning and Planning
KinDER is a new open-source benchmark that demonstrates substantial gaps in current robot learning and planning methods for handling physical constraints.
-
SpecRLBench: A Benchmark for Generalization in Specification-Guided Reinforcement Learning
SpecRLBench is a new benchmark evaluating generalization of LTL-guided RL methods across navigation and manipulation domains with static/dynamic environments and varied robot dynamics.
-
Replay-buffer engineering for noise-robust quantum circuit optimization
Treating the replay buffer as a central lever in RL for quantum circuit optimization yields 4-32x sample efficiency gains, up to 67.5% faster episodes, and 85-90% fewer steps to accuracy on noisy molecular and compila...
-
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
High-order generator regression from multi-step trajectories yields a second-order accurate estimator for finite-horizon continuous-time policy evaluation that outperforms the Bellman baseline in calibration studies a...
-
GPU-Accelerated Continuous-Time Successive Convexification for Contact-Implicit Legged Locomotion
ci-SCvx introduces integral cross-complementarity constraints to continuous-time SCP for contact-implicit legged locomotion, achieving faster solves and lower energy in MuJoCo validation than MPC baselines.
-
SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning
SafeAdapt certifies a Rashomon set of safe policies from demonstration data and projects updates from arbitrary RL algorithms onto it to guarantee preservation of safety on source tasks.
-
Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs
An inexact augmented Lagrangian method with projected Q-ascent yields global last-iterate convergence guarantees for constrained MDP policy optimization, extending from tabular to log-linear and non-linear policies.
-
Reflective Prompted Policy Optimization: Trajectory-Grounded Revision and Salience Bias
Reflective Prompted Policy Optimization uses a Critic-LLM to inspect full trajectories and propose grounded revisions, yielding higher mean best rewards, faster near-optimal performance, and greater stability than sca...
-
Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Agentick is a new unified benchmark for sequential decision-making agents that evaluates RL, LLM, VLM, hybrid, and human agents across 37 tasks, showing no single approach dominates.
-
Extending Differential Temporal Difference Methods for Episodic Problems
A generalization of differential TD extends it to episodic settings while preserving policy ordering, inheriting linear TD guarantees, and improving sample efficiency.
-
ANO: A Principled Approach to Robust Policy Optimization
ANO derives a robust policy optimizer from geometric principles that replaces clipping with a smooth redescending gradient, showing better performance and stability than PPO, SPO, and GRPO in MuJoCo, Atari, and RLHF e...
-
Bridging the Gap Between Average and Discounted TD Learning
A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
-
Perturb and Correct: Post-Hoc Ensembles using Affine Redundancy
Perturb-and-Correct generates epistemically diverse predictors from a single pretrained network via hidden-layer perturbations followed by affine least-squares corrections that enforce agreement on calibration data.
-
Breaking the Computational Barrier: Provably Efficient Actor-Critic for Low-Rank MDPs
An actor-critic RL algorithm for low-rank MDPs achieves improved sample efficiency using solely a policy evaluation oracle.
-
Beyond Bellman: High-Order Generator Regression for Continuous-Time Policy Evaluation
High-order moment-matching estimation of the time-dependent generator improves continuous-time policy evaluation accuracy over first-order Bellman recursion by canceling lower-order truncation terms, with supporting e...
-
Efficient Federated RLHF via Zeroth-Order Policy Optimization
Par-S²ZPO matches centralized RLHF sample complexity while converging faster in policy updates and outperforming FedAvg on MuJoCo tasks.
-
Gym-Anything: Turn any Software into an Agent Environment
Gym-Anything turns arbitrary software into agent environments via multi-agent setup and auditing, creating CUA-World with 10K+ long-horizon tasks and showing that trajectory distillation plus test-time auditing improv...
-
FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control
FlashSAC scales up Soft Actor-Critic with fewer updates, larger models, higher data throughput, and norm bounds to deliver faster, more stable training than PPO on high-dimensional robot control tasks across dozens of...
-
Temporal Logic Control of Nonlinear Stochastic Systems with Online Performance Optimization
A new interval MDP abstraction method generates a set of verified policies for temporal logic control of stochastic systems, allowing online performance optimization without losing probabilistic guarantees.
-
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
-
Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
Shielding the policy improvement process in offline RL yields policies that are safe with high probability while outperforming unshielded baselines in both average and worst-case performance, especially under limited data.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
The paper organizes research on generalist game AI into Dataset, Model, Harness, and Benchmark pillars and charts a five-level progression from single-game mastery to agents that create and live inside game multiverses.
-
AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning
AdaGamma stabilizes state-dependent discounting in deep actor-critic RL by adding a return-consistency regularizer, delivering gains on continuous-control benchmarks and a real-world logistics A/B test.
-
Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark
MTG-Causal-RL is a new benchmark for causal RL using Magic: The Gathering with an explicit SCM, five archetypes, and CGFA-PPO agent showing competitive win rates plus diagnostic metrics.
-
Zero-Shot, Safe and Time-Efficient UAV Navigation via Potential-Based Reward Shaping, Control Lyapunov and Barrier Functions
PBRS-augmented RL trained in simple settings transfers zero-shot to complex UAV environments when wrapped with a CLF-CBF-QP safety filter, yielding shorter missions and formal safety guarantees.
-
A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations
A C++ Dec-POMDP simulator using data-oriented design and zero-copy PyTorch integration achieves up to 33 million steps per second on a 16-core CPU, enabling multi-agent policy training in minutes with PPO, DQN, and SAC.
-
RL-ABC: Reinforcement Learning for Accelerator Beamline Control
RL-ABC is a framework that formulates accelerator beamline tuning as a Markov decision process with a 57-dimensional state and configurable reward, enabling a DDPG agent to reach 70.3% particle transmission on a VEPP-...
-
RAMP: Hybrid DRL for Online Learning of Numeric Action Models
RAMP learns numeric action models online via a DRL-planning feedback loop and outperforms PPO on IPC numeric domains in solvability and plan quality.
-
Enhancing RL Generalizability in Robotics through SHAP Analysis of Algorithms and Hyperparameters
SHAP analysis of RL algorithms and hyperparameters reveals consistent impact patterns that enable guided configuration selection for improved generalization in robotic environments.
-
Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse
This work traces four eras of generalist game players across dataset, model, harness, and benchmark pillars and charts a five-level roadmap ending in agents that create and evolve within game multiverses.
-
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...
-
[COMP25] The Automated Negotiating Agents Competition (ANAC) 2025 Challenges and Results
ANAC 2025 evaluated negotiating agents in multi-deal and supply chain scenarios, analyzed performance, and suggested future research directions.