OpenAI Gym

Greg Brockman; Jie Tang; John Schulman; Jonas Schneider; Ludwig Pettersson; Vicki Cheung; Wojciech Zaremba

arXiv: 1606.01540 · v1 · submitted 2016-06-05 · cs.LG · cs.AI

Pith reviewed 2026-05-11 19:44 UTC · model grok-4.3

keywords: reinforcement learning · benchmark environments · common interface · algorithm comparison · toolkit · simulation benchmarks · result sharing · research infrastructure

The pith

A toolkit supplies benchmark problems for reinforcement learning through a shared interface along with a website for comparing algorithm results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a toolkit that contains a collection of benchmark problems, each exposing the same interface so that reinforcement learning agents can interact with environments in a consistent way. It pairs this with a website where researchers can share results and directly compare how different algorithms perform on those benchmarks. The work explains the toolkit's components and the design choices made during its creation. A reader would care because this structure could let researchers avoid repeatedly building custom test setups and instead focus on improving methods while seeing clear progress across the field.

Core claim

The central claim is that the toolkit, consisting of a growing collection of benchmark problems that expose a common interface and a website for sharing results, supports reinforcement learning research by enabling standardized testing and performance comparisons. The paper details the toolkit's components and the design decisions that shaped the software.

What carries the argument

The common interface that lets any reinforcement learning algorithm interact uniformly with the benchmark environments.
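
As a rough illustration of what such a common interface looks like, the sketch below defines a toy environment with reset and step methods and an agent loop that depends only on those two methods. CoinFlipEnv, run_episode, and the copy policy are hypothetical names invented for this example and are not part of the toolkit; the four-value return of step (observation, reward, done, info) follows the convention of the 2016 release and is an assumption about the interface, not a specification of it.

```python
import random
from typing import Any, Callable, Dict, Tuple


class CoinFlipEnv:
    """Hypothetical toy environment (not part of OpenAI Gym).

    The observation is the current coin face (0 or 1). The agent earns a
    reward of 1.0 when its action equals that face, otherwise 0.0.
    """

    def __init__(self, episode_length: int = 10, seed: int = 0) -> None:
        self.episode_length = episode_length
        self._rng = random.Random(seed)
        self._t = 0
        self._coin = 0

    def reset(self) -> int:
        """Start a new episode and return the first observation."""
        self._t = 0
        self._coin = self._rng.randint(0, 1)
        return self._coin

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        """Apply an action; return (observation, reward, done, info)."""
        reward = 1.0 if action == self._coin else 0.0
        self._t += 1
        self._coin = self._rng.randint(0, 1)  # next observation
        done = self._t >= self.episode_length
        return self._coin, reward, done, {}


def run_episode(env: Any, policy: Callable[[int], int]) -> float:
    """Run one episode using only reset() and step(); return total reward."""
    observation = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        action = policy(observation)
        observation, reward, done, _info = env.step(action)
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    # A policy that copies the observation is always rewarded here.
    print(run_episode(CoinFlipEnv(episode_length=10), lambda obs: obs))
```

Because run_episode touches only reset and step, the same loop would work unchanged with any other environment that follows this convention, which is the property that algorithm comparisons rely on.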

Load-bearing premise

That providing a common interface for environments plus a platform for sharing results will be sufficient to drive progress and fair comparisons in reinforcement learning.

What would settle it

Track whether new reinforcement learning papers begin using the toolkit's environments for evaluation and posting comparable results on the shared website; sustained low adoption would indicate the standardization has not taken hold.

Original abstract

OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript is a whitepaper introducing OpenAI Gym as a toolkit for reinforcement learning research. It describes a growing collection of benchmark environments that share a common interface, a website for sharing results to enable comparison of algorithms, and the software components along with the design decisions that shaped the implementation.

Significance. If the described components are delivered as stated, the work provides a standardized, open-source platform that lowers barriers for RL experimentation and supports reproducible benchmarking across the community. The emphasis on a common interface and public result sharing directly addresses fragmentation in RL evaluation practices.

minor comments (2)

The description of the environment interface in the components section would benefit from an explicit listing of the core methods (e.g., reset, step, render) with their signatures to aid immediate implementation by readers.
A brief note on the versioning or release process for the benchmark collection would clarify how new environments are added while maintaining backward compatibility.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the OpenAI Gym whitepaper and the recommendation to accept. The referee's summary accurately captures the toolkit's purpose, the common interface for environments, the results-sharing website, and the discussion of design decisions.

Circularity Check

0 steps flagged

No circularity: purely descriptive whitepaper with no derivation chain

full rationale

The manuscript is a software whitepaper that describes the OpenAI Gym toolkit, its environments, common interface, and result-sharing website. It contains no equations, no fitted parameters, no predictions, no formal derivations, and no load-bearing claims that reduce to self-referential inputs. The central content is expository documentation of design choices and released code; the reader's noted assumption about real-world representativeness is not used as a premise for any quantitative or derivational result. No self-citations or ansatzes are invoked in a manner that could create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software toolkit description paper containing no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5338 in / 994 out tokens · 65104 ms · 2026-05-11T19:44:10.330352+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering
cs.SE 2025-07 conditional novelty 8.0

AIDev is a new open dataset of 456k AI-agent pull requests showing agents submit code faster than humans but with lower acceptance rates and simpler changes.
BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation
cs.RO 2024-03 accept novelty 8.0

BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.
Decision Transformer: Reinforcement Learning via Sequence Modeling
cs.LG 2021-06 accept novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
Stochastic MeanFlow Policies: One-Step Generative Control with Entropic Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

SMFP introduces a one-step generative policy class using MeanFlow to map noise to actions, providing a tractable entropy surrogate for unified off-policy mirror descent training that outperforms Gaussian and generativ...
Proximal State Nudging: Reducing Skill Atrophy from AI Assistance
cs.RO 2026-05 unverdicted novelty 7.0

Proximal State Nudging (PSN) jointly optimizes skill development and task performance in shared autonomy, outperforming baselines in LunarLander simulation and yielding up to 7x larger unassisted skill gains with 50% ...
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO 2026-05 unverdicted novelty 7.0

ARC-RL provides four new MuJoCo continuous-control environments with hexapod and quadruped morphologies inspired by ARC Raiders, a unified multi-component reward without motion capture, CPG expert demonstrators, and e...
Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation
cs.LG 2026-05 unverdicted novelty 7.0

RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
NeuroTrain: Surveying Local Learning Rules for Spiking Neural Networks with an Open Benchmarking Framework
cs.NE 2026-05 unverdicted novelty 7.0

A taxonomy of SNN training algorithms is presented with the release of NeuroTrain, an open benchmarking framework for reproducible comparisons across datasets and architectures.
Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling
cs.LG 2026-05 unverdicted novelty 7.0

DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.
Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry
cs.LG 2026-05 unverdicted novelty 7.0

MSRL represents trajectory segments as PSD matrices to prove additive composition properties and bootstrap value functions for better transfer, reaching 0.73 AUC versus 0.57-0.65 baselines.
IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback
cs.LG 2026-05 unverdicted novelty 7.0

IGT-OMD reduces gradient transport error from quadratic to linear in delay length for delayed bilevel optimization and achieves sublinear regret with adaptive steps.
gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods
cs.LG 2026-05 unverdicted novelty 7.0

gym-invmgmt is a new benchmarking framework that evaluates inventory policies across optimization and learning methods, finding stochastic programming strongest among non-oracle approaches and PPO-Transformer best amo...
Revisiting Mixture Policies in Entropy-Regularized Actor-Critic
cs.LG 2026-05 unverdicted novelty 7.0

A new marginalized reparameterization estimator allows low-variance training of mixture policies in entropy-regularized actor-critic algorithms, matching or exceeding Gaussian policy performance in several continuous ...
Operator-Guided Invariance Learning for Continuous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

VPSD-RL discovers exact and approximate value-preserving Lie-group operators in continuous RL to stabilize learning via transition augmentation and consistency regularization.
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
cs.LG 2026-05 unverdicted novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents
cs.AI 2026-05 unverdicted novelty 7.0

EO-Gym supplies an executable multimodal environment and 9k-trajectory benchmark that turns Earth Observation into a tool-using, multi-step reasoning task, revealing that current VLMs struggle on temporal and cross-se...
Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks
cs.AI 2026-04 unverdicted novelty 7.0

COSPLAY co-evolves an LLM decision agent with a skill bank agent to improve long-horizon game performance, reporting over 25.1% average reward gains versus frontier LLM baselines on single-player benchmarks.
Hierarchical Active Inference using Successor Representations
cs.LG 2026-04 unverdicted novelty 7.0

A hierarchical active inference framework using successor representations learns abstract states and actions to enable efficient planning on navigation and reinforcement learning tasks.
Flow Gym: A framework for the development, benchmarking, training, and deployment of flow-field quantification methods
physics.flu-dyn 2025-12 accept novelty 7.0

Flow Gym supplies a JAX-based framework with standardized interfaces, modular components, and utilities to develop, benchmark, train, and deploy flow-field quantification methods such as PIV on both synthetic and expe...
Adaptive Ensemble Aggregation for Actor-Critics
cs.LG 2025-07 unverdicted novelty 7.0

AEA dynamically aggregates ensembles in off-policy actor-critics from training dynamics, with proofs of convergence to an error-minimizing equilibrium, bias shrinkage with ensemble size, and monotonic policy improvement.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning
cs.RO 2025-06 unverdicted novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
A Generalist Agent
cs.AI 2022-05 accept novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
Mitigating Value Hallucination in Dyna Planning via Multistep Predecessor Models
cs.LG 2020-06 unverdicted novelty 7.0

Introduces multistep predecessor models for Dyna planning to mitigate value hallucination by avoiding real-state updates from simulated values.
Dota 2 with Large Scale Deep Reinforcement Learning
cs.LG 2019-12 accept novelty 7.0

OpenAI Five achieved superhuman performance in Dota 2 by defeating the world champions using scaled self-play reinforcement learning.
Benchmarking Model-Based Reinforcement Learning
cs.LG 2019-07 accept novelty 7.0

Introduces a benchmark suite of over 18 MBRL environments, evaluates multiple algorithms under consistent settings, and identifies three core challenges: dynamics bottleneck, planning horizon dilemma, and early-termin...
Learning the Arrow of Time
cs.LG 2019-07 unverdicted novelty 7.0

Introduces a learned arrow of time in MDPs that aligns with the Jordan-Kinderlehrer-Otto notion for stochastic processes and enables practical RL utilities like reachability and side-effect detection.
Exploring Model-based Planning with Policy Networks
cs.LG 2019-06 unverdicted novelty 7.0

POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
Soft Actor-Critic Algorithms and Applications
cs.LG 2018-12 unverdicted novelty 7.0

SAC extends maximum-entropy RL into a stable off-policy actor-critic method with constrained temperature tuning, outperforming prior algorithms in sample efficiency and consistency on locomotion and manipulation tasks.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
cs.LG 2018-01 accept novelty 7.0

Soft Actor-Critic is an off-policy maximum-entropy actor-critic algorithm that achieves state-of-the-art performance and high stability on continuous control benchmarks.
Proximal Policy Optimization Algorithms
cs.LG 2017-07 accept novelty 7.0

A clipped surrogate objective L^CLIP = E[min(r_t A_t, clip(r_t, 1-ε, 1+ε) A_t)] enables multi-epoch minibatch policy updates with TRPO-like stability but first-order optimization.
Deep reinforcement learning from human preferences
stat.ML 2017-06 accept novelty 7.0

Reinforcement learning agents solve complex tasks without access to the reward function by training a reward predictor from human comparisons of trajectory segments, requiring feedback on less than 1% of interactions.
Reflex: Reinforcement Learning with Reflection Symmetry Exploitation in State-Based Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

Reflex formalizes axial and bilateral reflection symmetries and adds symmetry regularization to PPO and SAC, reporting better performance and sample efficiency on Gym and DMC benchmarks.
ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders
cs.RO 2026-05 accept novelty 6.0

ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical c...
DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization
cs.LG 2026-05 unverdicted novelty 6.0

DiPRL trains nearly discrete programmatic policies in RL by adding architecture entropy regularization to gradient-based optimization, avoiding performance collapse from post-hoc discretization.
Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making
cs.LG 2026-05 unverdicted novelty 6.0

Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.
R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning
cs.LG 2026-05 unverdicted novelty 6.0

R2R2 introduces a non-centered regularization objective for SPL that addresses conflicts with spectral properties, leading to better performance on continuous control tasks at high UTD ratios.
CA2: Code-Aware Agent for Automated Game Testing
cs.SE 2026-05 unverdicted novelty 6.0

CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Introduces RAPCs and a contraction Bellman operator that jointly enforce probabilistic reach-avoid constraints while minimizing expected costs in stochastic RL, with almost-sure convergence to local optima.
Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Introduces RAPCs and a contraction Bellman operator for cost-optimal policies that satisfy probabilistic reach-avoid specifications in stochastic MDPs, with almost-sure convergence to local optima.
Debiased Model-based Representations for Sample-efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 6.0

DR.Q debiases model-based representations for Q-learning by maximizing mutual information between state-action and next-state representations and applying faded prioritized experience replay, achieving competitive or ...
Policy Gradient Methods for Non-Markovian Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Introduces the Agent State-Markov Policy Gradient (ASMPG) algorithm and a policy gradient theorem for non-Markovian decision processes by jointly optimizing agent state dynamics and control policy.
Actor-Critic Algorithm for Dynamic Expectile and CVaR
cs.LG 2026-05 unverdicted novelty 6.0

A model-free off-policy actor-critic algorithm is constructed for dynamic expectile and CVaR using a surrogate policy gradient without transition perturbation and elicitability-based value learning, with empirical out...
BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

BehaviorGuard detects backdoor behaviors in DRL policies via behavioral drift in action distributions and suppresses suspicious actions at runtime, claimed as the first online defense for both single- and multi-agent ...
Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
Towards Real-time Control of a CartPole System on a Quantum Computer
quant-ph 2026-05 unverdicted novelty 6.0

A single-qubit quantum reinforcement learning agent solves CartPole faster than classical networks and quantifies shot-count versus control-frequency requirements for real-time closed-loop control on NISQ hardware, in...
Distributional Reinforcement Learning via the Cramér Distance
cs.LG 2026-04 unverdicted novelty 6.0

C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.
Scalable Neighborhood-Based Multi-Agent Actor-Critic
cs.LG 2026-04 unverdicted novelty 6.0

MADDPG-K scales centralized critics in multi-agent RL by limiting each critic to k-nearest neighbors under Euclidean distance, yielding constant input size and competitive performance.
Distributional Off-Policy Evaluation with Deep Quantile Process Regression
stat.ML 2026-04 unverdicted novelty 6.0

DQPOPE estimates the entire return distribution in off-policy evaluation via deep quantile process regression, providing statistical advantages over standard single-value methods with equivalent sample sizes.
Policy-Invisible Violations in LLM-Based Agents
cs.AI 2026-04 unverdicted novelty 6.0

LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.
Infernux: A Python-Native Game Engine with JIT-Accelerated Scripting
cs.GR 2026-04 unverdicted novelty 6.0

Infernux is a game engine that uses batch data bridging and Numba JIT to make Python scripting performant within a Vulkan C++ core.
Bridging Natural Language and Microgrid Dynamics: A Context-Aware Simulator and Dataset
eess.SY 2026-04 unverdicted novelty 6.0

OpenCEM is the first open-source digital twin that integrates unstructured contextual information with quantitative microgrid dynamics to enable context-aware energy management.
HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents
cs.AI 2026-03 unverdicted novelty 6.0

HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.
Chasing Ghosts: A Simulation-to-Real Olfactory Navigation Stack with Optional Vision Augmentation
cs.RO 2026-02 unverdicted novelty 6.0

A simulation-to-real navigation policy enables a quadrotor to locate an odor source using only basic olfaction sensors and optional vision, validated in indoor real-world flights.
The Role of Learning in Attacking ML-based Network Intrusion Detection
cs.CR 2026-02 unverdicted novelty 6.0

Reinforcement learning agents achieve up to 58.1% attack success on ML network intrusion detectors at 0.31 ms per attack, delivering over 1000X higher throughput than gradient-based methods while working directly on n...
Learning to Plan, Planning to Learn: Adaptive Hierarchical RL-MPC for Sample-Efficient Decision Making
cs.LG 2025-12 unverdicted novelty 6.0

An adaptive RL-MPC framework uses RL to inform MPPI sampling and aggregates MPPI samples for value estimation, delivering up to 72% higher success rates and 2.1x faster convergence on tasks like race driving and Lunar...
Reinforcement Learning-based Control via Y-wise Affine Neural Networks (YANNs)
eess.SY 2025-08 unverdicted novelty 6.0

YANN-RL initializes RL actor and critic networks with explicit multi-parametric linear MPC solutions via YANNs to start from linear optimal control performance and then learn nonlinear policies through online interaction.
Synthetic POMDPs to Challenge Memory-Augmented RL: Memory Demand Structure Modeling
cs.AI 2025-08 unverdicted novelty 6.0

Presents MDS framework, linear-dynamics construction method, and tunable synthetic POMDP suite for controlled testing of memory-augmented reinforcement learning.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · cited by 115 Pith papers

[1]

Dynamic programming and optimal control

Dimitri P. Bertsekas. Dynamic programming and optimal control. Athena Scientific, Belmont, MA, 1995

work page 1995
[2]

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015

work page 2015
[3]

Trust region policy optimization

J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In ICML, pages 1889–1897, 2015

work page 2015
[4]

Asynchronous Methods for Deep Reinforcement Learning

Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy P Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016

work page Pith review arXiv 2016
[5]

M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The Arcade Learning Environment: An evaluation platform for general agents. J. Artif. Intell. Res., 47:253–279, 2013

work page 2013
[6]

Benchmarking deep reinforcement learning for continuous control

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. arXiv preprint arXiv:1604.06778, 2016

work page arXiv 2016
[7]

RLPy: A value-function-based reinforcement learning framework for education and research

A. Geramifard, C. Dann, R. H. Klein, W. Dabney, and J. P. How. RLPy: A value-function-based reinforcement learning framework for education and research. J. Mach. Learn. Res., 16:1573–1578, 2015

work page 2015
[8]

RL-Glue: Language-independent software for reinforcement-learning experiments

B. Tanner and A. White. RL-Glue: Language-independent software for reinforcement-learning experiments. J. Mach. Learn. Res., 10:2133–2136, 2009

work page 2009
[9]

PyBrain

T. Schaul, J. Bayer, D. Wierstra, Y. Sun, M. Felder, F. Sehnke, T. Rückstieß, and J. Schmidhuber. PyBrain. J. Mach. Learn. Res., 11:743–746, 2010

work page 2010
[10]

RLLib: Lightweight standard and on/off policy reinforcement learning library (C++)

S. Abeyruwan. RLLib: Lightweight standard and on/off policy reinforcement learning library (C++). http://web.cs.miami.edu/home/saminda/rilib.html, 2013

work page 2013
[11]

The reinforcement learning competition 2014

Christos Dimitrakakis, Guangliang Li, and Nikolaos Tziortziotis. The reinforcement learning competition 2014. AI Magazine, 35(3):61–65, 2014

work page 2014
[12]

R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998

work page 1998
[13]

Pachi: State of the art open source go program

Petr Baudiš and Jean-loup Gailly. Pachi: State of the art open source go program. In Advances in Computer Games, pages 24–38. Springer, 2011

work page 2011
[14]

Mujoco: A physics engine for model-based control

Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012

work page 2012
[15]

ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning

Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaśkowski. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. arXiv preprint arXiv:1605.02097, 2016

work page Pith review arXiv 2016
