Physical Atari: A Robust and Accessible Platform for Real-time Reinforcement Learning on Robots

Gloria Kennickell; John Carmack; Joseph Modayil; Khurram Javed; Richard S. Sutton

arxiv: 2606.19357 · v1 · pith:CGEOL6Z3new · submitted 2026-05-29 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-28 21:56 UTC · model grok-4.3

keywords: reinforcement learning · physical robotics · Atari · distribution shift · on-device adaptation · robot controller · real-time learning · hardware platform

The pith

Reinforcement learning algorithms can learn directly on physical robots using an affordable Atari controller platform, but even small distribution shifts between learning and deployment significantly degrade policy performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Physical Atari, a system built around the Robotroller that presses buttons on a real Atari controller and the Atari Devbox that displays game frames and reward signals. The full setup uses an off-the-shelf camera and computer to run reinforcement learning experiments entirely in the physical world. Experiments with this platform confirm that policies can be trained on-robot yet show clear drops in performance from even minor differences between the training conditions and later use. This matters because it indicates that adaptation while the robot is operating is needed to maintain good results instead of relying on separate training environments.

Core claim

The Physical Atari system, consisting of the Robotroller and Atari Devbox together with a camera and desktop computer, forms a robust and accessible platform for studying reinforcement learning in the physical world. The authors used it to validate that reinforcement learning algorithms can learn directly on robots. They also show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies, which underscores the importance of on-device adaptation for strong performance on robots.

What carries the argument

The Robotroller, a bearing-mounted servo actuator for the Atari CX40+ controller with high-frequency state monitoring software to limit stress, combined with the Atari Devbox that renders game frames and rewards.
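
The summary does not say how the monitoring software intervenes, so the following is only a minimal sketch of one plausible rule, assuming a servo that reports its measured position and a normalized load. The `ServoState` fields, the `max_load` threshold, and the hold-position response are all hypothetical and are not taken from the Robotroller software.

```python
from dataclasses import dataclass


@dataclass
class ServoState:
    position: float  # measured angle in degrees (hypothetical units)
    load: float      # normalized load estimate in [0, 1] (hypothetical)


def safe_target(state: ServoState, target: float, max_load: float = 0.6) -> float:
    """Return the angle to command on one monitoring tick.

    If the measured load is within the limit, pass the requested target
    through unchanged. If the servo is overloaded, command the measured
    position instead so that it stops pushing against the obstruction.
    """
    if state.load <= max_load:
        return target
    return state.position


# Example tick values (made up): a lightly loaded servo follows the request,
# an overloaded one is told to hold where it currently is.
print(safe_target(ServoState(position=10.0, load=0.2), target=30.0))
print(safe_target(ServoState(position=10.0, load=0.9), target=30.0))
```

In a real system a check like this would run in a loop at a much higher rate than the agent's action rate, which is the sense of "high-frequency" assumed here.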

If this is right

Reinforcement learning algorithms can train policies directly on physical robots using this platform.
Small distribution shifts between learning and deployment cause significant degradation in policy performance.
On-device adaptation is required for strong performance on robots.
The platform has sustained weeks of non-stop operation without mechanical failures.
The full system can be built for under $1000 using off-the-shelf and 3D-printed parts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same low-cost physical interface approach could support direct learning on other button-based or controller-driven tasks.
Isolating hardware variability from distribution shift would require additional controlled comparisons not reported here.
Extending the platform to multiple robots could test whether adaptation needs scale with hardware differences.

Load-bearing premise

The performance drop is caused by distribution shift between learning and deployment rather than by hardware variability, camera noise, or servo inconsistencies in the robot itself.

What would settle it

Running the same learning and deployment phases with every measurable hardware state, lighting, and camera condition held identical and checking whether policy performance still degrades.
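
A minimal sketch of how such a comparison could be summarized, assuming per-episode scores are logged for a control condition (everything held identical) and for a shifted condition. The function name, the choice of a percentile bootstrap, and the example scores are illustrative assumptions; none of them come from the paper.

```python
import random
from statistics import mean


def bootstrap_diff_ci(control, shifted, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for mean(control) - mean(shifted)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each condition with replacement, keeping its sample size.
        rc = [rng.choice(control) for _ in control]
        rs = [rng.choice(shifted) for _ in shifted]
        diffs.append(mean(rc) - mean(rs))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Made-up episode scores, for illustration only.
control_scores = [10, 12, 11, 13, 12]
shifted_scores = [5, 6, 4, 5, 6]
low, high = bootstrap_diff_ci(control_scores, shifted_scores)
print(f"95% CI for the score drop: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero in the shifted comparison, but not when the control condition is compared against a repeat of itself, would point toward the shift rather than run-to-run hardware variation.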

Figures

Figures reproduced from arXiv: 2606.19357 by Gloria Kennickell, John Carmack, Joseph Modayil, Khurram Javed, Richard S. Sutton.

**Figure 2.** The components of the Robotroller that are 3D printed. On a consumer Bambu Lab P1S printer, all these parts can be printed in around 12 hours.

**Figure 3.** The components of the Robotroller that have to be purchased. These include screws, bearings, servos, electronics, and threaded inserts. The total cost of the parts is around $400. The Robotroller consists of two types of parts. The first type is custom parts designed in CAD software (Autodesk Fusion) that have to be manufactured. These are shown in ...

**Figure 4.** The Robotroller is a robot that reliably actuates an unmodified CX40+ controller using ...

**Figure 5.** The key to making the Robotroller reliable is the design that uses bearings for all move...

**Figure 6.** The Atari Devbox is a device that renders Atari 2600 games at 60 FPS, listens to the ...

**Figure 7.** We measured the end-to-end response time of the Physical Atari platform. The response ...

**Figure 8.** A photo of three Physical Atari setups. This design allows each convolutional filter to condition on the same recent-action summary at every spatial location. In this respect, the architecture differs from DQN (Mnih et al., 2015) and its variants, where the network outputs a value for each action instead of receiving the action history as part of its input. As a result, our value estimator is explicitly co…

**Figure 9.** Average reward over time during real-time reinforcement learning on six Atari games. At ...

**Figure 10.** We evaluated policies that learned with 6 hours of experience twice, once on the robot ...

**Figure 11.** In another experiment, we switched the body of the robot after 6 hours of learning. The ...
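
The text attached to Figure 8 says that each convolutional filter conditions on the same recent-action summary at every spatial location. One common way to get that behavior is to append the summary vector to the channels of every pixel before the first convolution. The sketch below assumes that approach and uses plain Python lists. The helper names and the one-hot encoding of recent actions are hypothetical, since the excerpt does not say how the summary is computed.

```python
from typing import List

Frame = List[List[List[float]]]  # indexed as [row][col][channel]


def one_hot_history(recent_actions: List[int], num_actions: int) -> List[float]:
    """Hypothetical summary: one-hot codes of the recent actions, concatenated."""
    summary: List[float] = []
    for action in recent_actions:
        summary.extend(1.0 if i == action else 0.0 for i in range(num_actions))
    return summary


def append_action_planes(frame: Frame, action_summary: List[float]) -> Frame:
    """Append the same summary vector to the channels of every pixel."""
    return [[pixel + list(action_summary) for pixel in row] for row in frame]


# A tiny 2x2 single-channel frame and the two most recent actions (made up).
frame = [[[0.0], [0.5]], [[1.0], [0.25]]]
summary = one_hot_history(recent_actions=[1, 0], num_actions=3)
augmented = append_action_planes(frame, summary)
print(len(augmented[0][0]))  # 1 image channel + 6 action channels
```

Because every pixel carries the same extra channels, any filter sliding over the augmented frame sees the action summary regardless of where it is applied. In a DQN-style head, by contrast, actions appear only as indices into the output vector.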

Original abstract

We built a robot called the Robotroller that actuates an Atari CX40+ controller and a device called the Atari Devbox that renders the game frame and the reward signal from the Arcade Learning Environment on a screen. The Robotroller and the Atari Devbox, together with an off-the-shelf camera and a desktop computer, constitute a system that can be used to study reinforcement learning algorithms in the physical world. We call the full system Physical Atari. In this paper, we detail the key decisions that make Physical Atari a robust and accessible platform. To make the system robust, we designed the Robotroller so that all movement is done through bearings, which reduces wear. Additionally, we wrote software that monitors the state of the servos at a high frequency and intervenes to limit stress. To make the system accessible, we used affordable off-the-shelf components and parts that can be manufactured using consumer 3D printers. Physical Atari can be built for under $1,000 and has been used for weeks of non-stop reinforcement learning experiments without any mechanical failures. We used it to validate that reinforcement learning algorithms can learn directly on robots and show that even small distribution shifts between learning and deployment can significantly degrade the performance of policies. Our results underscore the importance of on-device adaptation for strong performance on robots.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Physical Atari is a replicable low-cost robot platform for running RL on real Atari hardware, but the distribution-shift results need tighter controls to separate them from hardware noise.

The main takeaway is that this paper gives a concrete, buildable system for physical reinforcement learning on Atari games using a 3D-printed robot actuator and a custom rendering box.

The Robotroller moves the CX40+ controller through bearings to reduce wear, and the Atari Devbox renders frames and rewards from the Arcade Learning Environment on a screen. A standard camera provides observations, and everything runs on a desktop computer. The design keeps costs under $1,000 with off-the-shelf parts and consumer 3D printers. Software monitors servo states at high frequency and intervenes to limit stress. The authors report that the setup has run weeks of continuous experiments without mechanical failures.

The paper does a clear job spelling out why these choices matter for robustness and accessibility. It shows RL agents can learn directly on the physical hardware and that small changes between training and test conditions hurt policy performance, which supports the case for on-device adaptation.

The softer spot is the experimental support. The abstract states the learning and distribution-shift results but gives no success rates, trial counts, variance numbers, or exclusion rules. The confound concern carries weight here: servo position drift, changes in bearing friction, changes in camera angle or lighting, or reward noise from the Devbox could each produce the same performance drops without the intended distribution shift being the cause. If the full paper lacks ablations that hold hardware state fixed while varying only the shift, that part of the argument stays unconvincing.

This is for RL and robotics researchers who want a simple real-world testbed for classic control tasks. Anyone testing adaptation methods or sim-to-real ideas could use the platform directly.

It deserves peer review. The hardware description and build instructions are a usable contribution even if the results section needs more statistical detail.

Referee Report

2 major / 1 minor

Summary. The manuscript describes Physical Atari, a physical platform for real-time RL consisting of the Robotroller (a bearing-based servo actuator for the Atari CX40+ controller) and Atari Devbox (which renders ALE game frames and reward signals on-screen). Combined with an off-the-shelf camera and desktop computer, the system is presented as robust (bearings reduce wear; high-frequency servo monitoring limits stress) and accessible (under $1000 using consumer 3D-printed and off-the-shelf parts). The paper claims the platform has supported weeks of continuous RL experiments without mechanical failure and reports experiments showing that RL algorithms can learn directly on the physical robot while small distribution shifts between training and deployment significantly degrade policy performance, underscoring the value of on-device adaptation.

Significance. If the reported experiments are supported by quantitative data and controls, the platform would provide an affordable, reproducible testbed for studying physical RL issues such as distribution shift, potentially enabling broader empirical work on on-robot learning that is currently hindered by hardware barriers.

major comments (2)

[Abstract] The claim that experiments 'validated learning on robots and distribution-shift effects' is unsupported by any quantitative results, error bars, trial counts, or data-exclusion criteria. This absence prevents verification that the data actually support the central empirical claim.
[Experiments section] (including design claims) The attribution of performance degradation to distribution shift between learning and deployment is not isolated from potential Robotroller hardware confounds. No trial-to-trial statistics on servo state, bearing friction, camera variability, or reward noise are reported, nor are ablations or statistical tests separating these factors from the intended shift variable.
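
As an illustration of the kind of statistical test the second comment asks for, here is a minimal sketch of a two-sided permutation test on the difference in mean episode score between two conditions. It assumes per-episode scores were logged for each condition. The function name and the example numbers are hypothetical, and nothing here reflects an analysis the authors ran.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled scores to the two conditions.
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(mean(pa) - mean(pb)) >= observed - 1e-12:
            extreme += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (extreme + 1) / (n_perm + 1)


# Made-up scores: the same body re-evaluated vs. a swapped body.
same_body = [10, 11, 12, 13, 14]
swapped_body = [1, 2, 3, 4, 5]
print(permutation_p_value(same_body, swapped_body))
```

Running the same test on two repeated evaluations with unchanged hardware gives a baseline for how much scores move without any intended shift.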

minor comments (1)

[Figures] Figure captions and system diagrams would benefit from explicit labels for all components (e.g., servo mounting, camera placement) to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review of our manuscript on Physical Atari. We address the two major comments point by point below, indicating where revisions are planned.

Referee: [Abstract] The claim that experiments 'validated learning on robots and distribution-shift effects' is unsupported by any quantitative results, error bars, trial counts, or data-exclusion criteria. This absence prevents verification that the data actually support the central empirical claim.

Authors: We agree that the abstract claim would be more verifiable with explicit quantitative support. The experiments section describes successful on-robot learning and the impact of distribution shifts on policy performance, but we will revise the abstract to reference specific results (including trial counts, performance metrics, and variability measures) and add the requested details to the experiments section in the revised manuscript.

Revision: yes

Referee: [Experiments section] (including design claims) The attribution of performance degradation to distribution shift between learning and deployment is not isolated from potential Robotroller hardware confounds. No trial-to-trial statistics on servo state, bearing friction, camera variability, or reward noise are reported, nor are ablations or statistical tests separating these factors from the intended shift variable.

Authors: The Robotroller design uses bearings for all motion and high-frequency servo monitoring to limit stress, choices intended to reduce mechanical variability; the system completed weeks of continuous operation without failure. These features provide supporting evidence for robustness, but we did not record or analyze the specific trial-to-trial statistics on servo state, friction, camera variability, or reward noise, nor conduct the suggested ablations or statistical tests, because the experimental focus was on RL policy behavior rather than hardware metrology.

Revision: no

standing simulated objections not resolved

No trial-to-trial statistics on servo state, bearing friction, camera variability, or reward noise are available, and no ablations or statistical tests isolate distribution shift from hardware factors, because these data were not collected during the reported experiments.

Circularity Check

0 steps flagged

No circularity: hardware platform and empirical results have no derivation chain

The paper is a description of a physical robot platform (Robotroller + Atari Devbox) for running RL experiments, with claims resting on mechanical construction details, cost, runtime reliability, and reported performance under distribution shifts. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text or abstract. The central empirical observation (performance degradation under shifts) is presented as a direct experimental outcome rather than a mathematical reduction; any concerns about unmeasured hardware confounds are issues of experimental isolation, not circularity in a derivation. The work is self-contained as a platform report with no load-bearing steps that reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the platform rests on engineering assumptions about mechanical durability and servo monitoring rather than mathematical axioms or new physical entities.

axioms (1)

Domain assumption: Bearings and high-frequency servo monitoring software are sufficient to prevent mechanical wear and stress during weeks of continuous operation.
Stated as the basis for robustness in the abstract.

