Learning to Race in Minutes: Infoprop Dyna on the Mini Wheelbot

Devdutt Subhasish; Henrik Hose; Sebastian Trimpe

arXiv: 2605.01096 · v1 · submitted 2026-05-01 · cs.LG · cs.RO

Pith reviewed 2026-05-09 19:00 UTC · model grok-4.3

keywords: model-based reinforcement learning · real-world robotics · unicycle robot · uncertainty-aware RL · robot racing · underactuated control

The pith

The Mini Wheelbot learns to race around a track within 11 minutes using an uncertainty-aware model-based RL method directly in the real world.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to establish that an unstable unicycle robot can acquire a racing skill solely through real-world interactions. It applies Infoprop Dyna to reach competent lap times in 11 minutes of physical experience, removing the usual reliance on physics simulators and domain randomization. If correct, the result indicates that uncertainty-aware model-based reinforcement learning can manage data collection and policy improvement on hardware without added safety engineering. A reader would care because it shortens the path from robot design to functional control for systems where building faithful simulations is difficult or slow.

Core claim

Infoprop Dyna, a state-of-the-art uncertainty-aware model-based reinforcement learning framework, enables the Mini Wheelbot, an underactuated unicycle robot, to learn racing around a track within 11 minutes of real-world experience without any simulator.

What carries the argument

Infoprop Dyna: the uncertainty-aware model-based RL framework that supports safe data collection and policy learning from direct physical interactions on unstable systems.
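The summary above does not spell out how Infoprop Dyna uses its uncertainty estimates. As a generic illustration of the idea behind uncertainty-aware Dyna-style methods, the sketch below rolls a learned model forward only while an ensemble of models agrees, and stops generating synthetic data once disagreement exceeds a threshold. Everything here is an assumption made for illustration: the scalar state, the ensemble-disagreement proxy, the threshold, and the names `gated_rollout` and `make_toy_ensemble` are hypothetical and are not taken from the paper.

```python
import statistics
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float, float], float]   # (state, action) -> predicted next state
Transition = Tuple[float, float, float]   # (state, action, next_state)


def gated_rollout(
    ensemble: Sequence[Model],
    policy: Callable[[float], float],
    start_state: float,
    max_horizon: int,
    disagreement_threshold: float,
) -> List[Transition]:
    """Roll the learned model forward until ensemble members disagree too much."""
    transitions: List[Transition] = []
    state = start_state
    for _ in range(max_horizon):
        action = policy(state)
        predictions = [model(state, action) for model in ensemble]
        # Standard deviation across members: a crude stand-in for model uncertainty.
        disagreement = statistics.pstdev(predictions)
        if disagreement > disagreement_threshold:
            break  # stop before synthetic data leaves the region the model can be trusted in
        next_state = statistics.fmean(predictions)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions


def make_toy_ensemble() -> List[Model]:
    """Hypothetical members whose gains differ, so disagreement grows with |state|."""
    gains = (0.9, 1.0, 1.1)
    return [lambda s, a, g=g: g * s + a for g in gains]


if __name__ == "__main__":
    ensemble = make_toy_ensemble()
    constant_push = lambda s: 0.5  # hypothetical policy that drifts away from the origin
    synthetic = gated_rollout(
        ensemble, constant_push, start_state=0.0,
        max_horizon=20, disagreement_threshold=0.2,
    )
    print(len(synthetic), "synthetic transitions kept for policy training")
```

In a Dyna-style loop, the transitions returned here would be added to the buffer used for policy updates, while real-robot data would be used to refit the ensemble. The design point is that rollout length adapts to where the model is reliable instead of being fixed in advance.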

If this is right

High-performance control tasks become reachable on physical hardware in minutes rather than after extensive simulator tuning.
Underactuated unstable robots can learn complex behaviors without custom safety layers or pre-training.
Model-based RL with uncertainty estimates can scale to real-world robotic systems where accurate digital models are unavailable.
Wall-clock learning time drops sharply when data is collected directly on the target platform.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same framework could shorten development cycles for other mobile robots with similar instability.
Evaluating the learned policy on altered track geometries would test how well the method transfers beyond the original setup.
Adding onboard sensing could allow the approach to handle tasks with greater environmental variability.

Load-bearing premise

The uncertainty-aware model-based RL framework can safely collect sufficient real-world data on an unstable robot without prior simulation or hand-crafted safety constraints.

What would settle it

Repeated trials in which the robot fails to complete laps or requires frequent manual resets within the same real-world time budget would show the method does not deliver the claimed learning speed and safety.
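One way to make this criterion concrete is to log each independent learning run and count how many stay within the time budget, complete laps, and need few manual resets. The sketch below does that under stated assumptions: the `Trial` fields, the function name `success_fraction`, and the thresholds passed to it are hypothetical, and only the 11-minute budget comes from the paper's claim. The example values in the usage section are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Trial:
    """One independent learning run (hypothetical log format)."""
    minutes_of_experience: float  # real-world interaction time used for learning
    laps_completed: int           # laps finished by the final policy during evaluation
    manual_resets: int            # times a person had to reset the robot


def success_fraction(
    trials: Sequence[Trial],
    budget_minutes: float,
    min_laps: int,
    max_resets: int,
) -> float:
    """Fraction of runs that met the budget, lap, and reset criteria together."""
    if not trials:
        raise ValueError("need at least one trial")
    successes = sum(
        1
        for t in trials
        if t.minutes_of_experience <= budget_minutes
        and t.laps_completed >= min_laps
        and t.manual_resets <= max_resets
    )
    return successes / len(trials)


if __name__ == "__main__":
    # Placeholder values for illustration only, not results from the paper.
    example_trials = [
        Trial(minutes_of_experience=10.5, laps_completed=3, manual_resets=0),
        Trial(minutes_of_experience=11.0, laps_completed=2, manual_resets=1),
        Trial(minutes_of_experience=14.0, laps_completed=3, manual_resets=0),
        Trial(minutes_of_experience=11.0, laps_completed=0, manual_resets=0),
    ]
    fraction = success_fraction(example_trials, budget_minutes=11.0, min_laps=1, max_resets=1)
    print(f"fraction of runs meeting all criteria: {fraction:.2f}")
```

A low fraction across repeated runs would count against the claimed learning speed and safety. What counts as "frequent" resets is a judgment call that the `max_resets` parameter leaves to the evaluator.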

Figures

Figures reproduced from arXiv: 2605.01096 by Devdutt Subhasish, Henrik Hose, Sebastian Trimpe.

**Figure 2.** Distributed training schematic. A learned dynamics model generates […]

**Figure 3.** Trajectory from the AMPC (left) and final racing agent (right).

Original abstract

Reinforcement Learning (RL) has the potential to enable robots with fast, nonlinear, and unstable dynamics to reach the limits of their performance. However, most recent advances rely on carefully designed physics-based simulators and domain randomization to achieve successful sim-to-real transfer within reasonable wall-clock time. In this work, we bypass the need for such simulators and demonstrate that Infoprop Dyna, a state-of-the-art uncertainty-aware model-based reinforcement learning (MBRL) framework, can enable robots to learn directly from real-world interactions. Using Infoprop Dyna, the Mini Wheelbot, an underactuated unicycle robot, learns to race around a track within 11 minutes of real-world experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that Infoprop Dyna, an uncertainty-aware model-based RL framework, enables the Mini Wheelbot (an underactuated unicycle) to learn track racing in 11 minutes of real-world experience, bypassing both physics simulators and hand-crafted safety constraints.

Significance. If the empirical results and supporting analysis hold, the work would advance MBRL by showing direct real-world learning is feasible for fast, unstable robotic systems, reducing dependence on sim-to-real pipelines.

Major comments (2)

Abstract: the headline result (11 min real-world track racing with no simulator) is stated without any quantitative metrics, success rates, lap times, variance, or baseline comparisons, preventing assessment of the claimed outcome.
The central claim that uncertainty estimates alone enable safe data collection on the underactuated unicycle from the first rollout is load-bearing yet unexamined; the manuscript provides no analysis, failure-mode statistics, or description of how model uncertainty prevents falls given the narrow stable manifold of the hardware.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of results and the supporting analysis. We address each point below and have revised the manuscript to incorporate the suggested improvements.

Referee: Abstract: the headline result (11 min real-world track racing with no simulator) is stated without any quantitative metrics, success rates, lap times, variance, or baseline comparisons, preventing assessment of the claimed outcome.

Authors: We agree that the abstract would be more informative with quantitative details. The revised abstract now includes success rates across trials, average lap times, associated variance, and comparisons to relevant baselines. revision: yes

Referee: The central claim that uncertainty estimates alone enable safe data collection on the underactuated unicycle from the first rollout is load-bearing yet unexamined; the manuscript provides no analysis, failure-mode statistics, or description of how model uncertainty prevents falls given the narrow stable manifold of the hardware.

Authors: This observation is fair. While the original manuscript describes the uncertainty-aware planning in Infoprop Dyna, it does not include a dedicated examination of failure modes or the precise interaction with the unicycle's stable manifold. We have added a new subsection with failure-mode statistics from the real-world rollouts and an explanation of how the model's uncertainty estimates guide safe exploration from the first trial without hand-crafted constraints. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hardware demonstration without self-referential derivation

The paper reports an experimental result in which Infoprop Dyna enables an underactuated unicycle robot to learn track racing after 11 minutes of real-world interaction, bypassing simulators. No mathematical derivation chain, equations, or parameter-fitting procedure is described that reduces a claimed prediction or uniqueness result to its own inputs by construction. The central claim rests on measured wall-clock time and task success on physical hardware, which is externally falsifiable and independent of any self-citation or ansatz. This is a standard empirical validation in robotics RL and remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the provided abstract; the central claim is an empirical demonstration whose supporting details are absent.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] E. Kaufmann, L. Bauersfeld, A. Loquercio, M. Müller, V. Koltun, and D. Scaramuzza, “Champion-level drone racing using deep reinforcement learning,” Nature, 2023.

[2] OpenAI, I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas, J. Schneider, N. Tezak, J. Tworek, P. Welinder, L. Weng, Q. Yuan, W. Zaremba, and L. Zhang, “Solving Rubik’s Cube with a Robot Hand,” 2019.

[3] N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to Walk in Minutes Using Massively Parallel Deep Reinforcement Learning,” in Proceedings of the 5th Conference on Robot Learning, 2022.

[4] L. M. Smith, I. Kostrikov, and S. Levine, “Demonstrating A Walk in the Park: Learning to Walk in 20 Minutes With Model-Free Reinforcement Learning,” Robotics: Science and Systems XIX, 2023.

[5] M. P. Deisenroth and C. E. Rasmussen, “PILCO: a model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on Machine Learning, 2011.

[6] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to Control: Learning Behaviors by Latent Imagination,” in International Conference on Learning Representations, 2020.

[7] B. Frauenknecht, A. Eisele, D. Subhasish, F. Solowjow, and S. Trimpe, “Trust the Model Where It Trusts Itself - Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption,” in International Conference on Machine Learning, 2024.

[8] B. Frauenknecht, D. Subhasish, F. Solowjow, and S. Trimpe, “On Rollouts in Model-Based Reinforcement Learning,” in International Conference on Learning Representations, 2025.

[9] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.

[10] H. Hose, J. Weisgerber, and S. Trimpe, “The Mini Wheelbot: A Testbed for Learning-based Balancing, Flips, and Articulated Driving,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), 2025.

[11] “Quickstart: How to think in JAX — JAX documentation.” [Online]. Available: https://docs.jax.dev/en/latest/notebooks/thinking_in_jax.html

[12] “google/brax.” [Online]. Available: https://github.com/google/brax
