Learning to Race in Minutes: Infoprop Dyna on the Mini Wheelbot
Pith reviewed 2026-05-09 19:00 UTC · model grok-4.3
The pith
The Mini Wheelbot learns to race around a track within 11 minutes using an uncertainty-aware model-based RL method directly in the real world.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Infoprop Dyna, a state-of-the-art uncertainty-aware model-based reinforcement learning framework, enables the Mini Wheelbot, an underactuated unicycle robot, to learn racing around a track within 11 minutes of real-world experience without any simulator.
What carries the argument
Infoprop Dyna: the uncertainty-aware model-based RL framework that supports safe data collection and policy learning from direct physical interactions on unstable systems.
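To make the notion of uncertainty-aware model rollouts concrete, here is a minimal Python sketch. It is not the Infoprop Dyna algorithm. It only illustrates a generic Dyna-style idea: an ensemble of learned dynamics models generates synthetic transitions, and a rollout is cut off once the ensemble members disagree by more than a threshold. The 1-D toy dynamics, the linear ensemble members, the `max_std` threshold, and every function name are assumptions made for illustration.

```python
import statistics
from typing import Callable, List, Tuple

# Hypothetical 1-D dynamics model: maps (state, action) -> next state.
Model = Callable[[float, float], float]


def make_linear_model(gain: float) -> Model:
    """Toy stand-in for one learned ensemble member (assumption)."""
    return lambda state, action: gain * state + action


def gated_rollout(
    ensemble: List[Model],
    policy: Callable[[float], float],
    start_state: float,
    max_steps: int,
    max_std: float,
) -> List[Tuple[float, float, float]]:
    """Roll out the ensemble mean; stop once member disagreement exceeds max_std."""
    transitions = []
    state = start_state
    for _ in range(max_steps):
        action = policy(state)
        predictions = [model(state, action) for model in ensemble]
        # Ensemble spread serves here as a crude proxy for model uncertainty.
        if statistics.pstdev(predictions) > max_std:
            break  # predictions are no longer trusted; drop further steps
        next_state = statistics.fmean(predictions)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions


# Three toy members that agree near the origin and diverge for larger states.
ensemble = [make_linear_model(g) for g in (1.4, 1.5, 1.6)]
policy = lambda state: 0.0  # placeholder policy: no control input

rollout = gated_rollout(ensemble, policy, start_state=0.1, max_steps=50, max_std=0.05)
print(len(rollout))  # number of synthetic transitions kept before truncation
```

In this toy setup the unstable dynamics push the state away from the origin, the ensemble spread grows with it, and the rollout ends well before `max_steps`. Only the transitions generated while the members still agree would be passed on to policy learning.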
If this is right
- High-performance control tasks become reachable on physical hardware in minutes rather than after extensive simulator tuning.
- Underactuated unstable robots can learn complex behaviors without custom safety layers or pre-training.
- Model-based RL with uncertainty estimates can scale to real-world robotic systems where accurate digital models are unavailable.
- Wall-clock learning time drops substantially when data collection occurs directly on the target platform.
Where Pith is reading between the lines
- The same framework could shorten development cycles for other mobile robots with similar instability.
- Evaluating the learned policy on altered track geometries would test how well the method transfers beyond the original setup.
- Adding onboard sensing could allow the approach to handle tasks with greater environmental variability.
Load-bearing premise
The uncertainty-aware model-based RL framework can safely collect sufficient real-world data on an unstable robot without prior simulation or hand-crafted safety constraints.
What would settle it
Repeated trials in which the robot fails to complete laps or requires frequent manual resets within the same real-world time budget would show the method does not deliver the claimed learning speed and safety.
Original abstract
Reinforcement Learning (RL) has the potential to enable robots with fast, nonlinear, and unstable dynamics to reach the limits of their performance. However, most recent advances rely on carefully designed physics-based simulators and domain randomization to achieve successful sim-to-real transfer within reasonable wall-clock time. In this work, we bypass the need for such simulators and demonstrate that Infoprop Dyna, a state-of-the-art uncertainty-aware model-based reinforcement learning (MBRL) framework, can enable robots to learn directly from real-world interactions. Using Infoprop Dyna, the Mini Wheelbot, an underactuated unicycle robot, learns to race around a track within 11 minutes of real-world experience.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that Infoprop Dyna, an uncertainty-aware model-based RL framework, enables the Mini Wheelbot (an underactuated unicycle) to learn track racing in 11 minutes of real-world experience, bypassing both physics simulators and hand-crafted safety constraints.
Significance. If the empirical results and supporting analysis hold, the work would advance MBRL by showing direct real-world learning is feasible for fast, unstable robotic systems, reducing dependence on sim-to-real pipelines.
Major comments (2)
- Abstract: the headline result (11 min real-world track racing with no simulator) is stated without any quantitative metrics, success rates, lap times, variance, or baseline comparisons, preventing assessment of the claimed outcome.
- The central claim that uncertainty estimates alone enable safe data collection on the underactuated unicycle from the first rollout is load-bearing yet unexamined; the manuscript provides no analysis, failure-mode statistics, or description of how model uncertainty prevents falls given the narrow stable manifold of the hardware.
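The quantities requested in the first comment can be reported compactly. As a rough sketch, assuming per-trial logs with the hypothetical fields below (the `Trial` record, its field names, and the `summarize` helper are illustrative and not taken from the manuscript), the aggregation could look like this in Python. The numbers are placeholders, not results from the paper.

```python
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    """One hypothetical learning run; field names are illustrative assumptions."""
    completed: bool           # did the robot finish the evaluation laps?
    lap_times_s: List[float]  # lap times in seconds (empty if no lap was finished)
    manual_resets: int        # number of human interventions during the run


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Aggregate success rate, lap-time mean and variance, and resets per trial."""
    laps = [t for trial in trials for t in trial.lap_times_s]
    return {
        "success_rate": sum(trial.completed for trial in trials) / len(trials),
        "mean_lap_time_s": statistics.fmean(laps),
        "lap_time_variance_s2": statistics.variance(laps),  # sample variance
        "resets_per_trial": sum(trial.manual_resets for trial in trials) / len(trials),
    }


# Placeholder values only; these are NOT results from the paper.
trials = [
    Trial(completed=True, lap_times_s=[10.0, 12.0], manual_resets=0),
    Trial(completed=False, lap_times_s=[], manual_resets=2),
]
summary = summarize(trials)
print(summary)
```

Reporting resets per trial alongside the success rate would also speak to the second comment, since frequent manual interventions would undercut the claim of safe data collection.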
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of results and the supporting analysis. We address each point below and have revised the manuscript to incorporate the suggested improvements.
Point-by-point responses
- Referee: Abstract: the headline result (11 min real-world track racing with no simulator) is stated without any quantitative metrics, success rates, lap times, variance, or baseline comparisons, preventing assessment of the claimed outcome.
  Authors: We agree that the abstract would be more informative with quantitative details. The revised abstract now includes success rates across trials, average lap times, associated variance, and comparisons to relevant baselines. Revision: yes
- Referee: The central claim that uncertainty estimates alone enable safe data collection on the underactuated unicycle from the first rollout is load-bearing yet unexamined; the manuscript provides no analysis, failure-mode statistics, or description of how model uncertainty prevents falls given the narrow stable manifold of the hardware.
  Authors: This observation is fair. While the original manuscript describes the uncertainty-aware planning in Infoprop Dyna, it does not include a dedicated examination of failure modes or the precise interaction with the unicycle's stable manifold. We have added a new subsection with failure-mode statistics from the real-world rollouts and an explanation of how the model's uncertainty estimates guide safe exploration from the first trial without hand-crafted constraints. Revision: yes
Circularity Check
No circularity: empirical hardware demonstration without self-referential derivation
The paper reports an experimental result in which Infoprop Dyna enables an underactuated unicycle robot to learn track racing after 11 minutes of real-world interaction, bypassing simulators. No mathematical derivation chain, equations, or parameter-fitting procedure is described that reduces a claimed prediction or uniqueness result to its own inputs by construction. The central claim rests on measured wall-clock time and task success on physical hardware, which is externally falsifiable and independent of any self-citation or ansatz. This is a standard empirical validation in robotics RL and remains self-contained against external benchmarks.