
arxiv: 2604.07672 · v1 · submitted 2026-04-09 · 💻 cs.RO

Reset-Free Reinforcement Learning for Real-World Agile Driving: An Empirical Study

Pith reviewed 2026-05-10 18:25 UTC · model grok-4.3

classification 💻 cs.RO
keywords reset-free reinforcement learning · agile driving · real-world RL · residual learning · MPPI control · TD-MPC2 · sim-to-real gap · autonomous vehicle

The pith

Only TD-MPC2 outperforms the MPPI baseline in real-world reset-free RL for agile driving, while simulation favors SAC with residuals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper runs an empirical comparison of reset-free reinforcement learning on a physical 1/10-scale vehicle that trains continuously on a slippery indoor track without manual resets. It employs MPPI control both to recover the vehicle after failures and as the base policy for residual learning, then evaluates PPO, SAC, and TD-MPC2 with and without the residual component. The central finding is a clear simulation-to-reality gap: SAC plus residuals produces the highest returns in simulation, yet only TD-MPC2 reliably beats the MPPI baseline once the policies run on hardware, where residuals often reduce performance. The study matters because it shows that intuitions derived from simulation about which algorithms work for high-speed, near-friction-limit driving do not carry over when unmodeled dynamics, delays, and physical recovery are present.
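The training scheme described above can be sketched as a loop that hands control to a recovery controller whenever the learner fails, rather than waiting for a human to reposition the vehicle. The following is a toy illustration only, not the authors' implementation: `ToyTrack`, `learner_action`, `recovery_action`, and all numeric thresholds are hypothetical stand-ins for the vehicle, the RL policy, and the MPPI controller.

```python
import random


class ToyTrack:
    """Hypothetical 1-D stand-in for the vehicle: state is the lateral offset."""
    LIMIT = 1.0  # assumed half-width of the drivable area

    def __init__(self):
        self.offset = 0.0

    def step(self, steer):
        # Steering shifts the offset; Gaussian noise mimics a slippery surface.
        self.offset += steer + random.gauss(0.0, 0.05)
        return self.offset

    def off_track(self):
        return abs(self.offset) > self.LIMIT


def learner_action(offset):
    """Placeholder for the RL policy being trained (here: random exploration)."""
    return random.uniform(-0.5, 0.5)


def recovery_action(offset):
    """Placeholder for the reset controller (MPPI in the paper, a P-controller here)."""
    return -0.5 * offset


def run_reset_free(total_steps, seed=0):
    random.seed(seed)
    env = ToyTrack()
    mode = "learn"
    transitions, recoveries = [], 0
    for _ in range(total_steps):
        offset = env.offset
        if mode == "learn":
            action = learner_action(offset)
            next_offset = env.step(action)
            transitions.append((offset, action, next_offset))  # data for the RL update
            if env.off_track():
                mode = "recover"  # failure: hand over to the reset controller
                recoveries += 1
        else:
            env.step(recovery_action(offset))
            if abs(env.offset) < 0.1:
                mode = "learn"  # back near the centerline: resume learning
    return transitions, recoveries


transitions, recoveries = run_reset_free(2000)
print(len(transitions), "learning steps,", recoveries, "autonomous recoveries")
```

Only the steps taken in "learn" mode are stored for the policy update in this sketch; whether recovery data should also be used for learning is a design choice that the summary above does not settle.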

Core claim

The paper claims that reset-free RL for real-world agile driving exhibits a pronounced sim-to-real discrepancy. SAC with residual learning attains the highest simulated returns, but on the physical platform only TD-MPC2 consistently exceeds the MPPI baseline; residual learning, helpful in simulation, fails to transfer and can degrade real-world results. These outcomes arise because complex vehicle dynamics, actuation delays, and tire-friction limits prevent accurate simulation and direct policy transfer, revealing challenges unique to training in the wild.

What carries the argument

MPPI control used simultaneously as the autonomous reset policy and as the base policy for residual learning, enabling side-by-side testing of PPO, SAC, and TD-MPC2 across simulation and physical slippery-track runs.
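Residual learning is commonly formulated as adding a learned correction to the output of a base controller. The sketch below shows that composition under stated assumptions: the `residual_scale` factor, the action bounds, and the two-element `[steering, throttle]` action layout are illustrative choices, and the paper's exact formulation is not given in this summary.

```python
def compose_residual_action(base_action, residual, residual_scale=0.2,
                            low=-1.0, high=1.0):
    """Add a scaled learned residual to the base controller's action, then clip."""
    combined = []
    for a_base, a_res in zip(base_action, residual):
        a = a_base + residual_scale * a_res
        combined.append(max(low, min(high, a)))  # keep the command within bounds
    return combined


# Hypothetical [steering, throttle] command from the base controller,
# and a hypothetical residual output from the RL policy in [-1, 1].
base = [0.30, 0.90]
res = [-0.50, 1.00]
print(compose_residual_action(base, res))
```

With a zero residual the function returns the base action unchanged, which is why a residual policy can start from the base controller's behavior.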

If this is right

  • Simulation rankings cannot be trusted to select RL methods for reset-free physical deployment in agile driving.
  • Residual learning on top of a recovery policy needs new mechanisms to remain beneficial once transferred to hardware.
  • TD-MPC2 shows relative robustness for continuous on-robot learning when vehicle dynamics are uncertain.
  • Real-world RL for high-speed tasks must incorporate explicit handling of unmodeled effects that simulation omits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model-based approaches like TD-MPC2 may tolerate the noise and partial observability of physical platforms better than purely model-free methods.
  • Testing the same algorithms on different surfaces or vehicle scales would reveal whether the observed gap is track-specific.
  • Hybrid methods that adapt the base policy online could reduce reliance on a fixed MPPI recovery policy.

Load-bearing premise

That performance differences measured on one specific 1/10-scale indoor slippery track, one MPPI recovery policy, and the chosen three RL algorithms will generalize to other real-world agile driving conditions.

What would settle it

Finding that SAC with residual learning produces higher lap completion rates or returns than TD-MPC2 when both are run reset-free on the same physical vehicle and track would falsify the reported real-world ranking.

Figures

Figures reproduced from arXiv: 2604.07672 by Hirotaka Hosogaya, Kohei Honda.

Figure 1: Overview of the reset-free RL training process for real-world agile driving. The forward policy …
Figure 2: Experimental environments. (a) The real-world circular track …
Figure 3: Episodic reward curve during training in the simulation envi…
Figure 4: Episodic reward curve during training in the real-world environment.
Figure 5: Comparison of driving trajectories during inference after 200 …
Original abstract

This paper presents an empirical study of reset-free reinforcement learning (RL) for real-world agile driving, in which a physical 1/10-scale vehicle learns continuously on a slippery indoor track without manual resets. High-speed driving near the limits of tire friction is particularly challenging for learning-based methods because complex vehicle dynamics, actuation delays, and other unmodeled effects hinder both accurate simulation and direct sim-to-real transfer of learned policies. To enable autonomous training on a physical platform, we employ Model Predictive Path Integral control (MPPI) as both the reset policy and the base policy for residual learning, and systematically compare three representative RL algorithms, i.e., PPO, SAC, and TD-MPC2, with and without residual learning in simulation and real-world experiments. Our results reveal a clear gap between simulation and real-world: SAC with residual learning achieves the highest returns in simulation, yet only TD-MPC2 consistently outperforms the MPPI baseline on the physical platform. Moreover, residual learning, while clearly beneficial in simulation, fails to transfer its advantage to the real world and can even degrade performance. These findings reveal that reset-free RL in the real world poses unique challenges absent from simulation, calling for further algorithmic development tailored to training in the wild.
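For readers unfamiliar with MPPI, the core update samples perturbed control sequences, scores each by rollout cost, and averages the perturbations with weights proportional to exp(-cost / λ). The sketch below is a simplified illustration on a toy double integrator, not the controller used in the paper: the dynamics, the cost function, and the parameters `samples`, `sigma`, and `lam` are all assumptions, and the control-cost coupling term found in full MPPI derivations is omitted.

```python
import math
import random


def rollout_cost(state, controls, target, dt=0.1):
    """Simulate a toy double integrator and sum squared distance to the target."""
    pos, vel = state
    cost = 0.0
    for u in controls:
        vel += u * dt
        pos += vel * dt
        cost += (pos - target) ** 2 + 0.01 * u * u
    return cost


def mppi_step(state, nominal, target, samples=64, sigma=1.0, lam=1.0):
    """One MPPI update of the nominal control sequence."""
    noises, costs = [], []
    for _ in range(samples):
        eps = [random.gauss(0.0, sigma) for _ in nominal]
        noises.append(eps)
        perturbed = [u + e for u, e in zip(nominal, eps)]
        costs.append(rollout_cost(state, perturbed, target))
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    # Weighted average of the sampled perturbations, added to the nominal plan.
    return [u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
            for t, u in enumerate(nominal)]


random.seed(0)
state, target = (0.0, 0.0), 5.0
plan = [0.0] * 15
new_plan = mppi_step(state, plan, target)
print(rollout_cost(state, plan, target), rollout_cost(state, new_plan, target))
```

In a receding-horizon loop, only the first control of the updated plan would be applied before the plan is shifted and the update repeated.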

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study of reset-free RL for agile driving on a 1/10-scale physical vehicle on a slippery indoor track. Using MPPI as both reset policy and base policy for residual learning, it systematically compares PPO, SAC, and TD-MPC2 (with and without residuals) in simulation and on hardware. The central claims are that SAC+residual achieves the highest returns in simulation, only TD-MPC2 consistently outperforms the MPPI baseline on the physical platform, and residual learning fails to transfer (sometimes degrading performance), revealing unique real-world challenges absent from simulation.

Significance. If the reported ordering and transfer failures hold under rigorous replication, the work supplies direct hardware evidence of sim-to-real gaps in high-speed vehicle control, a practically important domain. The systematic inclusion of residual learning ablations and the use of a recovery policy to enable continuous training are concrete strengths that could guide future real-world RL development.

major comments (2)
  1. [Results / Experiments] Results section (and abstract): the performance-gap claims (SAC+residual best in sim; only TD-MPC2 beats MPPI on hardware; residual learning fails to transfer) are presented without reported trial counts, error bars, statistical tests, or explicit data-exclusion rules. This leaves the directional findings only partially supported, as noted in the soundness assessment.
  2. [Experiments and Discussion] The central empirical ordering and the conclusion that 'reset-free RL in the real world poses unique challenges' rest on a single 1/10-scale track, fixed MPPI recovery policy, and one set of dynamics (slippery surface, actuation delays). No ablations on track friction, sensor noise profiles, or alternative base policies are shown; if the observed gaps are specific to these interactions, the broader claim about real-world challenges would not generalize without additional testbeds.
minor comments (2)
  1. [Preliminaries] Notation for the three RL algorithms and residual formulation could be introduced earlier with a compact table of hyperparameters to aid reproducibility.
  2. [Figures] Figure captions and axis labels in the simulation vs. hardware comparison plots should explicitly state the number of seeds or runs per curve.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects of statistical rigor and generalizability that we will address in the revision. Below we respond point by point to the major comments.

  1. Referee: [Results / Experiments] Results section (and abstract): the performance-gap claims (SAC+residual best in sim; only TD-MPC2 beats MPPI on hardware; residual learning fails to transfer) are presented without reported trial counts, error bars, statistical tests, or explicit data-exclusion rules. This leaves the directional findings only partially supported, as noted in the soundness assessment.

    Authors: We agree that the current presentation would be strengthened by explicit statistical reporting. In the revised manuscript we will add the number of independent trials per condition (we conducted five runs for each algorithm/configuration), error bars showing standard deviation on all performance plots, and the results of statistical tests (paired t-tests with Bonferroni correction) for the key comparisons between TD-MPC2 and the MPPI baseline as well as between SAC+residual and the other methods. We will also include a clear statement of data-exclusion criteria (trials terminated due to hardware safety triggers or track-boundary violations). These additions will make the directional claims more robustly supported.

    revision: yes

  2. Referee: [Experiments and Discussion] The central empirical ordering and the conclusion that 'reset-free RL in the real world poses unique challenges' rest on a single 1/10-scale track, fixed MPPI recovery policy, and one set of dynamics (slippery surface, actuation delays). No ablations on track friction, sensor noise profiles, or alternative base policies are shown; if the observed gaps are specific to these interactions, the broader claim about real-world challenges would not generalize without additional testbeds.

    Authors: We acknowledge that all hardware results were obtained on one physical platform with a fixed MPPI recovery policy and a single set of dynamics. This choice was deliberate: the slippery indoor track combined with actuation delays creates a demanding reset-free setting that is representative of the challenges encountered in agile driving. Nevertheless, we agree that the manuscript should not imply universality. In the revision we will (i) explicitly qualify the scope of the claims in the abstract, introduction, and conclusion, (ii) expand the discussion section to enumerate the specific conditions under which the observed sim-to-real gap and residual-learning failure were measured, and (iii) add a dedicated limitations paragraph that recommends future multi-testbed studies varying friction, sensor noise, and base-policy choices. No new hardware experiments will be added, but the textual clarifications will prevent overgeneralization while preserving the value of the reported empirical evidence.

    revision: partial
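The statistical reporting described in the first response (paired t-tests with Bonferroni correction) can be sketched with the standard library alone. This is an illustrative sketch, not the authors' analysis code: the function names are hypothetical, the p-value is obtained by numerically integrating the Student's t density, and the sample numbers are made-up placeholders rather than results from the paper.

```python
import math
import statistics


def two_sided_p(t, df, intervals=2000):
    """P(|T| >= |t|) for Student's t, via Simpson integration of the density."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / intervals  # intervals must be even for Simpson's rule
    area = pdf(0.0) + pdf(upper)
    for i in range(1, intervals):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * area * h / 3.0)


def paired_t_test(a, b):
    """Two-sided paired t-test; assumes the paired differences are not all equal."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, two_sided_p(t, n - 1)


def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if p <= alpha / (number of comparisons)."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Made-up placeholder returns for five paired runs; NOT data from the paper.
method_a = [3.0, 4.0, 5.0, 6.0, 7.0]
method_b = [1.0, 2.0, 3.0, 4.0, 6.0]
t_stat, p_value = paired_t_test(method_a, method_b)
print(t_stat, p_value, bonferroni([p_value, 0.03, 0.2]))
```

With only five runs per condition the test has four degrees of freedom, so the Bonferroni-adjusted threshold becomes demanding as the number of comparisons grows.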

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of RL algorithms


The paper is an empirical study that runs PPO, SAC, and TD-MPC2 (with and without residual learning) against an MPPI baseline in simulation and on one physical 1/10-scale track. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps exist. All reported orderings (SAC+residual best in sim, TD-MPC2 best on hardware, residual learning failing to transfer) are direct experimental observations against an external baseline. The single-track limitation affects generalizability but does not create circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on experimental outcomes from standard RL algorithms and an existing MPPI controller applied to a physical platform; no new free parameters, axioms beyond standard MDP assumptions, or invented entities are introduced.

axioms (1)
  • Domain assumption: The driving task can be formulated as a Markov decision process with the chosen state and action spaces.
    Implicit when applying PPO, SAC, and TD-MPC2 to the vehicle control problem.
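For reference, the assumption above corresponds to the standard discounted MDP formulation, written here in generic notation; the symbols are the usual textbook ones and are not taken from the paper.

```latex
% Standard discounted MDP; generic symbols, not the paper's notation.
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad
s_{t+1} \sim P(\cdot \mid s_t, a_t), \qquad
J(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \right]
```

Under this formulation the next state depends only on the current state and action, so effects such as actuation delay fit the assumption only if the chosen state captures enough recent history.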

pith-pipeline@v0.9.0 · 5519 in / 1179 out tokens · 55176 ms · 2026-05-10T18:25:39.965091+00:00 · methodology

