Self-Paced Curriculum Reinforcement Learning for Autonomous Superbike Racing in Simulation

Carlo D'Eramo; Jacopo Essenziale; Luca Ghisi; Matteo Luperto

arxiv: 2606.09236 · v1 · pith:7VKAWQD4new · submitted 2026-06-08 · 💻 cs.RO · cs.AI

Pith reviewed 2026-06-27 16:21 UTC · model grok-4.3

keywords: reinforcement learning · curriculum learning · autonomous racing · motorcycle · soft actor-critic · simulation · balance control

The pith

Integrating self-paced curriculum learning with SAC trains autonomous superbike agents more efficiently than SAC alone in a physics simulator.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a reinforcement learning agent can be trained to race a superbike by pairing Soft Actor-Critic with self-paced curriculum learning inside a Unity-based simulator. The curriculum automatically raises task difficulty according to the agent's current performance, while the state includes lean-angle history and the reward penalizes falls and instability. A reader would care because motorbike racing adds balance and reactive steering demands absent from prior four-wheeled autonomous racing work, and the experiments supply the first reported baseline comparing curriculum and non-curriculum RL on this task.

Core claim

The authors argue that SPDL combined with SAC produces agents that reach higher training efficiency, lower lap times, and greater driving stability than plain SAC, and that these gains appear consistently across several tracks and motorbike models inside the VRider SBK simulator.

What carries the argument

Self-Paced curriculum Deep reinforcement Learning (SPDL) integrated with Soft Actor-Critic, which automatically generates a sequence of progressively harder racing tasks from the agent's measured performance.
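The curriculum idea can be made concrete with a deliberately simplified sketch. It is not the SPDL update used in the paper or in Klink et al. [5], which optimizes the task distribution under a KL-divergence constraint; the capped step below is only a stand-in for that constraint. The class name, the one-dimensional difficulty context, the performance threshold, and all default values are assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class GaussianCurriculum:
    """1-D Gaussian over a task-difficulty context (hypothetical parameterization)."""

    mean: float            # current centre of the sampled difficulty
    std: float             # current spread of the sampled difficulty
    target_mean: float     # centre of the target task distribution
    target_std: float      # spread of the target task distribution
    perf_threshold: float  # mean return required before difficulty may increase
    max_step: float = 0.1  # cap on the mean shift per update (stand-in for a KL bound)

    def sample(self, rng: random.Random) -> float:
        """Draw one training context from the current curriculum distribution."""
        return rng.gauss(self.mean, self.std)

    def update(self, mean_return: float) -> None:
        """Move toward the target distribution only when the agent performs well."""
        if mean_return < self.perf_threshold:
            return  # agent is not ready: keep training on the current tasks
        gap = self.target_mean - self.mean
        step = max(-self.max_step, min(self.max_step, gap))
        self.mean += step
        # Interpolate the spread toward the target spread at the same pace.
        frac = 1.0 if gap == 0 else step / gap
        self.std += frac * (self.target_std - self.std)
```

In a training loop, one would call `sample` to configure each episode and `update` after each evaluation window with the agent's mean return, so the sampled tasks drift toward the target distribution only as fast as performance allows.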

If this is right

The same SPDL-SAC combination works without any hand-designed sequence of tasks.
State features that track lean-angle history plus rewards that penalize instability are sufficient to manage two-wheeled dynamics.
Performance advantages hold when the same method is tested on multiple tracks and multiple motorbike models.
The resulting agents provide an initial quantitative baseline that later work can compare against.
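The state and reward design named in the list above can be sketched as follows. This is an illustration only: the abstract does not give the history length, the feature layout, or the reward terms, so every name, weight, and unit here (`LeanHistory`, `build_observation`, `shaped_reward`, `w_lean_rate`, `fall_penalty`) is hypothetical.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple


class LeanHistory:
    """Fixed-length buffer of recent lean angles in radians, oldest first."""

    def __init__(self, length: int) -> None:
        self._buf: Deque[float] = deque([0.0] * length, maxlen=length)

    def push(self, lean_angle: float) -> None:
        """Append the newest lean angle; the oldest one is dropped automatically."""
        self._buf.append(lean_angle)

    def as_list(self) -> List[float]:
        return list(self._buf)


def build_observation(
    speed: float,
    lean_history: LeanHistory,
    course_points: Sequence[Tuple[float, float]],
) -> List[float]:
    """Concatenate speed, lean-angle history, and flattened upcoming course points."""
    flat_points = [coord for point in course_points for coord in point]
    return [speed] + lean_history.as_list() + flat_points


def shaped_reward(
    progress_m: float,
    lean_rate: float,
    fell: bool,
    w_lean_rate: float = 0.1,   # assumed weight on abrupt lean changes
    fall_penalty: float = 10.0,  # assumed one-off penalty for a fall
) -> float:
    """Reward track progress, penalize abrupt lean changes and falls."""
    reward = progress_m - w_lean_rate * abs(lean_rate)
    if fell:
        reward -= fall_penalty
    return reward
```

Keeping the history as part of the observation lets a feed-forward policy infer lean rate and trend without a recurrent network, which is one plausible reason for the design; the paper's own motivation may differ.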

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the simulator-to-reality gap can be closed, the learned policies supply candidate control laws for physical superbikes.
The automatic curriculum construction may transfer to other vehicles whose stability depends on continuous lean or tilt control.
Adding visual or tire-force observations to the existing proprioceptive state could further reduce falls during high-speed cornering.

Load-bearing premise

The VRider SBK Unity simulator supplies a physics model accurate enough that policies successful inside it will reflect real superbike balance and lean behavior.

What would settle it

Running identical training runs in a higher-fidelity or real-world superbike platform and finding that SPDL no longer improves lap time or stability over SAC would falsify the central performance claim.
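One way to make such a comparison quantitative is a percentile bootstrap over per-seed lap times. The sketch below is a generic statistical helper, not something taken from the paper; the function name, the number of resamples, and the 95% level are assumptions, and the inputs would be whatever per-run lap times an experimenter actually records.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def bootstrap_mean_diff_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean(a) - mean(b), lower, upper) using a percentile bootstrap.

    The two groups are resampled independently with replacement.
    """
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = statistics.fmean(a) - statistics.fmean(b)
    return point, lower, upper
```

Called with the SAC lap times as `a` and the SPDL lap times as `b`, an interval that lies entirely above zero would support a lap-time advantage for SPDL on that platform, while an interval that straddles zero would not.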

Figures

Figures reproduced from arXiv: 2606.09236 by Carlo D'Eramo, Jacopo Essenziale, Luca Ghisi, Matteo Luperto.

**Figure 2.** The architecture of our system.

II. RELATED WORK. A. Reinforcement Learning for Autonomous Racing. The application of RL to autonomous racing has advanced rapidly. Fuchs et al. [1] achieved super-human lap times in Gran Turismo Sport using SAC, introducing a dense progress-based reward and kinetic-energy wall-contact penalty. Song et al. [8] extended this to overtaking via a manually designed three-stage cur…

**Figure 3.** (top) The trajectories followed by the motorbike, from the …

**Figure 4.** Two trajectories followed by the motorbike. In the top one, the bike …

Original abstract

Autonomous Racing has seen remarkable progress through deep Reinforcement Learning (RL), primarily for four-wheeled vehicles. However, motorbikes introduce substantially greater complexity due to the need to manage balance and lean angle, in addition to more reactive steering and throttle control, and a smaller weight. In this work, we present a framework for training an autonomous agent to race a superbike in VRider SBK, a physics-accurate Unity-based motorbike simulator. Our approach integrates Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL), which dynamically generates progressively more challenging tasks based on the agent's performance, without requiring manual curriculum design. The agent's state space comprises proprioceptive features extended with lean-angle history, along with global track features via course points. The reward signal is shaped to encourage progress along the track while penalizing instability-inducing behaviors specific to two-wheeled dynamics. Preliminary experimental results demonstrate that SPDL outperforms SAC alone in training efficiency, lap time, and driving stability across multiple tracks and motorbike models, establishing a first baseline for RL-based autonomous motorbike racing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Applies SAC plus self-paced curriculum to superbike sim racing and claims a first baseline, but supplies no numbers and leaves simulator fidelity unchecked.

This paper takes the established SAC algorithm and layers self-paced curriculum learning on top to train agents inside a Unity-based superbike simulator. The central claim is that the curriculum version improves training speed, lap times, and stability over plain SAC across several tracks and bike models, and that this constitutes the first RL baseline for autonomous motorbike racing.

The setup shows some care in the domain specifics. The state includes lean-angle history, the reward penalizes instability tied to two-wheeled physics, and the curriculum generates harder tasks automatically as performance improves. These choices make sense for the added complexity of balance and reactive steering that cars do not have.

The evidence is thin. The abstract labels the results preliminary and gives no quantitative values, no ablation tables, no run counts, and no error bars, so it is impossible to judge the size or reliability of the reported gains. The larger gap is simulator validation: the work calls the model physics-accurate yet provides no comparison of simulated lean dynamics, steering response, or weight transfer against real telemetry. Without that check, outperformance inside the sim could be an artifact rather than transferable progress.

The paper is aimed at researchers extending RL vehicle control beyond four wheels. It is worth a reading group discussion for the domain shift alone. It deserves peer review because the application area is new and the framing is straightforward, but referees will need to see the missing metrics and at least basic sim-to-real grounding before any stronger claims can stand.

Referee Report

2 major / 0 minor

Summary. The paper presents a framework for autonomous superbike racing in the VRider SBK Unity simulator that combines Soft Actor-Critic (SAC) with Self-Paced curriculum Deep reinforcement Learning (SPDL). The state includes proprioceptive features augmented by lean-angle history plus global track course points; the reward encourages forward progress while penalizing two-wheeled instability. The central claim is that SPDL yields better training efficiency, lap times, and stability than plain SAC across multiple tracks and motorbike models, thereby establishing a first baseline for RL-based motorbike racing.

Significance. If the reported performance gains are reproducible and the simulator dynamics are shown to be faithful to real superbikes, the work would supply a useful initial benchmark in an underexplored domain. The self-paced curriculum mechanism, which generates tasks automatically from agent performance, is a practical contribution that avoids hand-crafted curricula. The explicit inclusion of lean-angle history in the observation is a domain-appropriate design choice.

major comments (2)

[Abstract] The assertion that the VRider SBK simulator is 'physics-accurate' is load-bearing for the claim that the SPDL-vs-SAC comparison establishes a meaningful baseline, yet the manuscript supplies no quantitative validation (e.g., lean-angle time-series correlation, steering-response matching, or comparison against real telemetry).

[Abstract] The central empirical claim that 'SPDL outperforms SAC alone in training efficiency, lap time, and driving stability' is stated without any numerical metrics, error bars, statistical tests, ablation tables, or learning-curve figures, rendering the strength of the evidence impossible to assess.
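The kind of quantitative validation the first comment asks for can be illustrated with a small helper that scores agreement between a simulated lean-angle trace and a recorded one. This is a sketch under stated assumptions: the traces are taken to be already time-aligned and sampled at the same rate, the function name `trace_agreement` is hypothetical, and Pearson correlation plus RMSE are just two common choices, not metrics the paper reports.

```python
import math
from typing import Sequence, Tuple


def trace_agreement(sim: Sequence[float], real: Sequence[float]) -> Tuple[float, float]:
    """Return (Pearson r, RMSE) for two equal-length, time-aligned traces."""
    if len(sim) != len(real) or len(sim) < 2:
        raise ValueError("traces must have equal length of at least 2")
    n = len(sim)
    mean_s = sum(sim) / n
    mean_r = sum(real) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in r.
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(sim, real))
    var_s = sum((s - mean_s) ** 2 for s in sim)
    var_r = sum((r - mean_r) ** 2 for r in real)
    if var_s == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant trace")
    pearson = cov / math.sqrt(var_s * var_r)
    rmse = math.sqrt(sum((s - r) ** 2 for s, r in zip(sim, real)) / n)
    return pearson, rmse
```

The two numbers answer different questions: correlation captures whether the simulated lean follows the same shape through a corner, while RMSE captures how far off the absolute angles are, so a constant offset yields a perfect correlation with a non-zero RMSE.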

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We address each major point below and will revise the manuscript to strengthen the presentation of claims.

Referee: [Abstract] The assertion that the VRider SBK simulator is 'physics-accurate' is load-bearing for the claim that the SPDL-vs-SAC comparison establishes a meaningful baseline, yet the manuscript supplies no quantitative validation (e.g., lean-angle time-series correlation, steering-response matching, or comparison against real telemetry).

Authors: We agree that the unqualified term 'physics-accurate' is not supported by quantitative evidence in the manuscript. The work is intended as a simulation baseline rather than a claim of real-world transfer. In the revised manuscript we will replace the phrase with 'Unity-based motorbike simulator incorporating two-wheeled dynamics' and add a short paragraph in Section 3 describing the simulator's modeling assumptions and known limitations.

Revision: yes

Referee: [Abstract] The central empirical claim that 'SPDL outperforms SAC alone in training efficiency, lap time, and driving stability' is stated without any numerical metrics, error bars, statistical tests, ablation tables, or learning-curve figures, rendering the strength of the evidence impossible to assess.

Authors: The abstract follows the conventional practice of summarizing results at a high level while deferring quantitative details to the body of the paper. However, we accept that the current wording makes the strength of the evidence difficult to judge from the abstract alone. We will revise the abstract to include concise numerical highlights (e.g., relative lap-time reduction and training-step savings) together with explicit references to the learning-curve figures and statistical tables already present in Section 5.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical RL comparison with no derivation chain

The paper reports experimental results from training SAC and SPDL agents in the VRider SBK simulator and compares their training efficiency, lap times, and stability. No equations, first-principles derivations, or predictions are presented that could reduce to fitted parameters or self-referential definitions. The core claim rests on direct simulation runs rather than any analytical chain. Self-citations, if present for the SPDL method, are not load-bearing for the reported performance deltas, which are measured independently. This matches the default case of an empirical paper whose results are falsifiable outside any internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Ledger is necessarily incomplete because only the abstract is available; no free parameters, axioms, or invented entities are explicitly quantified in the provided text.

axioms (1)

Domain assumption: The reward signal shaped to encourage track progress while penalizing instability-inducing behaviors is appropriate for two-wheeled dynamics. Stated directly in the abstract as part of the method.

Reference graph

Works this paper leans on

12 extracted references · 2 canonical work pages · 1 internal anchor

[1] F. Fuchs, Y. Song, E. Kaufmann, D. Scaramuzza, and P. Dürr, “Super-human performance in Gran Turismo Sport using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4257–4264, 2021.

[2] P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, and F. Fuchs, “Outracing champion Gran Turismo drivers with deep reinforcement learning,” Nature, vol. 602, no. 7896, pp. 223–228, 2022.

[3] M. Vasco, T. Seno, K. Kawamoto, K. Subramanian, P. R. Wurman, and P. Stone, “A super-human vision-based reinforcement learning agent for autonomous racing in Gran Turismo,” in Proceedings of the 2024 Reinforcement Learning Conference (RLC), 2024.

[4] H. Lee, T. Seno, J. J. Tai, K. Subramanian, K. Kawamoto, P. Stone, and P. R. Wurman, “A champion-level vision-based reinforcement learning agent for competitive racing in Gran Turismo 7,” IEEE Robotics and Automation Letters (RA-L), 2025.

[5] P. Klink, C. D’Eramo, J. R. Peters, and J. Pajarinen, “Self-paced deep reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 9216–9227, 2020.

[6] P. Klink, H. Abdulsamad, B. Belousov, C. D’Eramo, J. Peters, and J. Pajarinen, “A probabilistic interpretation of self-paced learning with applications to reinforcement learning,” Journal of Machine Learning Research, vol. 22, no. 182, pp. 1–52, 2021.

[7] P. Klink, H. Yang, C. D’Eramo, J. Peters, and J. Pajarinen, “Curriculum reinforcement learning via constrained optimal transport,” in Proceedings of the International Conference on Machine Learning (ICML). PMLR, 2022, pp. 11341–11358.

[8] Y. Song, H. Lin, E. Kaufmann, P. Dürr, and D. Scaramuzza, “Autonomous overtaking in Gran Turismo Sport using curriculum reinforcement learning,” in Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 9403–9409.

[9] B. Grooten, P. MacAlpine, K. Subramanian, P. Stone, and P. R. Wurman, “Out-of-distribution generalization with a sparc: Racing 100 unseen vehicles with a single policy,” in Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2026.

[10] A. Juliani, V.-P. Berges, E. Teng, A. Cohen, J. Harper, C. Elion, C. Goy, Y. Gao, H. Henry, M. Mattar, and D. Lange, “Unity: A general platform for intelligent agents,” arXiv preprint arXiv:1809.02627, 2020. [Online]. Available: https://arxiv.org/pdf/1809.02627.pdf

[11] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.

[12] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
