arxiv: 2603.15013 · v2 · submitted 2026-03-16 · 💻 cs.RO

CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control

Pith reviewed 2026-05-15 10:41 UTC · model grok-4.3

classification 💻 cs.RO
keywords reinforcement learning · sim-to-real transfer · autonomous bicycle · domain randomization · PPO · robot control · underactuated systems · Isaac Sim

The pith

CycleRL trains a PPO policy in simulation that transfers directly to physical bicycle hardware for balance and tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops CycleRL as a sim-to-real framework that learns bicycle control through deep reinforcement learning rather than explicit modeling. It uses Proximal Policy Optimization inside a high-fidelity simulator, combined with systematic domain randomization, to create a policy that maps raw perception to steering and velocity actions. A composite reward encourages simultaneous balance, heading accuracy, and speed tracking. If the approach holds, it shows that randomization over simulation parameters can close the reality gap enough for zero-shot deployment on real hardware. A sympathetic reader would care because this could make underactuated vehicles like bicycles practical for autonomous urban tasks where traditional controllers falter on model errors and disturbances.

Core claim

CycleRL establishes a direct perception-to-action policy for autonomous bicycle control by training with PPO in NVIDIA Isaac Sim. Systematic domain randomization reduces dependence on precise dynamics models and enables transfer to hardware. In simulation the policy reaches 99.90 percent balance success, 1.15 degree heading error, and 0.18 m/s velocity error; the same policy succeeds on physical hardware and demonstrates greater adaptability than conventional methods.
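
For readers unfamiliar with PPO, the clipped surrogate objective from the original PPO paper (arXiv:1707.06347) can be sketched in a few lines. This is a generic illustration, not the CycleRL training code: the function names are hypothetical, and the clip range `eps=0.2` is an assumed default, since the summary above does not report the hyperparameters used.

```python
import math


def ppo_clip_term(logp_new: float, logp_old: float, advantage: float,
                  eps: float = 0.2) -> float:
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r is the probability ratio between the new and old policy for the
    sampled action, computed from log-probabilities for numerical stability.
    eps is an assumed clip range, not a value taken from the paper.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_clip_objective(logp_new, logp_old, advantages, eps: float = 0.2) -> float:
    """Average the clipped surrogate over a batch (the quantity PPO maximizes)."""
    terms = [ppo_clip_term(n, o, a, eps)
             for n, o, a in zip(logp_new, logp_old, advantages)]
    return sum(terms) / len(terms)
```

The clipping removes the incentive to move the probability ratio far from 1 in a single update, which is the property that makes PPO a common choice for simulation-heavy robot training.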

What carries the argument

A PPO policy trained with a composite reward and systematic domain randomization, which learns a perception-to-action mapping while covering real-world parameter variations.
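
A composite reward of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the Gaussian-shaped terms, the scale constants, the weights, and the function names are hypothetical, because the summary does not give the paper's reward formula.

```python
import math


def wrap_angle(angle_rad: float) -> float:
    """Wrap an angle to [-pi, pi) so heading errors take the short way round."""
    return (angle_rad + math.pi) % (2.0 * math.pi) - math.pi


def composite_reward(roll: float, heading: float, target_heading: float,
                     speed: float, target_speed: float,
                     w_balance: float = 1.0, w_heading: float = 0.5,
                     w_speed: float = 0.5) -> float:
    """Weighted sum of balance, heading, and speed terms (hypothetical shapes).

    Each term equals 1 at zero error and decays smoothly toward 0.
    The scales 0.2 rad, 0.3 rad, and 0.5 m/s are placeholder values.
    """
    r_balance = math.exp(-(roll / 0.2) ** 2)
    heading_err = wrap_angle(heading - target_heading)
    r_heading = math.exp(-(heading_err / 0.3) ** 2)
    r_speed = math.exp(-((speed - target_speed) / 0.5) ** 2)
    return w_balance * r_balance + w_heading * r_heading + w_speed * r_speed
```

The three weights are the hand-tuned coefficients whose sensitivity the referee report asks about, so in practice they would be the first values to ablate.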

If this is right

  • The learned policy can be deployed on hardware with no additional fine-tuning.
  • DRL provides better robustness to model mismatch than traditional controllers for underactuated nonlinear systems.
  • Autonomous bicycles become feasible for urban mobility and logistics applications.
  • The framework validates end-to-end learning for concurrent balance, velocity, and steering objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same randomization-plus-PPO recipe may apply to other underactuated platforms such as motorcycles or single-wheel robots.
  • Adding perception for obstacle avoidance could turn the current balance controller into a full navigation system.
  • Performance limits would appear in regimes the randomization never sampled, such as very low speeds or steep slopes.

Load-bearing premise

Systematic domain randomization over a limited set of simulation parameters is sufficient to cover all real-world uncertainties and enable zero-shot transfer to physical hardware without further adaptation.
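
As a minimal sketch of what per-episode domain randomization involves, the snippet below draws each simulation parameter uniformly from a fixed range. The parameter names and numerical ranges are placeholders chosen for illustration; the paper's actual ranges are not reported in this summary, and the sketch does not use any Isaac Sim API.

```python
import random

# Hypothetical ranges, NOT the values used in CycleRL.
RANDOMIZATION_RANGES = {
    "mass_scale": (0.8, 1.2),        # multiplier on nominal mass
    "friction_coeff": (0.5, 1.0),    # tire-ground friction coefficient
    "sensor_noise_std": (0.0, 0.02), # std. dev. of additive sensor noise
}


def sample_episode_params(ranges, rng):
    """Draw one value per parameter, uniformly within its (low, high) range."""
    return {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}


rng = random.Random(0)  # seeded for reproducibility
params = sample_episode_params(RANDOMIZATION_RANGES, rng)
```

A policy trained this way only ever sees parameter values inside these intervals, which is why the choice of ranges carries the premise stated above.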

What would settle it

A physical bicycle deployment fails to maintain balance or track headings when exposed to wind, friction, or mass variations outside the randomized ranges used in simulation.
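
Designing such a test starts with identifying which real-world conditions lie outside what training covered. The helper below is a hypothetical sketch of that check; the parameter names and ranges are illustrative placeholders, not values from the paper.

```python
def uncovered_parameters(measured, training_ranges):
    """Return names of measured parameters that training never covered.

    A parameter counts as uncovered if it was not randomized at all or if
    its measured value lies outside the randomized (low, high) interval.
    """
    uncovered = []
    for name, value in measured.items():
        if name not in training_ranges:
            uncovered.append(name)
            continue
        low, high = training_ranges[name]
        if not (low <= value <= high):
            uncovered.append(name)
    return uncovered


# Placeholder values for illustration only.
TRAINING_RANGES = {"mass_scale": (0.8, 1.2), "friction_coeff": (0.5, 1.0)}
measured = {"mass_scale": 1.1, "friction_coeff": 0.3, "wind_force_n": 5.0}
print(uncovered_parameters(measured, TRAINING_RANGES))
```

Conditions flagged by a check like this are the ones where a hardware failure would count against the zero-shot transfer claim.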

Figures

Figures reproduced from arXiv: 2603.15013 by Gelu Liu, Junliang Wu, Songyuan Li, Teng Wang, Xiangwei Zhu, Zhijie Wu.

Figure 1: Real-world deployment of the proposed autonomous…
Figure 3: Illustration of the reward function design for balancing…
Figure 4: Bicycle and terrain modeling in Isaac Sim.
Figure 5: Training curves and convergence analysis.
Figure 6: Sensitivity Analysis of Reward Weights. As illustrated in…
Figure 7: Construction of hardware platform. The mechanical design emphasizes modularity and reliability, utilizing off-the-shelf components to ensure reproducibility. Computationally, the NVIDIA Jetson board provides sufficient processing power for real-time neural network inference while maintaining power efficiency for extended operation. 2) Validation of Real-World Deployment: Real-world validation was condu…
Original abstract

Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics. However, conventional control strategies often struggle with underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, we develop CycleRL, a comprehensive sim-to-real framework for robust autonomous bicycle control. Our approach establishes a direct perception-to-action mapping within the high-fidelity NVIDIA Isaac Sim environment, leveraging Proximal Policy Optimization (PPO) to optimize the control policy. The framework features a composite reward function tailored for concurrent balance maintenance, velocity tracking, and steering control. Crucially, systematic domain randomization is employed to reduce the reliance on precise system modeling, bridge the simulation-to-reality gap and facilitate direct transfer. In simulation, CycleRL achieves promising performance, including a 99.90% balance success rate, a heading tracking error of 1.15{\deg}, and a velocity tracking error of 0.18 m/s. These quantitative results, coupled with successful hardware deployment, validate DRL as an effective paradigm for autonomous bicycle control, offering superior adaptability over traditional methods. Video demonstrations are available at https://anony6f05.github.io/CycleRL/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CycleRL, a sim-to-real deep reinforcement learning framework for autonomous bicycle control. It employs Proximal Policy Optimization (PPO) within the NVIDIA Isaac Sim environment to learn a direct perception-to-action policy, using a composite reward function for simultaneous balance maintenance, velocity tracking, and steering control. Systematic domain randomization is applied to mitigate model mismatches and enable zero-shot transfer to physical hardware. Simulation results report a 99.90% balance success rate, 1.15° heading tracking error, and 0.18 m/s velocity tracking error, with the work claiming successful hardware deployment that demonstrates superior adaptability compared to traditional control methods.
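
To make the three reported quantities concrete, the sketch below shows one plausible way to compute them from logged data. The definitions are assumptions: the summary does not say whether the paper uses mean absolute error, RMS error, or another statistic, and the function names are hypothetical.

```python
def wrap_deg(angle_deg: float) -> float:
    """Wrap an angle difference to [-180, 180) degrees."""
    return (angle_deg + 180.0) % 360.0 - 180.0


def balance_success_rate(fell_flags) -> float:
    """Percentage of episodes that ended without a fall."""
    successes = sum(1 for fell in fell_flags if not fell)
    return 100.0 * successes / len(fell_flags)


def mean_abs_heading_error_deg(headings, targets) -> float:
    """Mean absolute heading error in degrees, with wrap-around handled."""
    errors = [abs(wrap_deg(h - t)) for h, t in zip(headings, targets)]
    return sum(errors) / len(errors)


def mean_abs_velocity_error(speeds, targets) -> float:
    """Mean absolute speed tracking error in m/s."""
    errors = [abs(s - t) for s, t in zip(speeds, targets)]
    return sum(errors) / len(errors)
```

Applying the same functions to both simulation and hardware logs would give the side-by-side comparison that the referee report requests.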

Significance. If the domain randomization and zero-shot transfer claims are substantiated with detailed parameter ranges and hardware metrics, the work would provide concrete evidence that DRL can robustly handle the underactuated nonlinear dynamics of bicycles in uncertain real-world conditions. This could advance practical applications in agile robotics and last-mile logistics by offering greater adaptability than model-based controllers. The quantitative simulation metrics and availability of video demonstrations offer a useful benchmark for the field.

major comments (3)
  1. [Abstract] Abstract: The central claim of successful hardware deployment validating superior DRL adaptability rests on zero-shot transfer via domain randomization, yet the abstract (and by extension the experimental reporting) provides no quantitative hardware metrics such as balance success rate or tracking errors on the physical platform. This omission is load-bearing because it prevents direct evaluation of the sim-to-real performance gap.
  2. [Experimental Setup] Experimental Setup (domain randomization description): No explicit list of randomized parameters (mass, friction, disturbances, sensor noise) or their numerical ranges is given, nor is there justification or sensitivity analysis for these choices. This directly undermines assessment of whether the randomization sufficiently covers real-world uncertainties, as required by the weakest assumption in the sim-to-real claim.
  3. [Results] Results section: The simulation performance numbers (99.90% success, 1.15° heading error, 0.18 m/s velocity error) are presented without baselines, ablations on reward weights, or statistical details on training variability. This is load-bearing for the superiority claim over traditional methods, as the reported metrics cannot be contextualized without these comparisons.
minor comments (2)
  1. [Abstract] Abstract: The notation '1.15{\deg}' uses an escaped degree symbol; ensure consistent rendering of units (e.g., ° or deg) across all sections and figures.
  2. Consider adding a dedicated table or subsection that directly compares simulation versus hardware quantitative results to strengthen the transfer validation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and describe the revisions we will make to strengthen the presentation of the sim-to-real results.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim of successful hardware deployment validating superior DRL adaptability rests on zero-shot transfer via domain randomization, yet the abstract (and by extension the experimental reporting) provides no quantitative hardware metrics such as balance success rate or tracking errors on the physical platform. This omission is load-bearing because it prevents direct evaluation of the sim-to-real performance gap.

    Authors: We agree that quantitative hardware metrics would allow readers to directly assess the sim-to-real gap. The current abstract and results emphasize simulation performance while noting successful hardware deployment (supported by the linked video demonstrations). In the revised manuscript we will update the abstract and add a dedicated hardware results subsection that reports the corresponding balance success rate, heading error, and velocity error measured on the physical platform.

    revision: yes

  2. Referee: [Experimental Setup] Experimental Setup (domain randomization description): No explicit list of randomized parameters (mass, friction, disturbances, sensor noise) or their numerical ranges is given, nor is there justification or sensitivity analysis for these choices. This directly undermines assessment of whether the randomization sufficiently covers real-world uncertainties, as required by the weakest assumption in the sim-to-real claim.

    Authors: We acknowledge that the manuscript describes domain randomization at a high level without the requested parameter details. To address this, the revised version will include a table enumerating all randomized parameters (mass, friction coefficients, sensor noise, external disturbances, etc.) together with their numerical ranges. We will also add a short justification based on our hardware characterization and a sensitivity analysis showing policy robustness across the chosen ranges.

    revision: yes

  3. Referee: [Results] Results section: The simulation performance numbers (99.90% success, 1.15° heading error, 0.18 m/s velocity error) are presented without baselines, ablations on reward weights, or statistical details on training variability. This is load-bearing for the superiority claim over traditional methods, as the reported metrics cannot be contextualized without these comparisons.

    Authors: We agree that additional context is needed to substantiate the superiority claim. In the revised results section we will add (i) baseline comparisons against tuned PID and LQR controllers on the same simulation tasks, (ii) ablations that vary the relative weights of the balance, velocity, and steering reward terms, and (iii) statistical summaries (mean and standard deviation) of the reported metrics across multiple independent training seeds.

    revision: yes
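
The seed-level statistics promised in the third response amount to a small computation, sketched below with Python's standard library. The function name is hypothetical and the numbers in the example are placeholders, not results from the paper.

```python
import statistics


def summarize_across_seeds(metric_per_seed):
    """Return (mean, sample standard deviation) of a metric over training seeds.

    statistics.stdev uses the n - 1 denominator, which is the usual choice
    when the seeds are a sample of possible training runs.
    """
    return statistics.mean(metric_per_seed), statistics.stdev(metric_per_seed)


# Placeholder heading errors in degrees for three seeds (illustrative only).
mean_err, std_err = summarize_across_seeds([1.0, 2.0, 3.0])
print(f"heading error: {mean_err:.2f} +/- {std_err:.2f} deg")
```

Reporting the spread alongside the mean indicates whether a headline figure such as a 1.15° heading error is typical of the method or specific to one training run.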

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper trains a PPO policy in NVIDIA Isaac Sim using a composite reward for balance, velocity, and steering, then applies systematic domain randomization for sim-to-real transfer. Reported metrics (99.90% balance success, 1.15° heading error, 0.18 m/s velocity error) and hardware deployment are direct empirical outputs of the optimization and physical validation, not quantities defined by or reduced to the same fitted parameters. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or description. The chain is self-contained against external simulator and hardware benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The claim depends on the fidelity of the Isaac Sim bicycle model and on the coverage of the chosen randomization ranges; both are chosen by the authors rather than derived from first principles.

free parameters (2)
  • composite reward weights
    Balance, velocity, and steering terms must be scaled by hand-tuned coefficients to produce the reported behavior.
  • domain randomization ranges
    Bounds on mass, friction, and sensor noise are selected to bridge the reality gap but are not derived from measurements.
axioms (1)
  • domain assumption: NVIDIA Isaac Sim supplies sufficiently accurate rigid-body and contact dynamics for the bicycle once parameters are randomized.
    The entire sim-to-real pipeline rests on this modeling assumption.

pith-pipeline@v0.9.0 · 5527 in / 1361 out tokens · 54027 ms · 2026-05-15T10:41:17.457744+00:00 · methodology
