CycleRL: Sim-to-Real Deep Reinforcement Learning for Robust Autonomous Bicycle Control
Pith reviewed 2026-05-15 10:41 UTC · model grok-4.3
The pith
CycleRL trains a PPO policy in simulation that transfers directly to physical bicycle hardware for balance and tracking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CycleRL establishes a direct perception-to-action policy for autonomous bicycle control by training with PPO in NVIDIA Isaac Sim. Systematic domain randomization reduces dependence on precise dynamics models and enables transfer to hardware. In simulation the policy reaches 99.90 percent balance success, 1.15 degree heading error, and 0.18 m/s velocity error; the same policy succeeds on physical hardware and demonstrates greater adaptability than conventional methods.
What carries the argument
A PPO policy trained with a composite reward under systematic domain randomization, which learns a perception-to-action mapping while covering real-world parameter variations.
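The paper passage quoted further down this page gives the reward as a weighted sum of survival, velocity, heading, action, and action-rate terms. A minimal sketch of that structure follows. The weight values and the shape of every individual term (exponential tracking bonuses, quadratic penalties) are assumptions made for illustration, since this page does not state them; `RewardWeights` and `composite_reward` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper's actual values are not given on this page.
    surv: float = 1.0
    vel: float = 0.5
    head: float = 0.5
    act: float = 0.01
    rate: float = 0.01


def composite_reward(
    upright: bool,
    vel_err: float,
    head_err: float,
    action: Sequence[float],
    prev_action: Sequence[float],
    w: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of five terms; every term shape below is an assumption."""
    r_surv = 1.0 if upright else 0.0            # bonus for staying upright
    r_vel = math.exp(-vel_err ** 2)             # velocity tracking, in (0, 1]
    r_head = math.exp(-head_err ** 2)           # heading tracking, in (0, 1]
    r_act = -sum(a * a for a in action)         # penalty on action magnitude
    r_rate = -sum((a - b) ** 2 for a, b in zip(action, prev_action))  # penalty on action change
    return (w.surv * r_surv + w.vel * r_vel + w.head * r_head
            + w.act * r_act + w.rate * r_rate)
```

With zero tracking error and zero actions the sketch returns the sum of the survival, velocity, and heading weights; the two penalty terms can only lower that value.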
If this is right
- The learned policy can be deployed on hardware with no additional fine-tuning.
- DRL provides better robustness to model mismatch than traditional controllers for underactuated nonlinear systems.
- Autonomous bicycles become feasible for urban mobility and logistics applications.
- The framework validates end-to-end learning for concurrent balance, velocity, and steering objectives.
Where Pith is reading between the lines
- The same randomization-plus-PPO recipe may apply to other underactuated platforms such as motorcycles or single-wheel robots.
- Adding perception for obstacle avoidance could turn the current balance controller into a full navigation system.
- Performance limits would appear in regimes the randomization never sampled, such as very low speeds or steep slopes.
Load-bearing premise
Systematic domain randomization over a limited set of simulation parameters is sufficient to cover all real-world uncertainties and enable zero-shot transfer to physical hardware without further adaptation.
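To make the premise concrete, per-episode domain randomization can be sketched as drawing each simulation parameter from a fixed range at reset. The parameter names and ranges below are hypothetical placeholders, not values from the paper, and uniform sampling is an assumption.

```python
import random

# Hypothetical (low, high) ranges; the paper's actual parameters and ranges
# are not listed on this page.
RANGES = {
    "mass_scale": (0.8, 1.2),
    "tire_friction": (0.6, 1.2),
    "imu_noise_std": (0.0, 0.02),
    "push_force_n": (0.0, 20.0),
}


def sample_episode_params(rng: random.Random, ranges=RANGES) -> dict:
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


# One draw per training episode; a seeded generator keeps runs reproducible.
params = sample_episode_params(random.Random(0))
```

Anything that is not a key in `RANGES` is never varied during training, which is exactly the coverage question the premise raises.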
What would settle it
A physical bicycle deployment fails to maintain balance or track headings when exposed to wind, friction, or mass variations outside the randomized ranges used in simulation.
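One way to set up such a test is to compare conditions measured on hardware against the ranges used in training and flag anything outside them. The sketch below assumes both are available as simple dictionaries; the function name, parameter names, and numbers are hypothetical.

```python
def out_of_range(measured: dict, ranges: dict) -> list:
    """Return names of measured parameters that fall outside the training
    range, or that were never randomized at all."""
    flagged = []
    for name, value in measured.items():
        if name not in ranges:
            flagged.append(name)      # never randomized in simulation
            continue
        lo, hi = ranges[name]
        if not lo <= value <= hi:
            flagged.append(name)      # randomized, but the real value is outside
    return flagged


# Hypothetical training ranges and field measurements, for illustration only.
train_ranges = {"mass_scale": (0.8, 1.2), "tire_friction": (0.6, 1.2)}
field = {"mass_scale": 1.1, "tire_friction": 0.4, "wind_speed_mps": 6.0}
flagged = out_of_range(field, train_ranges)
```

Trials run under the flagged conditions are the ones that would probe the zero-shot claim.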
Original abstract
Autonomous bicycles offer a promising agile solution for urban mobility and last-mile logistics. However, conventional control strategies often struggle with underactuated nonlinear dynamics, suffering from sensitivity to model mismatches and limited adaptability to real-world uncertainties. To address this, we develop CycleRL, a comprehensive sim-to-real framework for robust autonomous bicycle control. Our approach establishes a direct perception-to-action mapping within the high-fidelity NVIDIA Isaac Sim environment, leveraging Proximal Policy Optimization (PPO) to optimize the control policy. The framework features a composite reward function tailored for concurrent balance maintenance, velocity tracking, and steering control. Crucially, systematic domain randomization is employed to reduce the reliance on precise system modeling, bridge the simulation-to-reality gap and facilitate direct transfer. In simulation, CycleRL achieves promising performance, including a 99.90% balance success rate, a heading tracking error of 1.15{\deg}, and a velocity tracking error of 0.18 m/s. These quantitative results, coupled with successful hardware deployment, validate DRL as an effective paradigm for autonomous bicycle control, offering superior adaptability over traditional methods. Video demonstrations are available at https://anony6f05.github.io/CycleRL/.
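For readers unfamiliar with PPO, the clipped surrogate objective from the original PPO paper (Schulman et al., 2017) is reproduced below. The probability ratio is written as \rho_t here to keep it distinct from reward terms. Which PPO variant and which clipping parameter \epsilon the authors use is not stated on this page.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\!\left[
      \min\!\Big(
        \rho_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t
      \Big)
    \right],
\qquad
\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Here \hat{A}_t is the advantage estimate at step t. The clip removes any incentive to move the new policy's action probabilities far from the old policy's within a single update.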
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CycleRL, a sim-to-real deep reinforcement learning framework for autonomous bicycle control. It employs Proximal Policy Optimization (PPO) within the NVIDIA Isaac Sim environment to learn a direct perception-to-action policy, using a composite reward function for simultaneous balance maintenance, velocity tracking, and steering control. Systematic domain randomization is applied to mitigate model mismatches and enable zero-shot transfer to physical hardware. Simulation results report a 99.90% balance success rate, 1.15° heading tracking error, and 0.18 m/s velocity tracking error, with the work claiming successful hardware deployment that demonstrates superior adaptability compared to traditional control methods.
Significance. If the domain randomization and zero-shot transfer claims are substantiated with detailed parameter ranges and hardware metrics, the work would provide concrete evidence that DRL can robustly handle the underactuated nonlinear dynamics of bicycles in uncertain real-world conditions. This could advance practical applications in agile robotics and last-mile logistics by offering greater adaptability than model-based controllers. The quantitative simulation metrics and availability of video demonstrations offer a useful benchmark for the field.
major comments (3)
- [Abstract] Abstract: The central claim of successful hardware deployment validating superior DRL adaptability rests on zero-shot transfer via domain randomization, yet the abstract (and by extension the experimental reporting) provides no quantitative hardware metrics such as balance success rate or tracking errors on the physical platform. This omission is load-bearing because it prevents direct evaluation of the sim-to-real performance gap.
- [Experimental Setup] Experimental Setup (domain randomization description): No explicit list of randomized parameters (mass, friction, disturbances, sensor noise) or their numerical ranges is given, nor is there justification or sensitivity analysis for these choices. This directly undermines assessment of whether the randomization sufficiently covers real-world uncertainties, as required by the weakest assumption in the sim-to-real claim.
- [Results] Results section: The simulation performance numbers (99.90% success, 1.15° heading error, 0.18 m/s velocity error) are presented without baselines, ablations on reward weights, or statistical details on training variability. This is load-bearing for the superiority claim over traditional methods, as the reported metrics cannot be contextualized without these comparisons.
minor comments (2)
- [Abstract] Abstract: The notation '1.15{\deg}' uses an escaped degree symbol; ensure consistent rendering of units (e.g., ° or deg) across all sections and figures.
- Consider adding a dedicated table or subsection that directly compares simulation versus hardware quantitative results to strengthen the transfer validation.
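The three headline metrics could be computed identically for simulation and hardware logs, which would make such a comparison table straightforward. The sketch below assumes mean absolute errors and a per-episode fall flag; the paper's exact metric definitions are not given on this page, and all function names are hypothetical.

```python
def wrap_angle_deg(angle: float) -> float:
    """Wrap an angle in degrees into [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def balance_success_rate(fell) -> float:
    """Percentage of episodes that ended without a fall."""
    return 100.0 * sum(not f for f in fell) / len(fell)


def mean_abs_heading_error_deg(target, actual) -> float:
    """Mean absolute heading error, wrapping so that 350 vs 10 counts as 20."""
    errs = [abs(wrap_angle_deg(t - a)) for t, a in zip(target, actual)]
    return sum(errs) / len(errs)


def mean_abs_velocity_error(target, actual) -> float:
    """Mean absolute velocity error, in the units of the inputs (e.g. m/s)."""
    errs = [abs(t - a) for t, a in zip(target, actual)]
    return sum(errs) / len(errs)
```

The angle wrap matters for heading: without it, a command of 350° tracked at 10° would be scored as a 340° error rather than 20°.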
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and describe the revisions we will make to strengthen the presentation of the sim-to-real results.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim of successful hardware deployment validating superior DRL adaptability rests on zero-shot transfer via domain randomization, yet the abstract (and by extension the experimental reporting) provides no quantitative hardware metrics such as balance success rate or tracking errors on the physical platform. This omission is load-bearing because it prevents direct evaluation of the sim-to-real performance gap.
  Authors: We agree that quantitative hardware metrics would allow readers to directly assess the sim-to-real gap. The current abstract and results emphasize simulation performance while noting successful hardware deployment (supported by the linked video demonstrations). In the revised manuscript we will update the abstract and add a dedicated hardware results subsection that reports the corresponding balance success rate, heading error, and velocity error measured on the physical platform.
  Revision: yes
- Referee: [Experimental Setup] Experimental Setup (domain randomization description): No explicit list of randomized parameters (mass, friction, disturbances, sensor noise) or their numerical ranges is given, nor is there justification or sensitivity analysis for these choices. This directly undermines assessment of whether the randomization sufficiently covers real-world uncertainties, as required by the weakest assumption in the sim-to-real claim.
  Authors: We acknowledge that the manuscript describes domain randomization at a high level without the requested parameter details. To address this, the revised version will include a table enumerating all randomized parameters (mass, friction coefficients, sensor noise, external disturbances, etc.) together with their numerical ranges. We will also add a short justification based on our hardware characterization and a sensitivity analysis showing policy robustness across the chosen ranges.
  Revision: yes
- Referee: [Results] Results section: The simulation performance numbers (99.90% success, 1.15° heading error, 0.18 m/s velocity error) are presented without baselines, ablations on reward weights, or statistical details on training variability. This is load-bearing for the superiority claim over traditional methods, as the reported metrics cannot be contextualized without these comparisons.
  Authors: We agree that additional context is needed to substantiate the superiority claim. In the revised results section we will add (i) baseline comparisons against tuned PID and LQR controllers on the same simulation tasks, (ii) ablations that vary the relative weights of the balance, velocity, and steering reward terms, and (iii) statistical summaries (mean and standard deviation) of the reported metrics across multiple independent training seeds.
  Revision: yes
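The statistical summary promised in item (iii) amounts to reporting a mean and sample standard deviation over independently seeded training runs. A minimal sketch follows; the per-seed numbers are placeholders for illustration only, not results from the paper.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) across training seeds."""
    return mean(values), stdev(values)


# Placeholder per-seed heading errors in degrees; NOT data from the paper.
per_seed = [1.0, 1.2, 1.1, 1.3, 0.9]
m, s = summarize(per_seed)
print(f"heading error: {m:.2f} ± {s:.2f} deg over {len(per_seed)} seeds")
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when a handful of seeds stands in for the full distribution of training outcomes.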
Circularity Check
No circularity in derivation chain
full rationale
The paper trains a PPO policy in NVIDIA Isaac Sim using a composite reward for balance, velocity, and steering, then applies systematic domain randomization for sim-to-real transfer. Reported metrics (99.90% balance success, 1.15° heading error, 0.18 m/s velocity error) and hardware deployment are direct empirical outputs of the optimization and physical validation, not quantities defined by or reduced to the same fitted parameters. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or description. The chain is self-contained against external simulator and hardware benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- composite reward weights
- domain randomization ranges
axioms (1)
- domain assumption: NVIDIA Isaac Sim supplies sufficiently accurate rigid-body and contact dynamics for the bicycle once parameters are randomized.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: composite reward function ... $R_t = \lambda_{\text{surv}}\, r_{\text{surv}} + \lambda_{\text{vel}}\, r_{\text{vel}} + \lambda_{\text{head}}\, r_{\text{head}} + \lambda_{\text{act}}\, r_{\text{act}} + \lambda_{\text{rate}}\, r_{\text{rate}}$; Proximal Policy Optimization (PPO) ... domain randomization strategy
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: systematic domain randomization ... Dynamics Randomization (Physical Parameters) ... Initial State Randomization ... Task / Command Randomization
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.