Curriculum-based Sample Efficient Reinforcement Learning for Robust Stabilization of a Quadrotor

Akshit Saradagi; Fausto Mauricio Lagos Suarez; George Nikolakopoulos; Shruti Kotpaliwar; Vidya Sumathy

arxiv: 2501.18490 · v3 · submitted 2025-01-30 · 💻 cs.RO · cs.AI

Pith reviewed 2026-05-23 04:31 UTC · model grok-4.3

keywords: curriculum learning · reinforcement learning · quadrotor stabilization · sample efficiency · end-to-end control · robust stabilization · aerial robotics

The pith

A three-stage curriculum trains an end-to-end RL policy for quadrotor stabilization using far fewer samples than one-stage training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that decomposing the quadrotor stabilization task into sequential stages of increasing difficulty allows a reinforcement learning policy to learn robust position and yaw control from random starts while controlling motor RPMs directly. This matters because conventional single-stage end-to-end RL demands large numbers of samples and long training times to meet transient and steady-state requirements. The curriculum transfers knowledge across stages without altering the reward function or truncation rules. Simulation results in Gym-PyBullet-Drones show the curriculum policy reaches better performance and robustness in hovering, coupling, and velocity-robust phases, and succeeds in an inspection pose-tracking task.

Core claim

The central claim is that the proposed three-stage curriculum learning approach, which first teaches hovering, then translational-rotational coupling, and finally robustness to random non-zero initial velocities, produces an end-to-end policy that outperforms a conventionally trained one-stage policy on the same reward and hyperparameters while using substantially fewer samples and less convergence time.

What carries the argument

The three-stage curriculum decomposition that incrementally raises task complexity while transferring policy parameters from one stage to the next.
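
The parameter hand-off between stages can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Stage` fields, the numeric ranges, and the `dummy_train_stage` stand-in are all hypothetical, and only the stage order (hovering, coupling, velocity robustness) comes from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Params = List[float]


@dataclass(frozen=True)
class Stage:
    """One curriculum stage, described by how initial states are sampled."""
    name: str
    max_position_offset_m: float  # hypothetical range, not from the paper
    max_initial_speed_mps: float  # hypothetical range, not from the paper


# Stage order follows the paper; the numeric ranges are placeholders.
CURRICULUM = [
    Stage("hover", 0.1, 0.0),
    Stage("coupling", 1.0, 0.0),
    Stage("velocity_robustness", 1.0, 1.0),
]

TrainFn = Callable[[Stage, Optional[Params]], Params]


def run_curriculum(stages: Sequence[Stage], train_stage: TrainFn) -> Params:
    """Train stage by stage, warm-starting each stage from the previous one."""
    params: Optional[Params] = None  # None means "initialize from scratch"
    for stage in stages:
        params = train_stage(stage, params)
    if params is None:
        raise ValueError("curriculum has no stages")
    return params


def dummy_train_stage(stage: Stage, init: Optional[Params]) -> Params:
    """Stand-in for an RL training loop; it only shows that transfer happens."""
    start = [0.0, 0.0] if init is None else list(init)
    return [p + 1.0 for p in start]  # pretend each stage shifts the parameters


final_params = run_curriculum(CURRICULUM, dummy_train_stage)
```

In a real setup `train_stage` would wrap the RL algorithm and the simulator. The point of the sketch is only that the policy is never re-initialized between stages.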

If this is right

The curriculum policy achieves simultaneous position and yaw stabilization from random initial states while satisfying pre-specified transient and steady-state specs.
Training requires significantly fewer samples and shorter wall-clock time than single-stage RL.
The resulting policy performs robustly in an inspection pose-tracking scenario under varying initial conditions.
All validation occurs in the Gym-PyBullet-Drones simulator with direct motor-RPM actuation.
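
One way to make "transient and steady-state specs" concrete is to compute a settling time and a steady-state error from a logged error trace. The sketch below assumes a hypothetical position-error trace, tolerance band, and sampling period; the paper's actual specifications are not stated in the abstract.

```python
from typing import Optional, Sequence


def settling_time(errors: Sequence[float], dt: float, band: float) -> Optional[float]:
    """Return the first time after which |error| stays within `band`, else None."""
    last_violation = -1
    for i, e in enumerate(errors):
        if abs(e) > band:
            last_violation = i
    if last_violation == len(errors) - 1:
        return None  # still outside the band at the end of the trace
    return (last_violation + 1) * dt


def steady_state_error(errors: Sequence[float], tail: int) -> float:
    """Mean absolute error over the last `tail` samples."""
    window = errors[-tail:]
    return sum(abs(e) for e in window) / len(window)


# Hypothetical position-error trace in metres, sampled every 0.5 s.
trace = [1.0, 0.6, 0.3, 0.12, 0.04, 0.03, 0.02, 0.02]
t_settle = settling_time(trace, dt=0.5, band=0.05)
e_ss = steady_state_error(trace, tail=2)
```

A policy would satisfy the specs if `t_settle` and `e_ss` fall below the chosen limits for every sampled initial condition.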

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same staged decomposition could apply to other high-dimensional continuous control problems where end-to-end learning stalls on sample count.
Curriculum stages might allow reuse of the same reward design across related robotic platforms without per-task retuning.
If the transfer holds in hardware, the approach could lower the compute barrier for deploying learned quadrotor controllers on inspection missions.

Load-bearing premise

The three stages produce positive knowledge transfer without any retuning of the reward function or episode truncation rules between stages.

What would settle it

Run identical one-stage and three-stage trainings on the same simulator with the same reward, hyperparameters, and random seeds, then measure total samples required for each to reach the target stabilization metrics under random initial conditions.
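
A possible scoring rule for that experiment is "samples to threshold", averaged over seeds. The sketch below is an assumption about how such a comparison might be computed; the learning curves are synthetic placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple

Curve = List[Tuple[int, float]]  # (cumulative samples, evaluation metric)


def samples_to_threshold(curve: Curve, threshold: float) -> Optional[int]:
    """First cumulative sample count at which the metric reaches the threshold."""
    for samples, metric in curve:
        if metric >= threshold:
            return samples
    return None  # never reached within the training budget


def compare(runs: Dict[str, List[Curve]], threshold: float) -> Dict[str, Optional[float]]:
    """Mean samples-to-threshold per method; None if any seed never reaches it."""
    summary: Dict[str, Optional[float]] = {}
    for method, curves in runs.items():
        hits = [samples_to_threshold(c, threshold) for c in curves]
        summary[method] = None if None in hits else mean(hits)
    return summary


# Synthetic curves for two seeds per method, for illustration only.
runs = {
    "one_stage": [[(1000, 0.2), (2000, 0.6), (3000, 0.9)],
                  [(1000, 0.1), (2000, 0.5), (3000, 0.95)]],
    "curriculum": [[(1000, 0.5), (2000, 0.92)],
                   [(1000, 0.9), (2000, 0.97)]],
}
result = compare(runs, threshold=0.9)
```

For a fair count, the curriculum's cumulative samples should include all three stages, not only the final one.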

Figures

Figures reproduced from arXiv: 2501.18490 by Akshit Saradagi, Fausto Mauricio Lagos Suarez, George Nikolakopoulos, Shruti Kotpaliwar, Vidya Sumathy.

**Figure 1.** The Crazyflie Quadrotor. Problem statement. Despite the availability of high-performance computational resources such as Graphics Processing Units (GPUs), training an RL policy to achieve complex control tasks for a Quadrotor, with acceptable performance levels, requires millions of interactions [17] with the training environment. This high demand for interactions makes the training process computational…

**Figure 2.** Reinforcement Learning setup and configuration of th…

**Figure 4.** Evaluation of the curriculum-trained policy using 3…

**Figure 5.** Performance of the trained policy in achieving robus…

**Figure 6.** Motor RPMs generated by the curriculum-trained poli…

**Figure 7.** Performance of the curriculum-trained policy in recovering the target position and attitude when subjected to external disturbances. The plots show the Quadro…

read the original abstract

This article introduces a novel sample-efficient curriculum learning (CL) approach for training an end-to-end reinforcement learning (RL) policy for robust stabilization of a Quadrotor. The learning objective is to simultaneously stabilize position and yaw-orientation from random initial conditions through direct control over motor RPMs (end-to-end), while adhering to pre-specified transient and steady-state specifications. This objective, relevant in aerial inspection applications, is challenging for conventional one-stage end-to-end RL, which requires substantial computational resources and lengthy training times. To address this challenge, this article draws inspiration from human-inspired curriculum learning and decomposes the learning objective into a three-stage curriculum that incrementally increases task complexity, while transferring knowledge from one stage to the next. In the proposed curriculum, the policy sequentially learns hovering, the coupling between translational and rotational degrees of freedom, and robustness to random non-zero initial velocities, utilizing a custom reward function and episode truncation conditions. The results demonstrate that the proposed CL approach achieves superior performance compared to a policy trained conventionally in one stage, with the same reward function and hyperparameters, while significantly reducing computational resource needs (samples) and convergence time. The CL-trained policy's performance and robustness are thoroughly validated in a simulation engine (Gym-PyBullet-Drones), under random initial conditions, and in an inspection pose-tracking scenario. A video presenting our results is available at https://youtu.be/9wv6T4eezAU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Three-stage curriculum speeds quadrotor end-to-end RL training in simulation, but the baseline comparison is weakened by per-stage custom rewards and truncations.

The main takeaway is that this three-stage curriculum for end-to-end quadrotor stabilization reduces the samples and time needed to train a policy that handles random initial conditions and meets transient specs, at least in their PyBullet simulation. They decompose the task into sequential stages: first hovering, then translational-rotational coupling, then robustness to random velocities. The policy carries over between stages, and each uses tailored rewards and truncation. Results show it outperforms a conventional one-stage training on the same reward and hyperparameters, with validation on random starts and an inspection task.

This is a solid, incremental application of curriculum learning to a practical aerial robotics problem. The direct RPM control and focus on inspection-relevant specs make it relevant. The simulation setup with Gym-PyBullet-Drones is standard and allows for the reported checks.

The potential issue is in the baseline comparison. The abstract mentions custom reward functions and episode truncation conditions for the curriculum stages. If those differ from what the single-stage baseline uses, then the environments are not identical, and part of the efficiency gain could come from the staged difficulty rather than transfer alone. The stress-test note flags this correctly based on the given text. Quantitative details like exact sample reductions or error bars are missing from the abstract, which makes it harder to assess the strength of the results. The free parameters for stage transitions are noted but not deeply analyzed.

This paper is for people in RL for robotics who are looking for ways to make end-to-end training more feasible for quadrotors. It won't change the field but provides a worked example. I think it deserves peer review. The empirical demonstration is worth a referee's time to check the details and see if the gains hold up.

Referee Report

1 major / 1 minor

Summary. The manuscript presents a three-stage curriculum learning (CL) method for training an end-to-end RL policy to stabilize a quadrotor’s position and yaw from random initial conditions via direct motor RPM control. The stages progressively address hovering, translational-rotational coupling, and robustness to nonzero initial velocities, each with custom reward functions and truncation conditions. The central claim is that this CL policy outperforms a conventional one-stage baseline trained with identical reward function and hyperparameters, while requiring fewer samples and less training time; results are validated in Gym-PyBullet-Drones under random initial conditions and an inspection pose-tracking task.

Significance. If the comparison to the one-stage baseline is conducted under an identical MDP (same reward, truncation logic, and episode lengths), the work supplies concrete evidence that staged curriculum transfer can improve sample efficiency for end-to-end quadrotor control without retuning the underlying reward or termination rules. Such a result would be useful for aerial robotics applications where direct RL training is computationally prohibitive.

major comments (1)

[Abstract] Abstract: the assertion that the CL approach uses “the same reward function and hyperparameters” as the one-stage baseline is contradicted by the statement that each curriculum stage “utiliz[es] a custom reward function and episode truncation conditions.” Because the performance and sample-efficiency claims rest on the baseline being trained under the identical effective MDP, the manuscript must explicitly state the truncation thresholds, maximum episode lengths, and success criteria applied to the baseline versus each stage (and confirm they are unchanged).

minor comments (1)

[Abstract] Abstract: quantitative metrics, error bars, and ablation details on sample counts or convergence time are absent, which weakens the reader’s ability to gauge the magnitude of the reported gains from the summary alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and constructive feedback. We address the single major comment below regarding consistency in the abstract and the need for explicit details on the MDP components.

Referee: [Abstract] Abstract: the assertion that the CL approach uses “the same reward function and hyperparameters” as the one-stage baseline is contradicted by the statement that each curriculum stage “utiliz[es] a custom reward function and episode truncation conditions.” Because the performance and sample-efficiency claims rest on the baseline being trained under the identical effective MDP, the manuscript must explicitly state the truncation thresholds, maximum episode lengths, and success criteria applied to the baseline versus each stage (and confirm they are unchanged).

Authors: We agree there is an inconsistency in the abstract wording that requires clarification. The curriculum stages use tailored (custom) reward functions and truncation conditions to progressively build the policy, but the one-stage baseline is trained using the identical reward function, hyperparameters, truncation thresholds, maximum episode lengths, and success criteria as the final curriculum stage. This ensures the baseline comparison occurs under the same effective MDP as claimed. We will revise the abstract to remove the ambiguity and add an explicit table (or appendix) listing the truncation thresholds, episode lengths, and success criteria for the baseline and each stage, confirming they are unchanged for the baseline.

revision: yes
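
The table promised in the rebuttal could be backed by a mechanical check that the baseline and the final stage share the same MDP settings. The following is a sketch under stated assumptions: the `MdpSettings` fields and every value in it are placeholders, since the manuscript's actual thresholds are not given in the abstract.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class MdpSettings:
    """Hypothetical summary of the settings the referee asks to see."""
    reward_name: str
    max_episode_steps: int
    truncation_distance_m: float
    success_tolerance_m: float


def differing_fields(a: MdpSettings, b: MdpSettings) -> List[str]:
    """Names of the fields on which two configurations disagree."""
    da, db = asdict(a), asdict(b)
    return [key for key in da if da[key] != db[key]]


# All values below are placeholders, not numbers reported by the authors.
stage_settings = {
    "hover": MdpSettings("custom_reward", 500, 1.0, 0.05),
    "coupling": MdpSettings("custom_reward", 1000, 2.0, 0.05),
    "velocity_robustness": MdpSettings("custom_reward", 1000, 2.0, 0.05),
}
baseline = MdpSettings("custom_reward", 1000, 2.0, 0.05)

mismatch_final = differing_fields(baseline, stage_settings["velocity_robustness"])
mismatch_first = differing_fields(baseline, stage_settings["hover"])
```

An empty `mismatch_final` would support the claim that the baseline and the final stage share one effective MDP. Any non-empty list names exactly which settings to report side by side.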

Circularity Check

0 steps flagged

No circularity: empirical comparison to explicit one-stage baseline

The paper reports an empirical RL experiment comparing a three-stage curriculum policy against a conventional one-stage policy trained with identical reward function and hyperparameters. Performance metrics (success rate, sample count, convergence time) are measured directly against this external baseline in simulation; no derivation, parameter fit, or uniqueness theorem is invoked whose output is definitionally identical to its input. The abstract explicitly states the comparison uses the same reward and hyperparameters, and the method does not rename or smuggle any fitted quantity as a prediction. Self-citations, if present, are not load-bearing for the central empirical claim. This is a standard self-contained empirical result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that staged task decomposition yields positive transfer for this dynamics; no new physical entities or mathematical axioms are introduced beyond standard RL assumptions (Markov property, bounded action space).

free parameters (1)

stage transition thresholds and reward weights
Custom reward function and episode truncation conditions are tuned per stage; their specific values are not reported in the abstract.

axioms (1)

domain assumption: The quadrotor dynamics in Gym-PyBullet-Drones are sufficiently accurate for policy transfer to real hardware.
The paper validates only in simulation; sim-to-real gap is not addressed in the abstract.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 3 internal anchors

[1] A. T. Azar, A. Koubaa, N. Ali Mohamed, H. A. Ibrahim, Z. F. Ibrahim, M. Kazim, A. Ammar, B. Benjdira, A. M. Khamis, I. A. Hameed, and G. Casalino, “Drone Deep Reinforcement Learning: A Review,” Electronics, vol. 10, no. 9, p. 999, Jan. 2021, number: 9 Publisher: Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/20...

[2] J. Alvarez, A. Belbachir, F. Belbachir, J. Chahal, A. Goudjil, J. Gustave, and A. Öztürk Suri, “Forest Fire Localization: From Reinforcement Learning Exploration to a Dynamic Drone Control,” Journal of Intelligent & Robotic Systems, vol. 109, no. 4, p. 83, Nov. 2023. [Online]. Available: https://doi.org/10.1007/s10846-023-02004-z

[3] R. Sutton and A. Barto, Reinforcement Learning, second edition: An Introduction, ser. Adaptive Computation and Machine Learning series. MIT Press, 2018. [Online]. Available: https://books.google.se/books?id=sWV0DwAAQBAJ

[4] W. Koch, R. Mancuso, R. West, and A. Bestavros, “Reinforcement Learning for UAV Attitude Control,” ACM Trans. Cyber-Phys. Syst., vol. 3, no. 2, pp. 22:1–22:21, Feb. 2019. [Online]. Available: https://dl.acm.org/doi/10.1145/3301273

[5] L. Antonyshyn and S. Givigi, “Deep Model-Based Reinforcement Learning for Predictive Control of Robotic Systems with Dense and Sparse Rewards,” Journal of Intelligent & Robotic Systems, vol. 110, no. 3, p. 100, Jul. 2024. [Online]. Available: https://doi.org/10.1007/s10846-024-02118-y

[6] J. Eschmann, D. Albani, and G. Loianno, “Learning to Fly in Seconds,” Apr. 2024, arXiv:2311.13081 [cs, eess]. [Online]. Available: http://arxiv.org/abs/2311.13081

[7] M. Kulkarni, T. J. L. Forgaard, and K. Alexis, “Aerial Gym – Isaac Gym Simulator for Aerial Robots,” May 2023, arXiv:2305.16510 [cs]. [Online]. Available: http://arxiv.org/abs/2305.16510

[8] B. Xu, F. Gao, C. Yu, R. Zhang, Y. Wu, and Y. Wang, “OmniDrones: An Efficient and Flexible Platform for Reinforcement Learning in Drone Control,” Sep. 2023. [Online]. Available: https://arxiv.org/abs/2309.12825v1

[9] A. Ramezani Dooraki and D.-J. Lee, “An innovative bio-inspired flight controller for quad-rotor drones: Quad-rotor drone learning to fly using reinforcement learning,” Robotics and Autonomous Systems, vol. 135, p. 103671, Jan. 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S092188902030511X

[10] W. Xue, H. Wu, H. Ye, and S. Shao, “An Improved Proximal Policy Optimization Method for Low-Level Control of a Quadrotor,” Actuators, vol. 11, no. 4, p. 105, Apr. 2022, number: 4 Publisher: Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/2076-0825/11/4/105

[11] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” Jul. 2019, arXiv:1509.02971 [cs, stat]. [Online]. Available: http://arxiv.org/abs/1509.02971

[12] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel, “Trust Region Policy Optimization,” Apr. 2017, arXiv:1502.05477 [cs]. [Online]. Available: http://arxiv.org/abs/1502.05477

[13] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” Aug. 2017, arXiv:1707.06347 [cs]. [Online]. Available: http://arxiv.org/abs/1707.06347

[14] J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter, “Control of a Quadrotor with Reinforcement Learning,” IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2096–2103, Oct. 2017, arXiv:1707.05110 [cs]. [Online]. Available: http://arxiv.org/abs/1707.05110

[15] R. Beard, “Quadrotor Dynamics and Control Rev 0.1.”

[16] C. Luis and J. L. Ny, “Design of a Trajectory Tracking Controller for a Nanoquadcopter,” Aug. 2016, arXiv:1608.05786 [cs]. [Online]. Available: http://arxiv.org/abs/1608.05786

[17] R. Ferede, G. de Croon, C. De Wagter, and D. Izzo, “End-to-end neural network based optimal quadcopter control,” Robotics and Autonomous Systems, vol. 172, p. 104588, Feb. 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0921889023002270

[18] S. Narvekar, “Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey,” Journal of Machine Learning Research 21, Jul. 2020

[19] J. Karlsson, “Task Decomposition in Reinforcement Learning.” [Online]. Available: https://aaai.org/papers/0006-ss94-02-006-task-decomposition-in-reinforcement-learning/

[20] G. Kwon, B. Kim, and N. K. Kwon, “Reinforcement Learning with Task Decomposition and Task-Specific Reward System for Automation of High-Level Tasks,” Biomimetics, vol. 9, no. 4, p. 196, Apr. 2024, number: 4 Publisher: Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/2313-7673/9/4/196

[21] Z. Jiang and A. F. Lynch, “Quadrotor motion control using deep reinforcement learning,” Journal of Unmanned Vehicle Systems, vol. 9, no. 4, pp. 234–251, Dec. 2021. [Online]. Available: https://cdnsciencepub.com/doi/10.1139/juvs-2021-0010

[22] “utiasDSL/gym-pybullet-drones at 627abb314e473eb52fe2a3c7e30df2de0e7ab589.” [Online]. Available: https://github.com/utiasDSL/gym-pybullet-drones

[23] “Gymnasium.” [Online]. Available: https://zenodo.org/record/8127025

[24] E. C. a. Y. Bai, “PyBullet, a Python module for physics simulation for games, robotics and machine learning,” Apr. 2024, original-date: 2011-04-12T18:45:08Z. [Online]. Available: https://github.com/bulletphysics/bullet3

[25] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: reliable reinforcement learning implementations,” The Journal of Machine Learning Research, vol. 22, no. 1, pp. 268:12348–268:12355, Jan. 2021

[26] M.-C. Popescu, V. E. Balas, L. Perescu-Popescu, and N. Mastorakis, “Multilayer Perceptron and Neural Networks,” vol. 8, no. 7, 2009
