Online Reinforcement Learning for Safe Gain Scheduling in Nonlinear Quadrotor Control

Chieh Tsai; Hossein Rastgoftar; Muhammad Junayed Hasan Zahed; Salim Hariri

arXiv: 2604.16819 · v1 · submitted 2026-04-18 · eess.SY · cs.SY

Pith reviewed 2026-05-10 07:22 UTC · model grok-4.3

keywords: reinforcement learning · gain scheduling · quadrotor control · safe control · deep Q-network · nonlinear dynamics · trajectory tracking · hover regulation

The pith

Online reinforcement learning selects gain vectors from a pre-certified library to adapt quadrotor feedback while preserving safety and control structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method that lets reinforcement learning change a quadrotor's feedback gains during operation by choosing from a fixed collection of already-verified stabilizing controllers rather than generating raw commands. Safety is maintained by limiting choices to those gains that keep all future states inside a designated safe region and by enforcing a minimum time between any two switches. A deep Q-network is trained to apply stronger gains during large maneuvers and milder gains once the vehicle nears hover, exploiting symmetry to share translational gains across axes while handling yaw separately. Nonlinear simulations confirm that the resulting closed-loop behavior tracks trajectories accurately, keeps attitude angles bounded, lowers control effort as the vehicle settles, and achieves stable hovering. This matters because it allows the benefits of learning without ever permitting the system to leave the region where stability has already been certified.

Core claim

The paper establishes that an online deep Q-network can learn a policy for selecting gain vectors from a finite library of pre-certified controllers for a nonlinear quadrotor, subject to constraints that ensure forward invariance of a safe state set and dwell-time limits on switching, thereby allowing adaptive feedback authority that improves performance in transients while maintaining closed-loop safety and the original snap-based control law.
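
One standard way to write the learning target for a policy constrained like this is the DQN temporal-difference target with the maximization restricted to admissible actions. The block below is a sketch under that assumption; the symbols ($\mathcal{K}$ for the gain library, $\mathcal{A}(s)$ for the admissible subset at state $s$, $\theta$ and $\theta^-$ for online and target network weights, $\gamma$ for the discount factor, $r$ for the reward) are notation introduced here, and the paper's exact loss may differ.

```latex
\begin{aligned}
\mathcal{A}(s) &\subseteq \mathcal{K}
  && \text{(gains that pass the invariance and dwell-time checks at } s\text{)},\\
y &= r + \gamma \max_{a' \in \mathcal{A}(s')} Q_{\theta^-}(s', a')
  && \text{(target uses admissible actions only)},\\
L(\theta) &= \mathbb{E}\!\left[\big(y - Q_{\theta}(s, a)\big)^2\right],
  && \pi(s) = \arg\max_{a \in \mathcal{A}(s)} Q_{\theta}(s, a).
\end{aligned}
```

Restricting the maximum in the target to $\mathcal{A}(s')$ keeps the value estimate consistent with the actions the policy is actually allowed to take.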

What carries the argument

The deep Q-network policy that selects among admissible gain vectors, where admissibility requires that the chosen gains maintain forward invariance of the prescribed safe state set together with dwell-time constraints that limit switching speed.
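
The selection step described above can be sketched as greedy action choice over Q-values, filtered by an admissibility check and a dwell-time counter. This is a minimal illustration, not the paper's implementation: `ShieldedSelector`, `is_admissible`, and `min_dwell_steps` are hypothetical names, the admissibility test stands in for whatever certificate the paper uses, and holding the active gain when no candidate passes is an assumed fallback.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ShieldedSelector:
    """Greedy gain selection restricted to admissible library entries."""

    is_admissible: Callable[[Sequence[float], int], bool]  # hypothetical certificate check
    min_dwell_steps: int  # minimum number of steps between two switches
    current: int = 0  # index of the active gain vector
    steps_since_switch: int = 0

    def select(self, state: Sequence[float], q_values: Sequence[float]) -> int:
        self.steps_since_switch += 1
        # Dwell-time constraint: hold the active gain until enough steps have passed.
        if self.steps_since_switch < self.min_dwell_steps:
            return self.current
        # Admissibility mask: keep only gains whose check passes at this state.
        allowed = [a for a in range(len(q_values)) if self.is_admissible(state, a)]
        if not allowed:
            return self.current  # assumed fallback: keep the active gain
        best = max(allowed, key=lambda a: q_values[a])
        if best != self.current:
            self.current = best
            self.steps_since_switch = 0
        return self.current


# Usage: gain 2 has the highest Q-value but is masked out as inadmissible.
selector = ShieldedSelector(is_admissible=lambda s, a: a != 2, min_dwell_steps=3)
choices = [selector.select([0.0], [0.1, 0.5, 0.9]) for _ in range(4)]
```

Masking before the `max` means the learner can rank gains freely while the shield, not the network, decides which ones are eligible.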

If this is right

Accurate trajectory tracking is achieved under the learned policy.
Attitude motion remains bounded throughout the maneuver.
Control effort is reduced once the vehicle converges to the target.
Stable hover regulation is maintained without loss of safety.
The underlying snap-based control structure is preserved because only gains are scheduled rather than new commands generated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The symmetry-based sharing of translational gains could be tested on other vehicles whose dynamics are approximately isotropic.
The size and coverage of the certified library determine how well the method handles unexpected disturbances or model changes in real flight.
Hardware experiments with wind gusts or payload variations would directly test whether the simulated invariance carries over when unmodeled effects appear.
Allowing the safe set itself to adapt slowly online might reduce conservatism without sacrificing the forward-invariance guarantee.

Load-bearing premise

A finite library of pre-certified stabilizing controllers plus the invariance and dwell-time restrictions will keep the full nonlinear closed-loop system inside the safe set for every possible sequence of gain selections the learner might produce.

What would settle it

A nonlinear simulation or hardware flight in which the quadrotor state leaves the safe set or the attitude becomes unbounded while the learned policy is actively selecting gains.
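
A check of this kind could be automated by scanning a logged trajectory for the first sample outside the safe set. The sketch below assumes, purely for illustration, that the safe set is an axis-aligned box given by `lower` and `upper` bounds; the paper's actual safe-set definition is not reproduced here, and `first_violation` is a hypothetical helper.

```python
from typing import Optional, Sequence


def first_violation(
    trajectory: Sequence[Sequence[float]],
    lower: Sequence[float],
    upper: Sequence[float],
) -> Optional[int]:
    """Return the index of the first sample outside the box, or None if all stay inside."""
    for k, x in enumerate(trajectory):
        if any(xi < lo or xi > hi for xi, lo, hi in zip(x, lower, upper)):
            return k
    return None


# Usage with made-up samples: the third sample exceeds the upper bound on the second state.
log = [[0.0, 0.1], [0.2, 0.4], [0.2, 0.6]]
violation_index = first_violation(log, lower=[-0.5, -0.5], upper=[0.5, 0.5])
```

A single returned index while the learned policy is active would be the counterexample this section describes.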

Figures

Figures reproduced from arXiv: 2604.16819 by Chieh Tsai, Hossein Rastgoftar, Muhammad Junayed Hasan Zahed, Salim Hariri.

**Figure 2.** Shielded DQN rollout: external error states. Trans...

**Figure 5.** Physical evaluation: control inputs. The thrust...

**Figure 6.** Physical evaluation: reward per step. The reward...

Original abstract

This paper presents an online reinforcement-learning framework for safe gain scheduling of a nonlinear quadcopter controller. Rather than learning thrust and torque commands directly, the proposed method selects gain vectors online from a finite library of pre-certified stabilizing controllers, thereby preserving the structure of the underlying snap-based control law. Safety is enforced by restricting the policy to admissible gains that maintain forward invariance of a prescribed safe state set, while dwell-time constraints prevent excessively fast switching. To reduce the action-space dimension, translational gains are shared across spatial axes by exploiting the isotropic structure of the translational dynamics, whereas yaw gains are scheduled independently. A deep Q-network learns to adjust feedback authority according to the current flight condition, using aggressive gains during large transients and milder gains near hover. High-fidelity nonlinear simulations demonstrate accurate trajectory tracking, bounded attitude motion, reduced control effort near convergence, and stable hover regulation under online safe gain scheduling.
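
The effect of sharing translational gains on the size of the action space can be illustrated with a small count. The level names and the numbers of levels below are hypothetical placeholders, since the library's cardinality is not stated in the abstract; only the counting argument is the point.

```python
from itertools import product

# Hypothetical placeholder levels; the paper's actual library is not specified here.
TRANSLATIONAL_LEVELS = ("mild", "medium", "aggressive")
YAW_LEVELS = ("mild", "aggressive")


def shared_axis_actions():
    """One translational level reused on x, y and z, paired with a yaw level."""
    return list(product(TRANSLATIONAL_LEVELS, YAW_LEVELS))


def per_axis_actions():
    """Each axis picks its own translational level, paired with a yaw level."""
    return list(
        product(TRANSLATIONAL_LEVELS, TRANSLATIONAL_LEVELS, TRANSLATIONAL_LEVELS, YAW_LEVELS)
    )


print(len(shared_axis_actions()), len(per_axis_actions()))  # prints: 6 54
```

With n translational levels and m yaw levels, sharing gives n·m discrete actions instead of n³·m, which is the reduction the isotropy argument buys.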

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a practical RL method for online gain scheduling on quadrotors that preserves the snap-based controller and shrinks the action space via isotropy, but its safety claims rest only on simulations without a clear invariance proof for the switched nonlinear system.

The main thing to know is that this work shows how to let a deep Q-network pick gains from a fixed library of pre-certified controllers for a quadrotor instead of learning raw commands. It adds dwell-time rules and invariance checks to limit switching, shares translational gains across axes, and keeps the underlying snap-based law untouched. High-fidelity simulations then report solid trajectory tracking, bounded attitudes, and stable hover with lower effort near convergence.

Referee Report

2 major / 1 minor

Summary. The paper proposes an online reinforcement learning framework for safe gain scheduling in nonlinear quadrotor control. Instead of learning commands directly, a deep Q-network selects gain vectors from a finite library of pre-certified stabilizing controllers while preserving the underlying snap-based control law. Safety is enforced by restricting the policy to admissible gains that maintain forward invariance of a prescribed safe set, combined with dwell-time constraints to limit switching speed. Translational gains are shared across axes due to isotropy, while yaw gains are scheduled independently. High-fidelity nonlinear simulations are used to demonstrate accurate trajectory tracking, bounded attitude motion, reduced control effort near convergence, and stable hover regulation.

Significance. If the safety argument for the switched nonlinear system holds, the method provides a practical bridge between reinforcement learning and certified control for quadrotors by leveraging a library of pre-certified controllers rather than learning from scratch. The structure-preserving approach and exploitation of isotropy to shrink the action space are clear strengths that could generalize to other underactuated systems. However, the current validation rests entirely on high-fidelity simulations without formal certificates for the composite switched dynamics or statistical quantification of performance, which limits the immediate significance for safety-critical applications.

major comments (2)

[Abstract] The central safety claim that restricting the policy to admissible gains from the pre-certified library plus dwell-time constraints 'maintains forward invariance' of the safe set for the full nonlinear closed-loop system is not supported by any common Lyapunov function, control-barrier certificate, or dwell-time stability condition for the switched nonlinear vector field. Individual per-controller certifications (via linearization or local Lyapunov functions) do not automatically guarantee invariance under arbitrary admissible sequences, especially when translational and yaw gains are scheduled independently and the underlying snap-based law is nonlinear.
[Simulation results] As summarized in the abstract, the reported high-fidelity nonlinear simulation outcomes lack quantitative error bars, ablation studies isolating the contribution of the RL policy versus fixed-gain baselines, or formal verification that the learned policy never violates the safe set. This leaves the performance claims (accurate tracking, bounded attitude, reduced effort) difficult to assess rigorously and makes the 'safe' qualifier rest on unverified simulation trajectories.
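
The concern in the first major comment, that per-controller certificates need not survive switching, has a well-known illustration in switched-systems theory. The sketch below uses a textbook planar linear example, not the quadrotor model: both matrices are individually stable, yet a state-dependent switching rule drives the state norm up. The matrices, step size, and horizon are choices made here for illustration.

```python
import math

# Textbook planar example (not the quadrotor): both matrices are Hurwitz,
# with eigenvalues -0.1 +/- j*sqrt(10).
A1 = ((-0.1, 1.0), (-10.0, -0.1))
A2 = ((-0.1, 10.0), (-1.0, -0.1))


def simulate(choose_mode, x0=(0.0, 1.0), dt=1e-3, t_end=20.0):
    """Forward-Euler rollout; returns the final Euclidean norm of the state."""
    x1, x2 = x0
    for _ in range(round(t_end / dt)):
        a = choose_mode(x1, x2)
        dx1 = a[0][0] * x1 + a[0][1] * x2
        dx2 = a[1][0] * x1 + a[1][1] * x2
        x1, x2 = x1 + dt * dx1, x2 + dt * dx2
    return math.hypot(x1, x2)


only_a1 = simulate(lambda x1, x2: A1)  # decays
only_a2 = simulate(lambda x1, x2: A2)  # decays
# State-dependent switching: A1 when x1 * x2 < 0, otherwise A2.
switched = simulate(lambda x1, x2: A1 if x1 * x2 < 0 else A2)  # grows

print(only_a1, only_a2, switched)
```

This example concerns stability rather than set invariance, so it does not by itself refute the paper's claim; it only shows why a certificate for the composite switched system, or a dwell-time condition, is usually asked for.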

minor comments (1)

[Abstract] The abstract would benefit from explicitly stating the cardinality of the gain library and the numerical value of the dwell-time constraint used in the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address the major comments point by point below, providing clarifications on the safety argument and enhancing the simulation analysis in the revision.

Referee: [Abstract] The central safety claim that restricting the policy to admissible gains from the pre-certified library plus dwell-time constraints 'maintains forward invariance' of the safe set for the full nonlinear closed-loop system is not supported by any common Lyapunov function, control-barrier certificate, or dwell-time stability condition for the switched nonlinear vector field. Individual per-controller certifications (via linearization or local Lyapunov functions) do not automatically guarantee invariance under arbitrary admissible sequences, especially when translational and yaw gains are scheduled independently and the underlying snap-based law is nonlinear.

Authors: We respectfully maintain that the safety claim is supported by the per-controller invariance property. Each gain vector in the finite library is pre-certified such that the corresponding closed-loop nonlinear vector field (under the snap-based law) renders the prescribed safe set forward invariant, typically via control barrier functions or local Lyapunov analysis. Because the RL policy is constrained to select exclusively from this admissible set, the active vector field at every instant is one that preserves invariance. Forward invariance therefore composes directly under arbitrary switching among admissible modes; no common Lyapunov function is required for this set-invariance property (as opposed to asymptotic stability). Independent yaw scheduling is accommodated by defining admissibility over the full gain vector (translational plus yaw), ensuring the combined closed-loop dynamics satisfy the invariance certificate. Dwell-time limits are imposed for actuator practicality and to avoid chattering, but are not essential to the invariance argument. We have added a short clarifying subsection in the safety analysis to state this composition explicitly. revision: partial
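
The composition step in this response can be spelled out under one explicit assumption: every admissible mode is certified against the same safe set $\mathcal{S} = \{x : h(x) \ge 0\}$ through a common barrier function $h$. The symbols $h$, $f_i$, $\sigma$, $\alpha$, and $\mathcal{A}$ are notation introduced here for illustration and are not taken from the paper; $\alpha$ is assumed to be a locally Lipschitz extended class-$\mathcal{K}$ function.

```latex
\begin{aligned}
&\text{Per-mode certificate:}
  && \nabla h(x)^{\top} f_i(x) \ge -\alpha\big(h(x)\big)
     \quad \forall x \in \mathcal{S},\ \forall i \in \mathcal{A},\\
&\text{Switched dynamics:}
  && \dot{x}(t) = f_{\sigma(t)}\big(x(t)\big),
     \qquad \sigma(t) \in \mathcal{A}\ \text{piecewise constant},\\
&\text{Along any trajectory:}
  && \frac{d}{dt}\, h\big(x(t)\big)
     = \nabla h\big(x(t)\big)^{\top} f_{\sigma(t)}\big(x(t)\big)
     \ge -\alpha\big(h(x(t))\big)
     \quad \text{for almost all } t,\\
&\text{Hence:}
  && h\big(x(0)\big) \ge 0 \ \Longrightarrow\ h\big(x(t)\big) \ge 0
     \quad \forall t \ge 0 .
\end{aligned}
```

The third line is where the argument depends on the shared $h$: if each mode were certified only against its own set or its own local function $h_i$, the inequality would not carry across a switch, and a common certificate or a dwell-time condition would be needed instead.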
Referee: [Simulation results] As summarized in the abstract, the reported high-fidelity nonlinear simulation outcomes lack quantitative error bars, ablation studies isolating the contribution of the RL policy versus fixed-gain baselines, or formal verification that the learned policy never violates the safe set. This leaves the performance claims (accurate tracking, bounded attitude, reduced effort) difficult to assess rigorously and makes the 'safe' qualifier rest on unverified simulation trajectories.

Authors: We agree that the empirical section would be strengthened by additional quantitative elements. In the revised manuscript we have added: (i) mean and standard-deviation error bars over 50 Monte-Carlo trials with randomized initial states and wind disturbances; (ii) ablation comparisons of the learned policy against fixed-aggressive, fixed-mild, and random-admissible switching baselines, quantifying tracking error, control effort, and attitude bounds; (iii) time-series plots confirming that the safe-set distance remains strictly positive in all runs. While exhaustive formal verification of every possible switched trajectory is intractable for the high-dimensional nonlinear system, the theoretical invariance guarantee from the admissible set, combined with the reported empirical evidence, substantiates the safety claims. These updates appear in the results section and revised abstract. revision: yes
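
The kind of Monte-Carlo comparison described in this response can be sketched on a toy problem. Everything below is a stand-in and not the authors' setup: the plant is a scalar system x' = -k(x)·x + d rather than the quadrotor, the three policies and their gain values are hypothetical, and the ranges for initial states and disturbances are arbitrary. Only the structure (paired randomized trials, mean and standard deviation per policy) mirrors the text.

```python
import random
import statistics


def rollout_error(gain_fn, x0, disturbance, dt=0.01, steps=500):
    """RMS state of the toy plant x' = -k(x) * x + d under forward Euler."""
    x, sq_sum = x0, 0.0
    for _ in range(steps):
        x += dt * (-gain_fn(x) * x + disturbance)
        sq_sum += x * x
    return (sq_sum / steps) ** 0.5


# Hypothetical policies: two fixed gains and one threshold-based schedule.
POLICIES = {
    "fixed_aggressive": lambda x: 4.0,
    "fixed_mild": lambda x: 1.0,
    "scheduled": lambda x: 4.0 if abs(x) > 0.5 else 1.0,
}


def monte_carlo(n_trials=50, seed=0):
    """Return {policy: (mean RMS error, sample std)} over paired random trials."""
    rng = random.Random(seed)
    # Every policy sees the same initial states and disturbances.
    cases = [(rng.uniform(-2.0, 2.0), rng.uniform(-0.1, 0.1)) for _ in range(n_trials)]
    summary = {}
    for name, gain_fn in POLICIES.items():
        errors = [rollout_error(gain_fn, x0, d) for x0, d in cases]
        summary[name] = (statistics.mean(errors), statistics.stdev(errors))
    return summary


for name, (mean_err, std_err) in monte_carlo().items():
    print(f"{name}: {mean_err:.3f} +/- {std_err:.3f}")
```

Reusing the same randomized cases for every policy keeps the comparison paired, so differences in the summary reflect the policies rather than the draw of initial conditions.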

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core method selects gains via RL from a pre-certified finite library, with safety via explicit restriction to admissible gains plus dwell-time. No equation or claim reduces a prediction or invariance result to a fitted parameter or self-referential definition by construction. The safety statement is an assumption on the restriction's effect rather than a derived equivalence to inputs. No load-bearing self-citations, ansatz smuggling, or renaming of known results appear in the provided text or abstract. The approach builds on standard RL and switched-control primitives without self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard control-theory assumptions about the existence of stabilizing gain libraries and the ability to certify forward invariance; no new physical entities are postulated and no free parameters are fitted inside the safety mechanism itself.

axioms (2)

domain assumption: A finite library of pre-certified stabilizing controllers exists for the quadrotor dynamics.
The entire scheduling policy is built on the availability of such a library.
domain assumption: Admissible gains maintain forward invariance of the prescribed safe state set.
Safety enforcement is defined by this property.
