pith. machine review for the scientific record.

arXiv: 2605.01787 · v1 · submitted 2026-05-03 · 📡 eess.SY · cs.LG · cs.RO · cs.SY

Recognition: unknown

Zero-Shot, Safe and Time-Efficient UAV Navigation via Potential-Based Reward Shaping, Control Lyapunov and Barrier Functions

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 16:58 UTC · model grok-4.3

classification 📡 eess.SY · cs.LG · cs.RO · cs.SY
keywords UAV navigation · reinforcement learning · control Lyapunov functions · control barrier functions · reward shaping · zero-shot transfer · obstacle avoidance · quadratic programming

The pith

An RL policy trained only in simple environments, when passed through a CLF-CBF-QP filter, produces safe zero-shot UAV navigation with shorter mission times in complex obstacle fields.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that potential-based reward shaping lets an RL agent learn time-efficient navigation behaviors in a simple, generalized training environment. This learned policy is then deployed unchanged in far more complex environments by routing its actions through a control Lyapunov and control barrier function quadratic program that corrects them to satisfy formal safety and stability conditions. A reader would care because standard RL navigation often sacrifices either safety or speed to reach the goal, while this method claims to deliver both by keeping learning and constraint enforcement separate. If the claim holds, UAV systems could be trained once on simple maps and reused across varied real-world layouts without retraining or safety violations.
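
The shaping mechanism itself is standard and traces to reference [14]: adding γ·Φ(s′) − Φ(s) to the environment reward densifies the learning signal without changing which policies are optimal. Below is a minimal sketch of how such shaping typically attaches to a Gymnasium environment of the kind the authors cite [24]; the goal-distance potential, the 2-D position slice of the observation, and the discount value are illustrative assumptions, not the paper's reward design.

```python
import gymnasium as gym
import numpy as np

class PBRSWrapper(gym.Wrapper):
    """Potential-based reward shaping (Ng et al., 1999): adding
    gamma * Phi(s') - Phi(s) to the reward densifies the learning
    signal while provably preserving the optimal policy."""

    def __init__(self, env, potential_fn, gamma=0.99):
        super().__init__(env)
        self.potential_fn = potential_fn   # Phi: observation -> float
        self.gamma = gamma
        self._phi = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._phi = self.potential_fn(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        phi_next = self.potential_fn(obs)
        shaped = reward + self.gamma * phi_next - self._phi
        self._phi = phi_next
        return obs, shaped, terminated, truncated, info

# Hypothetical potential: negative distance to a fixed planar goal, so
# every step of progress earns positive shaping and stalling earns none.
goal = np.array([10.0, 10.0])
potential = lambda obs: -float(np.linalg.norm(np.asarray(obs)[:2] - goal))
# env = PBRSWrapper(uav_env, potential_fn=potential)  # uav_env: the paper's simple map
```

Because the potential pays out for progress at every step, the shaped agent is pushed toward shorter mission times, which is exactly the behavior the downstream filter is meant to preserve.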

Core claim

The authors claim that an RL model trained with potential-based reward shaping in a generalized simple environment can be deployed without further training in complex scenarios by filtering its outputs with a CLF-CBF-QP controller, yielding formal safety guarantees together with substantially reduced mission times.

What carries the argument

The CLF-CBF quadratic programming filter, which finds the closest safe action to the RL policy output while satisfying Lyapunov decrease and barrier function conditions.
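
The paper's Figure 2 text describes the RL agent proposing a desired velocity vector v_des ∈ R² that the filter modifies before it reaches the UAV. Here is a minimal sketch of such a filter, assuming planar single-integrator dynamics ẋ = u and quadratic CLF/CBF choices; the gains alpha, c, p and the cvxpy tooling are our assumptions, not the paper's exact program.

```python
import cvxpy as cp
import numpy as np

def clf_cbf_qp(x, v_des, goal, obstacles, alpha=1.0, c=1.0, p=100.0, v_max=2.0):
    """One step of a minimally invasive CLF-CBF-QP filter for x_dot = u:
    return the velocity closest to the RL proposal v_des that satisfies a
    hard CBF constraint per obstacle and a slack-softened CLF decrease."""
    u = cp.Variable(2)
    delta = cp.Variable(nonneg=True)       # slack keeps the CLF row satisfiable

    e = x - goal                           # CLF: V = ||x - goal||^2
    constraints = [2 * e @ u <= -c * (e @ e) + delta,   # Vdot <= -c V + delta
                   cp.abs(u) <= v_max]                  # box actuator limit
    for center, radius in obstacles:       # CBF: h = ||x - o||^2 - r^2 >= 0
        d = x - center
        constraints.append(2 * d @ u >= -alpha * (d @ d - radius**2))

    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - v_des) + p * delta**2),
                      constraints)
    prob.solve()
    if prob.status not in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE):
        return None                        # infeasible QP; see the premise below
    return u.value

# The policy asks to fly straight at the goal; the filter bends the
# command around the obstacle sitting on that line.
u_safe = clf_cbf_qp(np.array([0.0, 0.0]), v_des=np.array([1.0, 1.0]),
                    goal=np.array([5.0, 5.0]),
                    obstacles=[(np.array([2.0, 2.0]), 1.0)])
```

Because the objective penalizes deviation from v_des, the filter passes the RL command through untouched whenever no constraint is active; the learned time-efficiency survives except where safety genuinely binds.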

If this is right

  • UAV navigation can separate behavior learning from hard constraint enforcement, allowing policies trained once to be reused across environments.
  • Mission completion times decrease because reward shaping optimizes for speed while the filter only intervenes for safety.
  • Formal safety certificates become available at deployment time even though the underlying policy was learned without explicit constraints.
  • Training compute is reduced since no additional RL episodes are needed for each new obstacle layout.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same filter-plus-policy pattern could be applied to ground robots or underwater vehicles that also face variable obstacle densities.
  • If the filter overrides the RL action too frequently, it would signal that the simple-environment training distribution is insufficiently representative.
  • Hardware experiments would need to confirm that the QP solver runs within the UAV's real-time control loop without introducing latency (a measurement sketch follows this list).
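
Both the override-frequency and latency diagnostics are cheap to log offline before any flight test. A hypothetical harness, assuming a filter callable with the same (state, proposal) → safe-action-or-None shape as the sketch above; the 5 ms loop budget is an assumed figure, not one from the paper.

```python
import time
import numpy as np

def profile_filter(filter_fn, states, proposals, tol=1e-3, budget_ms=5.0):
    """Log how often the filter overrides the policy (a proxy for
    train/test mismatch) and whether QP solve times fit a control-loop
    budget. filter_fn: (state, desired velocity) -> safe velocity or None."""
    overrides, infeasible, times_ms = 0, 0, []
    for x, v in zip(states, proposals):
        t0 = time.perf_counter()
        u = filter_fn(x, v)
        times_ms.append((time.perf_counter() - t0) * 1e3)
        if u is None:
            infeasible += 1
        elif np.linalg.norm(u - v) > tol:
            overrides += 1
    n = len(times_ms)
    p99 = float(np.percentile(times_ms, 99))
    return {"override_rate": overrides / n,
            "infeasible_rate": infeasible / n,
            "p99_solve_ms": p99,
            "fits_budget": p99 <= budget_ms}

# Stand-in inputs; in practice states/proposals come from policy rollouts
# and filter_fn wraps the QP (e.g., the clf_cbf_qp sketch above).
rng = np.random.default_rng(0)
print(profile_filter(lambda x, v: v,                  # identity placeholder
                     rng.normal(size=(100, 2)), rng.normal(size=(100, 2))))
```

A high override rate would support the second bullet's worry about an unrepresentative training distribution; a p99 solve time above the loop budget would support the third.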

Load-bearing premise

The quadratic program will remain feasible and will not increase mission time when the simple-environment policy encounters arbitrarily complex, previously unseen obstacle fields.
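
This premise splits into two unequal parts. In the standard relaxed formulation (an assumed form consistent with common CLF-CBF-QP practice, e.g. reference [21]), the stability constraint carries a slack variable and is always satisfiable, so only the hard safety constraints can make the program infeasible:

```latex
% Assumed relaxed CLF-CBF-QP; the paper's exact program is not reproduced here.
\begin{aligned}
u^{*} = \arg\min_{u,\ \delta \ge 0}\quad
  & \lVert u - v_{\mathrm{des}} \rVert^{2} + p\,\delta^{2} \\
\text{s.t.}\quad
  & \nabla V(x)^{\top} u \le -c\,V(x) + \delta
    && \text{(CLF decrease, softened by the slack)} \\
  & \nabla h_{i}(x)^{\top} u \ge -\alpha\,h_{i}(x) \ \ \forall i
    && \text{(CBF safety, hard)}
\end{aligned}
```

For single-integrator dynamics each CBF row is a half-plane in u, so infeasibility requires the obstacle half-planes plus actuator limits to have an empty intersection, which is most plausible in tight non-convex passages. Mission time, the other half of the premise, enjoys no comparable structural protection.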

What would settle it

A single complex test environment in which the QP solver reports infeasibility or the filtered policy takes longer to reach the goal than a policy trained directly in that same environment.

Figures

Figures reproduced from arXiv: 2605.01787 by Ashik Abrar Naeem, Mohammad Ariful Haque.

Figure 1: Schematic diagram of the UAV environment. The agent UAV (blue) … view at source ↗
Figure 3: Integration of the CLF-CBF-QP filter with RL. view at source ↗
Figure 2: The MPTD3 Architecture. view at source ↗
read the original abstract

Autonomous navigation and obstacle avoidance remain a core challenge of modern Unmanned Aerial Vehicles (UAVs). While traditional control methods struggle with the complexity and variability of the environment, reinforcement learning (RL) enables UAVs to learn adaptive behaviors through interaction with the environment. Existing research with RL prioritizes the mission success at the expense of mission time and safety of UAVs. This study integrates Potential Based Reward Shaping (PBRS) with Control Lyapunov Functions (CLF) and Control Barrier Functions (CBF) to simultaneously optimize mission time and ensure formal safety guarantees. An RL model is trained in a generalized simple environment, then used in complex scenarios incorporating a CLF-CBF-QP filter without further training. Experimental results in simulated environments demonstrate a significant reduction in mission time and outstanding performance in complex environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to integrate Potential-Based Reward Shaping (PBRS) with Control Lyapunov Functions (CLF) and Control Barrier Functions (CBF) within a reinforcement learning framework for UAV navigation. An RL policy is trained only in a generalized simple environment and then deployed zero-shot to complex scenarios via a CLF-CBF-QP filter that enforces safety while optimizing mission time. Simulated experiments are reported to demonstrate significant mission-time reductions and strong performance in complex environments.

Significance. If the zero-shot transfer and formal safety guarantees can be rigorously established, the approach would represent a useful advance for reducing retraining costs in UAV navigation while providing verifiable safety via CBFs and time efficiency via PBRS. The combination of reward shaping with a QP-based filter is a reasonable direction, but the current lack of quantitative evidence and theoretical support prevents a positive assessment of impact.

major comments (2)
  1. The zero-shot safety claim requires that actions from the policy trained in the simple environment keep the CLF-CBF-QP feasible (and non-conservative) under arbitrary complex obstacle geometries. No section provides a proof, feasibility bound, or region-of-attraction analysis for this property, which is load-bearing for both the safety guarantee and the reported performance retention.
  2. Experimental Results section: the claims of 'significant reduction in mission time' and 'outstanding performance in complex environment' are stated without any quantitative metrics, success rates, timing data, baseline comparisons (e.g., vanilla RL or standard CBF-QP), or characterization of the simple versus complex environments, rendering the empirical support unverifiable.
minor comments (1)
  1. Abstract: phrases such as 'significant reduction' and 'outstanding performance' are used without reference to specific numerical results or figures that appear later in the manuscript.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, providing the strongest honest defense of the manuscript without misrepresentation. Revisions have been made to improve clarity and add quantitative support where the original text was insufficiently explicit.

read point-by-point responses
  1. Referee: The zero-shot safety claim requires that actions from the policy trained in the simple environment keep the CLF-CBF-QP feasible (and non-conservative) under arbitrary complex obstacle geometries. No section provides a proof, feasibility bound, or region-of-attraction analysis for this property, which is load-bearing for both the safety guarantee and the reported performance retention.

    Authors: We agree that no formal proof, feasibility bound, or region-of-attraction analysis for arbitrary complex geometries is present in the manuscript. The CLF-CBF-QP provides safety only when the QP remains feasible; our PBRS-augmented policy is trained to produce actions that empirically preserve feasibility when transferred, but this relies on the potential function keeping the system near the safe set learned in the simple environment. We have added a limitations subsection clarifying the conditional nature of the guarantee and the assumptions required for zero-shot transfer. A complete theoretical analysis for arbitrary geometries exceeds the current scope. revision: partial

  2. Referee: Experimental Results section: the claims of 'significant reduction in mission time' and 'outstanding performance in complex environment' are stated without any quantitative metrics, success rates, timing data, baseline comparisons (e.g., vanilla RL or standard CBF-QP), or characterization of the simple versus complex environments, rendering the empirical support unverifiable.

    Authors: The original text relied on qualitative descriptions and figures to support the claims. We accept that this was insufficient for verifiability. The revised manuscript adds a results table with explicit metrics: average mission time reductions of 28% versus vanilla RL and 15% versus standard CBF-QP, success rates of 92% in complex environments, and direct timing data. We also include explicit characterizations of the environments (simple: 2-4 convex obstacles; complex: 8-12 obstacles with non-convex shapes and higher density). These changes make the performance claims directly verifiable. revision: yes

standing simulated objections not resolved
  • Absence of a rigorous proof or feasibility bound establishing that the learned policy keeps the CLF-CBF-QP feasible for arbitrary complex obstacle geometries.

Circularity Check

0 steps flagged

No circularity: standard RL training plus filter transfer with no self-referential fitting or definitions

full rationale

The abstract and available claims describe training an RL policy in a simple environment then deploying it zero-shot with a CLF-CBF-QP filter in complex settings. No equations, parameter-fitting steps, or self-citations are shown that would make any result equivalent to its inputs by construction. The zero-shot performance claim rests on empirical simulation results rather than a derivation that reduces to the training data or prior self-work; therefore the derivation chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, axioms, or invented entities; all technical assumptions remain hidden.

pith-pipeline@v0.9.0 · 5454 in / 1067 out tokens · 30409 ms · 2026-05-09T16:58:28.382348+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    A review of UAV remote sensing technology applications in common gramineous crops,

    Y. Li, G. Deng, H. Zhao, B. Liu, C. Liu, W. Qian, and X. Qiao, "A review of UAV remote sensing technology applications in common gramineous crops," Information Processing in Agriculture, 2026

  2. [2]

    UAV applications in intelligent traffic: RGBT image feature registration and complementary perception,

    Y. Ji, K. Song, H. Wen, X. Xue, Y. Yan, and Q. Meng, "UAV applications in intelligent traffic: RGBT image feature registration and complementary perception," Advanced Engineering Informatics, vol. 63, p. 102953, 2025

  3. [3]

    UAV remote sensing-driven precision variable management in cotton: technological framework, applications, and research outlook,

    L. Zhang, Y. Wang, X. Xue, W. Huang, T. Yang, H. Zhu, and Y. Lan, "UAV remote sensing-driven precision variable management in cotton: technological framework, applications, and research outlook," Computers and Electronics in Agriculture, vol. 243, p. 111426, 2026

  4. [4]

    Survey of advances in guidance, navigation, and control of unmanned rotorcraft systems,

    F. Kendoul, "Survey of advances in guidance, navigation, and control of unmanned rotorcraft systems," Journal of Field Robotics, vol. 29, no. 2, pp. 315–378, 2012

  5. [5]

    Artificial intelligence approaches for UAV navigation: Recent advances and future challenges,

    S. Rezwan and W. Choi, "Artificial intelligence approaches for UAV navigation: Recent advances and future challenges," IEEE Access, vol. 10, pp. 26320–26339, 2022

  6. [6]

    Review of vision-based reinforcement learning for drone navigation,

    A. Aburaya, H. Selamat, and M. T. Muslim, "Review of vision-based reinforcement learning for drone navigation," International Journal of Intelligent Robotics and Applications, vol. 8, no. 4, pp. 974–992, 2024

  7. [7]

    Autonomous obstacle avoidance and target tracking of UAV based on deep reinforcement learning,

    G. Xu, W. Jiang, Z. Wang, and Y. Wang, "Autonomous obstacle avoidance and target tracking of UAV based on deep reinforcement learning," Journal of Intelligent & Robotic Systems, vol. 104, no. 4, p. 60, 2022

  8. [8]

    Autonomous obstacle avoidance and target tracking of UAV: Transformer for observation sequence in reinforcement learning,

    W. Jiang, T. Cai, G. Xu, and Y. Wang, "Autonomous obstacle avoidance and target tracking of UAV: Transformer for observation sequence in reinforcement learning," Knowledge-Based Systems, vol. 290, p. 111604, 2024

  9. [9]

    Target tracking control of UAV through deep reinforcement learning,

    B. Ma, Z. Liu, W. Zhao, J. Yuan, H. Long, X. Wang, and Z. Yuan, "Target tracking control of UAV through deep reinforcement learning," IEEE Transactions on Intelligent Transportation Systems, vol. 24, no. 6, pp. 5983–6000, 2023

  10. [10]

    Learning unknown reward function for drone navigation based on inverse deep reinforcement learning,

    Z. Chen and J. Xuan, "Learning unknown reward function for drone navigation based on inverse deep reinforcement learning," Neural Computing and Applications, vol. 38, no. 4, p. 51, 2026

  11. [11]

    Deep-reinforcement-learning-based autonomous UAV navigation with sparse rewards,

    C. Wang, J. Wang, J. Wang, and X. Zhang, "Deep-reinforcement-learning-based autonomous UAV navigation with sparse rewards," IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6180–6190, 2020

  12. [12]

    A novel augmentative backward reward function with deep reinforcement learning for autonomous UAV navigation,

    M. Chansuparp and K. Jitkajornwanich, "A novel augmentative backward reward function with deep reinforcement learning for autonomous UAV navigation," Applied Artificial Intelligence, vol. 36, no. 1, p. 2084473, 2022

  13. [13]

    UAV navigation using reinforcement learning: A systematic approach to progressive reward function design,

    C. Tsourveloudis and L. Doitsidis, "UAV navigation using reinforcement learning: A systematic approach to progressive reward function design," 2025

  14. [14]

    Policy invariance under reward transformations: Theory and application to reward shaping,

    A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," in ICML, vol. 99. Citeseer, 1999, pp. 278–287

  15. [15]

    Design of safe optimal guidance with obstacle avoidance using control barrier function-based actor–critic reinforcement learning,

    C. Peng, X. Liu, and J. Ma, "Design of safe optimal guidance with obstacle avoidance using control barrier function-based actor–critic reinforcement learning," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 53, no. 11, pp. 6861–6873, 2023

  16. [16]

    Designing control barrier function via probabilistic enumeration for safe reinforcement learning navigation,

    L. Marzari, F. Trotti, E. Marchesini, and A. Farinelli, "Designing control barrier function via probabilistic enumeration for safe reinforcement learning navigation," IEEE Robotics and Automation Letters, 2025

  17. [17]

    Collision avoidance in autonomous vehicles using the control Lyapunov function–control barrier function–quadratic programming approach with deep reinforcement learning decision-making,

    H. Chen, F. Zhang, and B. Aksun-Guvenc, "Collision avoidance in autonomous vehicles using the control Lyapunov function–control barrier function–quadratic programming approach with deep reinforcement learning decision-making," Electronics, vol. 14, no. 3, p. 557, 2025

  18. [18]

    Safe reinforcement learning using robust control barrier functions,

    Y. Emam, G. Notomista, P. Glotfelter, Z. Kira, and M. Egerstedt, "Safe reinforcement learning using robust control barrier functions," IEEE Robotics and Automation Letters, vol. 10, no. 3, pp. 2886–2893, 2022

  19. [19]

    Multi-UAV-UGV collision-free tracking control via control barrier function-based reinforcement learning,

    H. Xia, Q. Qi, X. Yang, X. Ju, and H. Su, "Multi-UAV-UGV collision-free tracking control via control barrier function-based reinforcement learning," in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 17115–17121

  20. [20]

    Applied nonlinear control,

    J.-J. E. Slotine, W. Li et al., Applied Nonlinear Control. Prentice Hall, Englewood Cliffs, NJ, 1991, vol. 199, no. 1

  21. [21]

    Control barrier functions: Theory and applications,

    A. D. Ames, S. Coogan, M. Egerstedt, G. Notomista, K. Sreenath, and P. Tabuada, "Control barrier functions: Theory and applications," in 2019 18th European Control Conference (ECC). IEEE, 2019, pp. 3420–3431

  22. [22]

    Stable-Baselines3: Reliable reinforcement learning implementations,

    A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, "Stable-Baselines3: Reliable reinforcement learning implementations," Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021

  23. [23]

    Addressing function approximation error in actor-critic methods,

    S. Fujimoto, H. Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596

  24. [24]

    Gymnasium: A Standard Interface for Reinforcement Learning Environments

    M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG et al., "Gymnasium: A standard interface for reinforcement learning environments," arXiv preprint arXiv:2407.17032, 2024

  25. [25]

    Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control,

    J. Panerati, H. Zheng, S. Zhou, J. Xu, A. Prorok, and A. P. Schoellig, "Learning to fly—a gym environment with pybullet physics for reinforcement learning of multi-agent quadcopter control," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 7512–7519