
arxiv: 2604.02779 · v2 · submitted 2026-04-03 · 💻 cs.RO

Vision-Based End-to-End Learning for UAV Traversal of Irregular Gaps via Differentiable Simulation

Pith reviewed 2026-05-13 20:17 UTC · model grok-4.3

classification 💻 cs.RO
keywords UAV navigation · end-to-end learning · differentiable simulation · vision-based control · gap traversal · autonomous drones · SE(3) control

The pith

A vision-based end-to-end controller lets drones fly through irregular gaps without explicit measurement or planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that depth images alone can be mapped directly to control commands so that a UAV traverses complex, irregular gaps in environments never seen during training. Traditional approaches first extract and measure the gap, which breaks down for non-regular shapes, while recent learned methods often assume simple rectangular openings and therefore generalize poorly. The new framework trains entirely in differentiable simulation, using a stop-gradient operator and bimodal initialization to stabilize learning in SE(3) where position and orientation must be controlled together. Two auxiliary heads predict gap-crossing success and traversability to keep the drone safe during continuous flight. Real-world flights confirm that policies trained this way transfer without major domain-gap fixes.
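To make the idea of training through a differentiable simulator concrete, here is a minimal sketch in plain Python. It is not the paper's simulator: the 1-D double integrator, the feedback law, the loss, and the choice to apply the stop-gradient to the state observed by the policy are all assumptions made for illustration, since this summary does not say where the paper places the operator. A tiny forward-mode autodiff class (`Dual`, hypothetical) stands in for a real autodiff framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Forward-mode autodiff number: a value and its derivative w.r.t. one parameter."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)

    __radd__ = __add__

    def __neg__(self):
        return Dual(-self.val, -self.grad)

    def __sub__(self, other):
        return self + (-_lift(other))

    def __mul__(self, other):
        other = _lift(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)

    __rmul__ = __mul__


def _lift(x):
    """Promote plain numbers to Dual constants (zero derivative)."""
    return x if isinstance(x, Dual) else Dual(float(x))


def stop_gradient(x):
    """Keep the forward value but block derivative flow through it."""
    return Dual(x.val, 0.0)


def rollout_loss(gain, steps=20, dt=0.1, use_stop_gradient=False):
    """Roll out a toy 1-D double integrator and return the summed squared position.

    The returned Dual carries d(loss)/d(gain), i.e. the gradient obtained by
    differentiating through every simulation step.
    """
    k = Dual(gain, 1.0)           # seed: d(gain)/d(gain) = 1
    x, v = Dual(1.0), Dual(0.0)   # start 1 m from the target, at rest
    loss = Dual(0.0)
    for _ in range(steps):
        # Assumption: the stop-gradient is applied to the observed state.
        x_obs = stop_gradient(x) if use_stop_gradient else x
        a = -(k * x_obs) - v      # toy feedback "policy" with unit damping
        v = v + a * dt
        x = x + v * dt
        loss = loss + x * x
    return loss


full = rollout_loss(2.0)
cut = rollout_loss(2.0, use_stop_gradient=True)
print("loss:", full.val)
print("gradient through the whole rollout:", full.grad)
print("gradient with stop-gradient on observations:", cut.grad)
```

The forward rollout is identical in both cases; only the backward signal changes, which is the general reason a stop-gradient can be used to stabilize training without altering the simulated behavior.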

Core claim

We present a fully vision-based, end-to-end framework that maps depth images directly to control commands, enabling drones to traverse complex gaps within unseen environments. Operating in SE(3), the framework leverages differentiable simulation, a Stop-Gradient operator, and a Bimodal Initialization Distribution to achieve stable traversal through consecutive gaps. Two auxiliary prediction modules—a gap-crossing success classifier and a traversability predictor—further enhance continuous navigation and safety.

What carries the argument

Differentiable simulation of UAV dynamics and contact forces, paired with a Stop-Gradient operator and Bimodal Initialization Distribution, that trains a policy mapping raw depth images to SE(3) control commands.
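As a rough illustration of what a bimodal initialization distribution could look like, the sketch below draws the drone's starting offset along the approach axis from a two-component Gaussian mixture. The two modes (one well before the gap, one just past it), their means and spreads, and the mixing weight are all hypothetical; this summary does not give the paper's actual modes or parameters.

```python
import random


def sample_bimodal_init(rng, p_first=0.5, mode_a=(-3.0, 0.5), mode_b=(0.5, 0.2)):
    """Sample an initial along-axis offset (metres) from a two-mode mixture.

    mode_a and mode_b are (mean, std) pairs; all values are illustrative.
    Negative offsets are in front of the gap, positive offsets are past it.
    """
    mean, std = mode_a if rng.random() < p_first else mode_b
    return rng.gauss(mean, std)


rng = random.Random(0)
samples = [sample_bimodal_init(rng) for _ in range(2000)]
before_gap = sum(1 for s in samples if s < -1.0)
print("fraction starting before the gap:", before_gap / len(samples))
```

Starting some episodes from the second mode would expose the policy to post-gap states early in training, which is one plausible reading of why such a distribution helps with consecutive gaps.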

If this is right

  • Policies trained purely in simulation can be deployed on physical UAVs for irregular-gap tasks without hand-tuned gap detectors.
  • The same depth-to-command mapping supports continuous flight through multiple consecutive gaps rather than isolated single-gap maneuvers.
  • Auxiliary success and traversability predictors provide built-in safety checks that reduce collision risk during autonomous navigation.
  • Applications such as inspection and search-and-rescue become feasible in cluttered, previously unmapped spaces where explicit 3-D reconstruction is impractical.
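The second and third bullets can be sketched as a small wrapper around the policy output. The Figure 2 and Figure 8 captions describe the behavior (hidden-state reset after a detected crossing, emergency stop when the traversability score drops), but the thresholds, the hover command, the latching behavior, and every name below are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    command: tuple   # command actually sent to the flight controller
    hidden: list     # recurrent state carried to the next step
    stopped: bool    # True once the emergency stop has latched


HOVER = (0.0, 0.0, 0.0, 0.0)  # hypothetical "stop" command


def gate_step(policy_command, hidden, crossing_prob, trav_score,
              stopped=False, crossing_thresh=0.5, trav_thresh=0.3):
    """Apply the two auxiliary predictions to one control step.

    - A low traversability score latches an emergency stop.
    - A high gap-crossing probability resets the recurrent state so the
      policy approaches the next gap from a fresh state.
    """
    if stopped or trav_score < trav_thresh:
        return StepResult(HOVER, hidden, True)
    if crossing_prob > crossing_thresh:
        hidden = [0.0] * len(hidden)
    return StepResult(tuple(policy_command), hidden, False)


result = gate_step((0.1, 0.0, 0.2, 0.6), [0.4, -0.2], crossing_prob=0.9, trav_score=0.8)
print(result)
```

Latching the stop is a conservative design choice made here: once the predictor has flagged a gap as unsafe, the sketch does not resume flight on its own.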

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same differentiable-simulation recipe could be applied to other contact-rich tasks such as perching or object manipulation where explicit contact models are hard to write.
  • Replacing the depth-image input with raw RGB or event-camera streams might further reduce sensor cost while preserving generalization.
  • Because the method never builds an explicit map, it could serve as a low-latency fallback layer when SLAM or global planning temporarily fails.

Load-bearing premise

The differentiable simulation must accurately reproduce real-world UAV dynamics, contact forces, and visual observations so that policies trained inside it transfer to physical drones without large domain gaps.

What would settle it

A real drone equipped with the trained policy repeatedly collides or fails to cross a sequence of irregular gaps that the simulation predicted it would traverse successfully.

Figures

Figures reproduced from arXiv: 2604.02779 by Danping Zou, Feng Yu, Linzuo Zhang, Wenxian Yu, Yang Deng, Yu Hu.

Figure 1: Visualization of the training and real-world evaluation. The top row shows depth image sequences collected during training; the middle row shows depth images captured by the real drone during execution; the bottom row illustrates the real-world trajectory, where the drone first flies through a regular gate similar to the training scenarios and then successfully navigates an irregular, previously unseen gap…
Figure 2: System Overview. An end-to-end policy maps depth images directly to control commands and is trained via differentiable simulation, enabling direct back-propagation of task losses to the network. A gap-crossing detection module resets the policy hidden state to support continuous multi-gap traversal, while a traversability prediction module improves safety in challenging environments. A bimodal initializati…
Figure 3: Mesh-based depth renderer. A high-speed CUDA-based renderer generates depth images from mesh geometries. Domain randomization is applied by perturbing the angles and corner vertices, producing diverse gap configurations for training. B. Back-propagation through Time. In contrast to other CUDA-accelerated simulators such as Isaac Sim, which solely support forward simulation and do not provide analytical gr…
Figure 4: AirSim Simulation environments. (a) Single-gap scenario, where the quadrotor flies through a single tilted gap. (b) Multi-gap scenario, where the quadrotor sequentially flies through multiple tilted gaps. (c) Wall-mounted gap scenario, where a square opening is embedded in a planar wall. recurrent unit (GRU), which feeds into a linear head for continuous control. Additionally, the GRU hidden state serves a…
Figure 5: Baseline comparison in AirSim. We compare our end-to-end vision-based policy with two baselines using two depth inputs: ground-truth (GT) and Semi-Global Matching (SGM). Baselines are: (1) a PPO-based policy with edge-drawing front-end, and (2) a state-of-the-art vision-based navigation method [5]. To better approximate real-world conditions, we captured left and right grayscale images and applied the Semi…
Figure 7: Real-world quadrotor experiments in diverse environments. Top left: regular gaps with varying tilt angles. Top right: irregular gaps, including a half-masked ring, an irregular hole, tiled trees, and a partially occluded door. The success rate (SR) is annotated on each subplot. particular, without BIO, the quadrotor rapidly loses altitude, reflecting unstable post-gap states. These results confirm that com…
Figure 8: Evaluation of the traversability prediction module on the real quadrotor. The quadrotor first successfully traverses the gap, while the predicted traversability score remains high. When the target gap is switched to an irregular hole partially occluded by an obstacle, the predicted score drops sharply, triggering an emergency stop. generalizable strategies for agile gap traversal beyond the training distri…
Figure 9: Additional experiments. Left: average trajectories under different target direction noise levels. Right: PR curves of the collision traversability predictor under different model sizes. to execute an emergency stop. These real-world experiments confirm that our framework not only transfers seamlessly from simulation to hardware but also generalizes effectively to unseen scenarios, thereby validating its pr…
Original abstract

Navigation through narrow and irregular gaps is an essential skill for autonomous drones in applications such as inspection, search-and-rescue, and disaster response. However, traditional planning and control methods rely on explicit gap extraction and measurement, while recent end-to-end approaches often assume regularly shaped gaps, leading to poor generalization and limited practicality. In this work, we present a fully vision-based, end-to-end framework that maps depth images directly to control commands, enabling drones to traverse complex gaps within unseen environments. Operating in the Special Euclidean group SE(3), where position and orientation are tightly coupled, the framework leverages differentiable simulation, a Stop-Gradient operator, and a Bimodal Initialization Distribution to achieve stable traversal through consecutive gaps. Two auxiliary prediction modules (a gap-crossing success classifier and a traversability predictor) further enhance continuous navigation and safety. Extensive simulation and real-world experiments demonstrate the approach's effectiveness, generalization capability, and practical robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to present a fully vision-based, end-to-end framework that maps depth images directly to control commands, enabling UAVs to traverse complex irregular gaps in unseen environments. Operating in SE(3), it leverages differentiable simulation together with a stop-gradient operator and bimodal initialization distribution for stable traversal through consecutive gaps; two auxiliary predictors (gap-crossing success classifier and traversability predictor) are added to support continuous navigation and safety. The approach is supported by extensive simulation and real-world experiments demonstrating generalization and practical robustness.

Significance. If the sim-to-real transfer holds, the work offers a meaningful step toward practical autonomous navigation for drones in cluttered or disaster-response settings by removing reliance on explicit gap extraction and geometric planning. The use of differentiable simulation for direct depth-to-action learning is a timely direction, and the auxiliary predictors provide a concrete mechanism for safety-aware continuous flight. Impact would be higher with stronger evidence that the simulator captures the contact-rich dynamics that dominate real gap traversal.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: the central claim that sim-trained policies transfer to real UAVs for irregular-gap traversal rests on the unverified fidelity of the differentiable simulator in reproducing SE(3) dynamics, depth observations, and especially contact forces/friction. No quantitative sim-real metrics (trajectory error, force matching, or ablation on contact modeling) are reported, leaving the generalization argument load-bearing but unsupported.
  2. [Method] Method description (around the differentiable simulation and stop-gradient operator): the paper invokes these components for stability but does not specify how contact forces or visual noise are modeled, nor does it provide an ablation isolating their contribution to sim-to-real success. This detail is required to evaluate whether the end-to-end mapping can be expected to generalize beyond the training distribution.
minor comments (2)
  1. Clarify the exact network architecture and loss weighting between the main policy and the two auxiliary predictors; the current description leaves their integration ambiguous.
  2. Figure captions and axis labels in the experimental results should explicitly state the number of trials and the definition of success (e.g., minimum clearance or completion time).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of sim-to-real validation and methodological detail that we address point by point below. We have revised the manuscript to incorporate additional quantitative analysis and expanded descriptions where feasible.

Point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim that sim-trained policies transfer to real UAVs for irregular-gap traversal rests on the unverified fidelity of the differentiable simulator in reproducing SE(3) dynamics, depth observations, and especially contact forces/friction. No quantitative sim-real metrics (trajectory error, force matching, or ablation on contact modeling) are reported, leaving the generalization argument load-bearing but unsupported.

    Authors: We acknowledge that the manuscript does not include explicit quantitative sim-to-real metrics such as trajectory error or force matching. Our real-world experiments demonstrate successful policy transfer for irregular gap traversal, but we agree that stronger quantitative evidence would better support the generalization claims. In the revised version, we will add a dedicated sim-to-real validation subsection reporting metrics including position and orientation errors between simulated and physical trajectories, along with a discussion of the contact force and friction modeling assumptions used in the differentiable simulator. revision: yes

  2. Referee: [Method] Method description (around the differentiable simulation and stop-gradient operator): the paper invokes these components for stability but does not specify how contact forces or visual noise are modeled, nor does it provide an ablation isolating their contribution to sim-to-real success. This detail is required to evaluate whether the end-to-end mapping can be expected to generalize beyond the training distribution.

    Authors: We agree that the current method description lacks sufficient detail on contact force modeling, friction parameters, and depth image noise. The revised manuscript will expand the differentiable simulation section to explicitly describe the contact model (including penalty-based forces and friction coefficients) and the visual noise injection process. We will also add an ablation study isolating the effects of these modeling choices on both simulation performance and real-world transfer success. revision: yes
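The position and orientation errors promised in the first response could be computed along the following lines. This is a minimal sketch under stated assumptions: trajectories are already time-aligned and expressed in a common frame, positions are 3-tuples, and orientations are 3x3 rotation matrices given as nested lists. The function names are hypothetical and nothing here comes from the paper.

```python
import math


def position_rmse(sim_positions, real_positions):
    """Root-mean-square position error between two time-aligned trajectories."""
    if len(sim_positions) != len(real_positions) or not sim_positions:
        raise ValueError("trajectories must be non-empty and equally long")
    squared = [sum((a - b) ** 2 for a, b in zip(p, q))
               for p, q in zip(sim_positions, real_positions)]
    return math.sqrt(sum(squared) / len(squared))


def rotation_angle_error(r_sim, r_real):
    """Geodesic angle (radians) between two rotation matrices.

    Uses angle = arccos((trace(R_sim^T R_real) - 1) / 2), where the trace of
    R_sim^T R_real equals the sum of elementwise products of the two matrices.
    """
    trace = sum(r_sim[i][j] * r_real[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard against round-off
    return math.acos(cos_angle)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
yaw_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(position_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)]))
print(rotation_angle_error(identity, yaw_90))
```

Reporting both numbers per trial, rather than a single success rate, would let a reader see how closely the simulated and physical flights agree even when both succeed.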

Circularity Check

0 steps flagged

No evident circularity; framework uses standard differentiable simulation and auxiliary predictors

full rationale

The paper presents an end-to-end vision-based policy trained via differentiable simulation for UAV gap traversal. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described methods. Stop-Gradient, bimodal initialization, and auxiliary classifiers are standard techniques with independent grounding in RL/sim-to-real literature. The derivation chain remains self-contained against external benchmarks (sim and real experiments), yielding only a minor score for routine self-references if any exist in the full text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that differentiable simulation faithfully reproduces real UAV-gap interactions; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: Differentiable simulation accurately captures UAV dynamics and visual observations for irregular gaps
    Invoked to justify end-to-end training from sim to real.

pith-pipeline@v0.9.0 · 5472 in / 1096 out tokens · 39564 ms · 2026-05-13T20:17:30.588291+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    Learning high-speed flight in the wild,

    A. Loquercio, E. Kaufmann, R. Ranftl, M. Müller, V. Koltun, and D. Scaramuzza, “Learning high-speed flight in the wild,” Science Robotics, vol. 6, no. 59, p. eabg5810, 2021. Publisher: American Association for the Advancement of Science

  2. [2]

    Collision-tolerant autonomous navigation through manhole-sized confined environments,

    P. De Petris, H. Nguyen, T. Dang, F. Mascarich, and K. Alexis, “Collision-tolerant autonomous navigation through manhole-sized confined environments,” in 2020 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 84–89, IEEE, 2020

  3. [3]

    Vds-nav: Volumetric depth-based safe navigation for aerial robots–bridging the sim-to-real gap,

    V. H. Dang, A. Redder, H. X. Pham, A. Sarabakha, and E. Kayacan, “Vds-nav: Volumetric depth-based safe navigation for aerial robots–bridging the sim-to-real gap,” IEEE Robotics and Automation Letters, vol. 10, no. 10, pp. 11038–11045, 2025

  4. [4]

    Mavrl: Learn to fly in cluttered environments with varying speed,

    H. Yu, C. De Wagter, and G. C. E. de Croon, “Mavrl: Learn to fly in cluttered environments with varying speed,” IEEE Robotics and Automation Letters, 2024

  5. [5]

    Learning vision-based agile flight via differentiable physics,

    Y. Zhang, Y. Hu, Y. Song, D. Zou, and W. Lin, “Learning vision-based agile flight via differentiable physics,” Nature Machine Intelligence, pp. 1–13, 2025

  6. [6]

    Seeing through pixel motion: learning obstacle avoidance from optical flow with one camera,

    Y. Hu, Y. Zhang, Y. Song, Y. Deng, F. Yu, L. Zhang, W. Lin, D. Zou, and W. Yu, “Seeing through pixel motion: learning obstacle avoidance from optical flow with one camera,” IEEE Robotics and Automation Letters, 2025

  7. [7]

    Quadrotor navigation using reinforcement learning with privileged information,

    J. Lee, A. Rathod, K. Goel, J. Stecklein, and W. Tabib, “Quadrotor navigation using reinforcement learning with privileged information,” arXiv preprint arXiv:2509.08177, 2025

  8. [8]

    Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision,

    D. Falanga, E. Mueggler, M. Faessler, and D. Scaramuzza, “Aggressive quadrotor flight through narrow gaps with onboard sensing and computing using active vision,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 5774–5781, 2017

  9. [9]

    Search-based motion planning for aggressive flight in se(3),

    S. Liu, K. Mohta, N. Atanasov, and V. Kumar, “Search-based motion planning for aggressive flight in se(3),” IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 2439–2446, 2018

  10. [10]

    Flying through a narrow gap using end-to-end deep reinforcement learning augmented with curriculum learning and sim2real,

    C. Xiao, P. Lu, and Q. He, “Flying through a narrow gap using end-to-end deep reinforcement learning augmented with curriculum learning and sim2real,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 5, pp. 2701–2708, 2023

  11. [11]

    Learning real-time dynamic responsive gap-traversing policy for quadrotors with safety-aware exploration,

    S. Chen, Y. Li, Y. Lou, K. Lin, and X. Wu, “Learning real-time dynamic responsive gap-traversing policy for quadrotors with safety-aware exploration,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 3, pp. 2271–2284, 2023

  12. [12]

    Flying through a narrow gap using neural network: an end-to-end planning and control approach,

    J. Lin, L. Wang, F. Gao, S. Shen, and F. Zhang, “Flying through a narrow gap using neural network: an end-to-end planning and control approach,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3526–3533, 2019

  13. [13]

    Whole-body control through narrow gaps from pixels to action,

    T. Wu, Y. Chen, T. Chen, G. Zhao, and F. Gao, “Whole-body control through narrow gaps from pixels to action,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 11317–11324, 2025

  14. [14]

    Whole-body real-time motion planning for multicopters,

    S. Yang, B. He, Z. Wang, C. Xu, and F. Gao, “Whole-body real-time motion planning for multicopters,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 9197–9203, 2021

  15. [15]

    Gradients are not all you need

    L. Metz, C. D. Freeman, S. S. Schoenholz, and T. Kachman, “Gradients are not all you need,” arXiv preprint arXiv:2111.05803, 2021

  16. [17]

    Learning quadrotor control from visual features using differentiable simulation,

    J. Heeg, Y. Song, and D. Scaramuzza, “Learning quadrotor control from visual features using differentiable simulation,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 4033–4039, 2025

  17. [18]

    Visfly: An efficient and versatile simulator for training vision-based flight,

    F. Li, F. Sun, T. Zhang, and D. Zou, “Visfly: An efficient and versatile simulator for training vision-based flight,” in 2025 IEEE International Conference on Robotics and Automation (ICRA), pp. 11325–11332, IEEE, 2025

  18. [19]

    Learning quadruped locomotion using differentiable simulation,

    Y. Song, S. Kim, and D. Scaramuzza, “Learning quadruped locomotion using differentiable simulation,” arXiv preprint arXiv:2403.14864, 2024

  19. [20]

    Training efficient controllers via analytic policy gradient,

    N. Wiedemann, V. Wüest, A. Loquercio, M. Müller, D. Floreano, and D. Scaramuzza, “Training efficient controllers via analytic policy gradient,” in 2023 International Conference on Robotics and Automation (ICRA), IEEE, 2023

  20. [21]

    Diffaero: A gpu-accelerated differentiable simulation framework for efficient quadrotor policy learning,

    X. Zhang, R. Wang, Y. Ren, J. Sun, H. Fang, J. Chen, and G. Wang, “Diffaero: A gpu-accelerated differentiable simulation framework for efficient quadrotor policy learning,” arXiv preprint arXiv:2509.10247, 2025

  21. [22]

    Aerial gym simulator: A framework for highly parallelized simulation of aerial robots,

    M. Kulkarni, W. Rehberg, and K. Alexis, “Aerial gym simulator: A framework for highly parallelized simulation of aerial robots,” IEEE Robotics and Automation Letters, pp. 1–8, 2025

  22. [23]

    Geometric tracking control of a quadrotor uav on se(3),

    T. Lee, M. Leok, and N. H. McClamroch, “Geometric tracking control of a quadrotor uav on se(3),” in 49th IEEE Conference on Decision and Control (CDC), pp. 5420–5425, 2010

  23. [24]

    Airsim: High-fidelity visual and physical simulation for autonomous vehicles,

    S. Shah, D. Dey, C. Lovett, and A. Kapoor, “Airsim: High-fidelity visual and physical simulation for autonomous vehicles,” in Field and Service Robotics, 2017

  24. [25]

    Stereo processing by semiglobal matching and mutual information,

    H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341, 2007

  25. [26]

    Learning agile flights through narrow gaps with varying angles using onboard sensing,

    Y. Xie, M. Lu, R. Peng, and P. Lu, “Learning agile flights through narrow gaps with varying angles using onboard sensing,” IEEE Robotics and Automation Letters, vol. 8, no. 9, pp. 5424–5431, 2023

  26. [27]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

  27. [28]

    Real-time edge segment detection with edge drawing algorithm,

    C. Topal, O. Ozsen, and C. Akinlar, “Real-time edge segment detection with edge drawing algorithm,” in 2011 7th International Symposium on Image and Signal Processing and Analysis (ISPA), pp. 313–318, 2011