
arXiv: 2604.05062 · v1 · submitted 2026-04-06 · 💻 cs.RO

GaussFly: Contrastive Reinforcement Learning for Visuomotor Policies in 3D Gaussian Fields

Pith reviewed 2026-05-10 18:52 UTC · model grok-4.3

classification 💻 cs.RO
keywords visuomotor policies · 3D Gaussian Splatting · contrastive learning · reinforcement learning · sim-to-real transfer · autonomous aerial vehicles · monocular vision · representation learning

The pith

GaussFly reconstructs scenes with constrained 3D Gaussian Splatting and pre-trains contrastive features to let visuomotor policies learn efficiently in simulation and transfer zero-shot to real aerial vehicles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GaussFly to learn monocular visuomotor policies for autonomous aerial vehicles. It first reconstructs real training scenes into photorealistic simulations using 3D Gaussian Splatting with added geometric constraints. A contrastive encoder is then trained on the rendered images to produce compact, noise-resistant latent features. These features feed into a reinforcement learning policy, reducing the policy's input dimension and improving its robustness. The result is higher sample efficiency during training and direct deployment on unseen real environments with complex textures.

Core claim

By decoupling representation learning from policy optimization in a real-to-sim-to-real pipeline, GaussFly first builds high-fidelity training environments through 3D Gaussian Splatting augmented with explicit geometric constraints, then extracts robust low-dimensional features via contrastive learning on the rendered views. Feeding these features to the visuomotor policy yields superior sample efficiency and asymptotic performance in simulation while enabling robust zero-shot transfer to physical settings.
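
The contrastive stage in this claim can be illustrated with a SimCLR-style NT-Xent objective. This is a minimal sketch under stated assumptions: the summary here does not say which contrastive loss, augmentations, or temperature the paper uses, so `nt_xent_loss`, the paired-batch layout, and `temperature=0.5` are hypothetical choices made only for illustration.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """NT-Xent loss for two batches of paired embeddings, each of shape (N, D).

    Row i of z_a and row i of z_b are assumed to be two views of the same
    rendered frame (an assumption of this sketch, not a detail from the paper).
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)             # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalise rows
    sim = (z @ z.T) / temperature                      # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # a sample is never its own negative
    # The positive for row i is its other view: i + N (first half) or i - N (second half).
    pos_idx = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_denominator = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob_pos = sim[np.arange(2 * n), pos_idx] - log_denominator
    return float(-log_prob_pos.mean())

# Usage with synthetic embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
views_a = rng.normal(size=(8, 16))
views_b = views_a + 0.01 * rng.normal(size=(8, 16))    # nearly identical second views
print(nt_xent_loss(views_a, views_b))
```

The loss falls when the two views of a frame embed close together and other frames embed far apart, which is the property the pipeline relies on for noise-resistant features.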

What carries the argument

The real-to-sim-to-real paradigm that reconstructs scenes via constrained 3D Gaussian Splatting and applies contrastive representation learning to obtain compact features for the reinforcement learning policy.

If this is right

  • Visuomotor policies require far fewer environment interactions to reach high performance than direct image-to-action baselines.
  • The same policy achieves better final returns in simulation than methods that map raw pixels straight to controls.
  • A policy trained only in the reconstructed scenes executes successfully on real hardware in entirely new locations with complex textures.
  • Low-dimensional contrastive features lower the computational cost of policy training and inference.
  • The learned features remain stable under visual noise that would degrade raw-pixel policies.
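
The decoupling behind these points can be sketched as a frozen encoder feeding a small policy network. Everything in this block is a hypothetical stand-in: the image size, latent size, policy width, action size, and the random linear "encoder" are assumptions chosen for illustration, and none of them come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

IMAGE_SHAPE = (64, 64, 3)   # hypothetical camera resolution
LATENT_DIM = 128            # hypothetical feature size
HIDDEN_DIM = 256            # hypothetical policy width
ACTION_DIM = 4              # hypothetical action size

# Stand-in for the pre-trained contrastive encoder: a fixed random linear map.
# It is never updated, which mimics freezing the encoder during policy training.
W_enc = rng.normal(scale=0.01, size=(int(np.prod(IMAGE_SHAPE)), LATENT_DIM))

def encode(image):
    """Map an H x W x C image to a LATENT_DIM feature vector."""
    return np.tanh(image.reshape(-1) @ W_enc)

# Policy parameters: the only weights an RL algorithm would update in this sketch.
policy = {
    "W1": rng.normal(scale=0.1, size=(LATENT_DIM, HIDDEN_DIM)),
    "b1": np.zeros(HIDDEN_DIM),
    "W2": rng.normal(scale=0.1, size=(HIDDEN_DIM, ACTION_DIM)),
    "b2": np.zeros(ACTION_DIM),
}

def act(features):
    """Two-layer MLP policy head with actions squashed to [-1, 1]."""
    hidden = np.tanh(features @ policy["W1"] + policy["b1"])
    return np.tanh(hidden @ policy["W2"] + policy["b2"])

def count_params(params):
    return sum(p.size for p in params.values())

image = rng.uniform(size=IMAGE_SHAPE)
action = act(encode(image))
# Policy on features: 128*256 + 256 + 256*4 + 4 = 34,052 trainable weights.
# A dense first layer on raw pixels alone would need 12,288*256 + 256 = 3,145,984.
print(action.shape, count_params(policy))
```

An end-to-end baseline would normally use a convolutional front end rather than a dense layer on pixels, so the parameter comparison in the comments only illustrates how input dimension drives the size of whatever the policy optimizer has to train.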

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-trained encoder could be reused across multiple different tasks or vehicle platforms within similar visual domains without retraining.
  • Relaxing the geometric constraints during scene reconstruction would likely increase the remaining sim-to-real gap and reduce zero-shot success.
  • Extending the approach to dynamic scenes or moving objects would require updating the Gaussian reconstruction step to handle time-varying geometry.
  • The method suggests that explicit 3D reconstruction is more effective than purely image-based domain randomization for achieving texture-invariant features.

Load-bearing premise

The 3D Gaussian Splatting models with geometric constraints create simulated images close enough to real ones that contrastive features learned from them will generalize directly to physical environments with unseen complex textures.

What would settle it

Train a policy with GaussFly on a set of reconstructed scenes, then deploy it without any adaptation on a real aerial vehicle flying in an environment whose textures, lighting, and geometry were never part of the original reconstructions, and check whether success rate remains comparable to simulation.
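
One way to make "comparable" concrete in this check is to put a confidence interval around each success rate and see whether the intervals overlap. The sketch below uses the Wilson score interval. The trial counts are hypothetical placeholders, not results from the paper, and `wilson_interval` is a helper name introduced here.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 for about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / trials
                               + z * z / (4.0 * trials * trials)) / denom
    return centre - half_width, centre + half_width

# Hypothetical counts for illustration only.
sim_low, sim_high = wilson_interval(successes=92, trials=100)
real_low, real_high = wilson_interval(successes=17, trials=20)

# Overlapping intervals mean the data cannot distinguish the two rates.
intervals_overlap = sim_low <= real_high and real_low <= sim_high
print((sim_low, sim_high), (real_low, real_high), intervals_overlap)
```

Real flights are few, so the real-world interval is wide. Overlap is therefore weak evidence of parity, and a claim of comparable performance is only as strong as the number of physical trials behind it.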

Figures

Figures reproduced from arXiv: 2604.05062 by Chao Yan, Jiaping Xiao, Mingsheng Li, Mir Feroskhan, Yuhang Zhang, Yujing Shang, Zhuoyuan Yu.

Figure 1: The framework of GaussFly. (A) 3DGS-Based Scene Reconstruction. Background environments and foreground assets …
Figure 2: Illustration of the simulation environments.
Figure 3: Training reward curves. GaussFly achieves comparable …
Figure 5: Representative real-world flight trajectories in two indoor environments. Top: Third-person views of the AAV navigating …
Figure 6: t-SNE visualization under visual interference. Red and …
Figure 8: Attention visualization in real-world environments. By …
Figure 9: Ablation study on contrastive pre-training. We eval …
Figure 10: Ablation study on geometric constraints. We evaluate …
Original abstract

Learning visuomotor policies for Autonomous Aerial Vehicles (AAVs) relying solely on monocular vision is an attractive yet highly challenging paradigm. Existing end-to-end learning approaches directly map high-dimensional RGB observations to action commands, which frequently suffer from low sample efficiency and severe sim-to-real gaps due to the visual discrepancy between simulation and physical domains. To address these long-standing challenges, we propose GaussFly, a novel framework that explicitly decouples representation learning from policy optimization through a cohesive real-to-sim-to-real paradigm. First, to achieve a high-fidelity real-to-sim transition, we reconstruct training scenes using 3D Gaussian Splatting (3DGS) augmented with explicit geometric constraints. Second, to ensure robust sim-to-real transfer, we leverage these photorealistic simulated environments and employ contrastive representation learning to extract compact, noise-resilient latent features from the rendered RGB images. By utilizing this pre-trained encoder to provide low-dimensional feature inputs, the computational burden on the visuomotor policy is significantly reduced while its resistance against visual noise is inherently enhanced. Extensive experiments in simulated and real-world environments demonstrate that GaussFly achieves superior sample efficiency and asymptotic performance compared to baselines. Crucially, it enables robust and zero-shot policy transfer to unseen real-world environments with complex textures, effectively bridging the sim-to-real gap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The manuscript proposes GaussFly, a framework for visuomotor policy learning in autonomous aerial vehicles that decouples representation learning from policy optimization via a real-to-sim-to-real pipeline. Scenes are reconstructed using 3D Gaussian Splatting augmented with explicit geometric constraints to create high-fidelity simulations; contrastive representation learning is then applied to rendered RGB images to obtain compact, noise-resilient latent features. These features serve as low-dimensional inputs to a reinforcement learning policy trained in simulation, with the goal of achieving superior sample efficiency, asymptotic performance, and zero-shot transfer to unseen real-world environments with complex textures.

Significance. If the empirical claims hold, the work offers a coherent approach to mitigating the sim-to-real gap in high-dimensional visual control by leveraging photorealistic 3DGS reconstructions and contrastive pre-training. The explicit separation of representation learning from policy optimization reduces computational burden on the policy while enhancing robustness to visual noise, which could improve sample efficiency in robotics applications. The real-to-sim-to-real paradigm and use of established 3DGS and contrastive techniques are presented without internal contradictions.

minor comments (1)
  1. Abstract: the claims of superior sample efficiency, asymptotic performance, and zero-shot transfer are stated without any quantitative metrics, baseline names, or effect sizes; adding one or two key numerical results would make the summary self-contained and proportionate to the experimental emphasis in the full text.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, the significance assessment, and the recommendation of minor revision for our manuscript on GaussFly. The description accurately reflects how the framework decouples representation learning (3D Gaussian Splatting reconstruction followed by contrastive feature pre-training) from policy optimization, and how the real-to-sim-to-real pipeline targets improved sample efficiency and zero-shot transfer.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents a pipeline that applies established 3D Gaussian Splatting for scene reconstruction, contrastive representation learning on rendered images, and standard reinforcement learning for visuomotor policies. No derivation, equation, or claim reduces to its own inputs by construction; the central claims rest on empirical validation of the combined pipeline rather than self-referential definitions or fitted parameters renamed as predictions. The method section describes sequential stages without internal loops that would force results tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on two domain assumptions about reconstruction fidelity and feature robustness; no free parameters or invented entities are introduced in the abstract description.

axioms (2)
  • Domain assumption: 3D Gaussian Splatting augmented with explicit geometric constraints achieves high-fidelity real-to-sim scene reconstruction.
    Invoked as the first step to create photorealistic training environments.
  • Domain assumption: contrastive representation learning extracts compact, noise-resilient latent features from rendered RGB images that support effective policy learning.
    Invoked to justify reduced policy input dimensionality and improved sim-to-real robustness.

pith-pipeline@v0.9.0 · 5558 in / 1338 out tokens · 42793 ms · 2026-05-10T18:52:11.463016+00:00 · methodology


