GaussFly: Contrastive Reinforcement Learning for Visuomotor Policies in 3D Gaussian Fields
Pith reviewed 2026-05-10 18:52 UTC · model grok-4.3
The pith
GaussFly reconstructs scenes with constrained 3D Gaussian Splatting and pre-trains contrastive features to let visuomotor policies learn efficiently in simulation and transfer zero-shot to real aerial vehicles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By decoupling representation learning from policy optimization in a real-to-sim-to-real pipeline, GaussFly first builds high-fidelity training environments through 3D Gaussian Splatting augmented with explicit geometric constraints, then extracts robust low-dimensional features via contrastive learning on the rendered views. Feeding these features to the visuomotor policy yields superior sample efficiency and asymptotic performance in simulation while enabling robust zero-shot transfer to physical settings.
What carries the argument
The real-to-sim-to-real paradigm that reconstructs scenes via constrained 3D Gaussian Splatting and applies contrastive representation learning to obtain compact features for the reinforcement learning policy.
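The contrastive step can be illustrated with an InfoNCE-style loss, the objective named in the paper passage quoted further down this page. What follows is a minimal pure-Python sketch, not the authors' implementation: the function names, the cosine-similarity scoring, and the temperature of 0.1 are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of embedding pairs.

    anchors[i] and positives[i] are embeddings of two views of the same
    scene. For anchors[i], every positives[j] with j != i acts as a negative.
    The temperature value is an illustrative assumption.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine_similarity(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matching view.
        total += log_denominator - logits[i]
    return total / len(anchors)
```

The loss is small when each anchor is most similar to its own paired view and large when it is more similar to another sample's view, which is the property that pushes the encoder toward features shared across views of the same scene.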
If this is right
- Visuomotor policies require far fewer environment interactions to reach high performance than direct image-to-action baselines.
- The same policy achieves better final returns in simulation than methods that map raw pixels straight to controls.
- A policy trained only in the reconstructed scenes executes successfully on real hardware in entirely new locations with complex textures.
- Low-dimensional contrastive features lower the computational cost of policy training and inference.
- The learned features remain stable under visual noise that would degrade raw-pixel policies.
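One way to make the "fewer environment interactions" claim checkable is to measure how many steps a learning curve needs before its smoothed return reaches a fixed threshold. The helper below is a hedged sketch of such a metric. The function name, the moving-average window, and the threshold are assumptions, and nothing here reproduces the paper's evaluation protocol.

```python
def steps_to_threshold(steps, returns, threshold, window=3):
    """Return the first environment step at which the moving average of
    `returns` over the last `window` evaluations reaches `threshold`.

    Returns None if the threshold is never reached. `steps[i]` is the
    number of environment interactions at evaluation i.
    """
    if len(steps) != len(returns):
        raise ValueError("steps and returns must have equal length")
    if window < 1:
        raise ValueError("window must be at least 1")
    for end in range(window, len(returns) + 1):
        average = sum(returns[end - window:end]) / window
        if average >= threshold:
            return steps[end - 1]
    return None
```

Applied to two curves evaluated at the same checkpoints, a smaller result (or a value versus None) indicates the more sample-efficient method under this particular definition.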
Where Pith is reading between the lines
- The same pre-trained encoder could be reused across tasks or vehicle platforms within similar visual domains without retraining.
- Relaxing the geometric constraints during scene reconstruction would likely increase the remaining sim-to-real gap and reduce zero-shot success.
- Extending the approach to dynamic scenes or moving objects would require updating the Gaussian reconstruction step to handle time-varying geometry.
- The method suggests that explicit 3D reconstruction is more effective than purely image-based domain randomization for achieving texture-invariant features.
Load-bearing premise
The 3D Gaussian Splatting models with geometric constraints create simulated images close enough to real ones that contrastive features learned from them will generalize directly to physical environments with unseen complex textures.
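The page later quotes the paper as enforcing "planar constraints and normal consistency" during reconstruction, but the exact loss is not given here. A common way to express a normal-consistency term is one minus the cosine between two normal estimates for the same pixel. The sketch below assumes that form; the function names and the choice of inputs (a rendered normal and a depth-derived normal per pixel) are illustrative assumptions, not the authors' formulation.

```python
import math


def normalize(vector):
    """Scale a non-zero 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in vector))
    return [c / length for c in vector]


def normal_consistency_penalty(rendered_normals, depth_normals):
    """Mean of (1 - cosine) between paired normal estimates.

    0 means the two estimates agree everywhere; 1 means they are
    orthogonal on average. Assumed form, for illustration only.
    """
    if len(rendered_normals) != len(depth_normals):
        raise ValueError("both inputs must contain the same number of normals")
    total = 0.0
    for rendered, from_depth in zip(rendered_normals, depth_normals):
        a = normalize(rendered)
        b = normalize(from_depth)
        total += 1.0 - sum(x * y for x, y in zip(a, b))
    return total / len(rendered_normals)
```

A term like this penalizes surfaces whose shading normals disagree with the geometry implied by depth, which is one plausible route to the geometric fidelity the premise depends on.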
What would settle it
Train a policy with GaussFly on a set of reconstructed scenes, then deploy it without any adaptation on a real aerial vehicle flying in an environment whose textures, lighting, and geometry were never part of the original reconstructions, and check whether success rate remains comparable to simulation.
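Whether a real-world success rate is "comparable" to the simulated one depends on how many trials back each number. A Wilson score interval is one standard way to attach uncertainty to a success rate from a small number of flights. The sketch below is an illustration under assumed names and a 95% confidence level (z = 1.96); it is not a test the paper reports.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95% confidence (an assumed choice).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denominator = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denominator
    half_width = (
        z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
        / denominator
    )
    return center - half_width, center + half_width


def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]
```

Computing one interval for simulated episodes and one for real flights, then checking overlap, gives a rough sanity check; overlapping intervals do not prove equivalence, they only show the data cannot distinguish the two rates.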
Original abstract
Learning visuomotor policies for Autonomous Aerial Vehicles (AAVs) relying solely on monocular vision is an attractive yet highly challenging paradigm. Existing end-to-end learning approaches directly map high-dimensional RGB observations to action commands, which frequently suffer from low sample efficiency and severe sim-to-real gaps due to the visual discrepancy between simulation and physical domains. To address these long-standing challenges, we propose GaussFly, a novel framework that explicitly decouples representation learning from policy optimization through a cohesive real-to-sim-to-real paradigm. First, to achieve a high-fidelity real-to-sim transition, we reconstruct training scenes using 3D Gaussian Splatting (3DGS) augmented with explicit geometric constraints. Second, to ensure robust sim-to-real transfer, we leverage these photorealistic simulated environments and employ contrastive representation learning to extract compact, noise-resilient latent features from the rendered RGB images. By utilizing this pre-trained encoder to provide low-dimensional feature inputs, the computational burden on the visuomotor policy is significantly reduced while its resistance against visual noise is inherently enhanced. Extensive experiments in simulated and real-world environments demonstrate that GaussFly achieves superior sample efficiency and asymptotic performance compared to baselines. Crucially, it enables robust and zero-shot policy transfer to unseen real-world environments with complex textures, effectively bridging the sim-to-real gap.
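The decoupling described in the abstract can be pictured as a frozen encoder that turns each image into a short feature vector, with only a small policy head trained on top. The toy sketch below illustrates why the policy's size then depends on the latent dimension rather than the pixel count. Everything in it is a stand-in: `FrozenEncoder` is a fixed random linear projection rather than a contrastively trained network, `LinearPolicy` is a placeholder for the RL policy, and all dimensions are hypothetical.

```python
import random


class FrozenEncoder:
    """Stand-in for a pre-trained encoder: a fixed linear projection.

    The weights are set once at construction and never updated, which
    mimics freezing the encoder during policy training.
    """

    def __init__(self, input_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / (input_dim ** 0.5)
        self.weights = [
            [rng.gauss(0.0, 1.0) * scale for _ in range(input_dim)]
            for _ in range(latent_dim)
        ]

    def encode(self, pixels):
        """Map a flattened image to a low-dimensional feature vector."""
        return [sum(w * x for w, x in zip(row, pixels)) for row in self.weights]


class LinearPolicy:
    """Placeholder policy head that maps latent features to actions."""

    def __init__(self, latent_dim, action_dim):
        self.weights = [[0.0] * latent_dim for _ in range(action_dim)]

    def act(self, latent):
        return [sum(w * z for w, z in zip(row, latent)) for row in self.weights]

    def num_parameters(self):
        # Only these parameters would be updated by the RL algorithm.
        return sum(len(row) for row in self.weights)
```

With, say, a 12-value input and a 4-value latent, the policy head holds `action_dim * 4` weights regardless of image size; an end-to-end pixel policy of the same form would need `action_dim * 12`.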
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GaussFly, a framework for visuomotor policy learning in autonomous aerial vehicles that decouples representation learning from policy optimization via a real-to-sim-to-real pipeline. Scenes are reconstructed using 3D Gaussian Splatting augmented with explicit geometric constraints to create high-fidelity simulations; contrastive representation learning is then applied to rendered RGB images to obtain compact, noise-resilient latent features. These features serve as low-dimensional inputs to a reinforcement learning policy trained in simulation, with the goal of achieving superior sample efficiency, asymptotic performance, and zero-shot transfer to unseen real-world environments with complex textures.
Significance. If the empirical claims hold, the work offers a coherent approach to mitigating the sim-to-real gap in high-dimensional visual control by leveraging photorealistic 3DGS reconstructions and contrastive pre-training. The explicit separation of representation learning from policy optimization reduces computational burden on the policy while enhancing robustness to visual noise, which could improve sample efficiency in robotics applications. The real-to-sim-to-real paradigm and use of established 3DGS and contrastive techniques are presented without internal contradictions.
minor comments (1)
- Abstract: the claims of superior sample efficiency, asymptotic performance, and zero-shot transfer are stated without any quantitative metrics, baseline names, or effect sizes; adding one or two key numerical results would make the summary self-contained and proportionate to the experimental emphasis in the full text.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision for our manuscript on GaussFly. The description accurately reflects how the framework decouples representation learning (contrastive features learned on 3D Gaussian Splatting renders) from policy optimization, along with the real-to-sim-to-real pipeline for improved sample efficiency and zero-shot transfer.
Circularity Check
No significant circularity detected
The paper presents a pipeline that applies established 3D Gaussian Splatting for scene reconstruction, contrastive representation learning on rendered images, and standard reinforcement learning for visuomotor policies. No derivation, equation, or claim reduces to its own inputs by construction; the central claims rest on empirical validation of the combined pipeline rather than self-referential definitions or fitted parameters renamed as predictions. The method section describes sequential stages without internal loops that would force results tautologically.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: 3D Gaussian Splatting augmented with explicit geometric constraints achieves high-fidelity real-to-sim scene reconstruction
- domain assumption: Contrastive representation learning extracts compact, noise-resilient latent features from rendered RGB images that support effective policy learning
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We reconstruct training scenes using 3D Gaussian Splatting (3DGS) augmented with explicit geometric constraints... contrastive representation learning... InfoNCE loss... PPO algorithm"
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We enforce explicit planar constraints and normal consistency during optimization"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.