pith. machine review for the scientific record.

arxiv: 2603.28032 · v2 · submitted 2026-03-30 · 💻 cs.RO · cs.AI · cs.CV · cs.HC

Recognition: no theorem link

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 22:09 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV · cs.HC
keywords CARLA-Air · unified simulation · air-ground cooperation · UAV dynamics · embodied intelligence · ROS 2 · Unreal Engine · multi-modal perception

The pith

CARLA-Air unifies high-fidelity urban driving and multirotor flight inside one Unreal Engine process while preserving original CARLA and AirSim APIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CARLA-Air to solve the separation between driving simulators and aerial simulators by running both inside a single Unreal Engine instance. This creates a shared physics and rendering pipeline so ground vehicles, pedestrians, and UAVs operate with strict spatial-temporal consistency and no bridge overhead. Existing Python code and ROS 2 scripts for either platform continue to work unchanged. The system synchronously captures up to 18 sensor streams and supports workloads such as air-ground cooperation, embodied navigation, and reinforcement learning policy training. An extensible asset pipeline also allows custom robots to be added to the shared world.
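The shared-tick design the pith describes can be sketched in miniature. The classes below are hypothetical stand-ins, not CARLA-Air's actual backends: the point is only that when both simulators advance inside one fixed-timestep loop, every agent's state carries an identical tick stamp by construction, which is what eliminates bridge-style synchronization drift.

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    """Hypothetical stand-in for one simulation backend (ground or aerial)."""
    name: str
    states: list = field(default_factory=list)

    def step(self, tick: int, dt: float) -> None:
        # Record the shared tick index and the simulated time it implies.
        self.states.append((tick, round(tick * dt, 6)))

def run_shared_tick(backends, n_ticks: int, dt: float = 0.05) -> None:
    """Advance all backends on a single fixed-timestep clock, mimicking one
    shared engine tick: no per-backend clocks, no inter-process bridge."""
    for tick in range(n_ticks):
        for b in backends:  # every backend steps inside the same tick
            b.step(tick, dt)

ground = Backend("carla-ground")
aerial = Backend("airsim-uav")
run_shared_tick([ground, aerial], n_ticks=5)

# Because both backends share one clock, their timestamps match exactly.
assert ground.states == aerial.states
```

A bridge-based co-simulation would replace the inner loop with message passing between two independently ticking processes, which is exactly where the paper locates the consistency problem.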

Core claim

CARLA-Air unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process while preserving both CARLA and AirSim native Python APIs and ROS 2 interfaces, delivering photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics.

What carries the argument

Single-process integration of CARLA ground simulation and AirSim aerial dynamics inside one shared Unreal Engine tick and rendering pipeline.

Load-bearing premise

Merging the two systems keeps exact spatial-temporal alignment, photorealism, and full API compatibility without added latency or breakage.

What would settle it

Running joint air-ground scenarios and observing either timestamp mismatches between agents or failure of unmodified original CARLA or AirSim scripts.
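That falsification test could be automated as a consistency check over per-agent sensor timestamps. A generic sketch (the stream layout is an assumption, not CARLA-Air's actual data format):

```python
def max_timestamp_skew(streams: dict) -> float:
    """Return the worst per-frame timestamp spread across agents.

    streams maps an agent name to its list of per-frame timestamps; frames
    are assumed index-aligned (frame i of every stream belongs to the same
    simulation tick)."""
    n_frames = min(len(ts) for ts in streams.values())
    skew = 0.0
    for i in range(n_frames):
        stamps = [ts[i] for ts in streams.values()]
        skew = max(skew, max(stamps) - min(stamps))
    return skew

# In a truly shared-tick simulator the skew should be exactly zero;
# a bridged co-simulation would typically show nonzero drift.
shared = {"uav": [0.0, 0.05, 0.10], "car": [0.0, 0.05, 0.10]}
bridged = {"uav": [0.0, 0.05, 0.10], "car": [0.001, 0.052, 0.103]}
assert max_timestamp_skew(shared) == 0.0
assert max_timestamp_skew(bridged) > 0.0
```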

Figures

Figures reproduced from arXiv: 2603.28032 by Hong Zhang, Tianle Zeng, Yanci Wen.

Figure 1: Overview of CARLA-Air, a unified simulation infrastructure for air-ground embodied intelligence. The examples shown here illustrate representative capabilities of the platform, including unified air-ground simulation, multi-modal sensing, embodied navigation, asset adaptation, and diverse urban scenarios within a single physically coherent environment.
Figure 2: Per-frame inter-process data transfer time …
Figure 3: Platform positioning along simulation fidelity …
Figure 4: Runtime architecture of CARLA-Air. …
Figure 5: Resolving the UE4 single-game-mode constraint. (a) Both backends provide independent game mode classes; …
Figure 6: Coordinate frames of the two simulation backends.
Figure 7: VRAM trace over a 3-hour stability run (357 spawn/destroy cycles, moderate joint configuration, RTX A4000). Early-to-late drift is ≈10 MiB; linear regression yields R² = 0.11, confirming no significant memory accumulation.
Figure 8: Dual-client architecture shared by all five workloads.
Figure 9: W1: Air-ground cooperative precision landing on a moving vehicle.
Figure 10: W2: Embodied navigation with aerial reasoning. A UAV autonomously tracks a pedestrian (red box, …
Figure 11: W3: Synchronized multi-modal dataset collection at a single simulation tick.
Figure 12: W4: Air-ground cross-view perception across diverse environments and weather conditions. …
Figure 13: W5: Reinforcement learning training environment.
Figure 14: Custom assets imported into CARLA-Air through the extensible asset pipeline. Top: a four-wheeled mobile robot with onboard LiDAR, imported from an external FBX model. Bottom: a custom electric sport car with user-defined vehicle dynamics. Both assets operate within the shared simulation world alongside all built-in CARLA traffic and AirSim aerial agents, and are visible to all sensor modalities.
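Figure 6 contrasts the coordinate frames of the two backends, which points at one concrete detail any such merge must handle: Unreal Engine (and hence CARLA) works in a left-handed, Z-up frame with centimeter units, while AirSim exposes a right-handed NED (north-east-down) frame in meters. A minimal conversion sketch following those standard conventions (illustrative, not code from the paper):

```python
def ue_to_ned(x_cm: float, y_cm: float, z_cm: float) -> tuple:
    """Convert an Unreal Engine position (left-handed, Z-up, centimeters)
    to AirSim's NED frame (right-handed, Z-down, meters)."""
    return (x_cm / 100.0, y_cm / 100.0, -z_cm / 100.0)

def ned_to_ue(n_m: float, e_m: float, d_m: float) -> tuple:
    """Inverse conversion: NED meters back to UE centimeters."""
    return (n_m * 100.0, e_m * 100.0, -d_m * 100.0)

# A UAV hovering 20 m above the UE origin (z = +2000 cm, up) appears at
# d = -20 m in NED, since "down" is the positive direction there.
assert ue_to_ned(100.0, 0.0, 2000.0) == (1.0, 0.0, -20.0)
assert ned_to_ue(*ue_to_ned(100.0, 0.0, 2000.0)) == (100.0, 0.0, 2000.0)
```

A sign error or missing unit scale in this mapping would show up immediately as air and ground agents disagreeing about where the same object is, which is one reason the frame alignment in Figure 6 is load-bearing.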
Original abstract

The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: https://github.com/louiszengCN/CarlaAir

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents CARLA-Air, an open-source infrastructure that unifies CARLA's high-fidelity urban driving simulation with AirSim's physics-accurate multirotor dynamics inside a single Unreal Engine process. It claims to preserve both platforms' native Python APIs and ROS 2 interfaces for zero-modification reuse, deliver synchronous capture of up to 18 sensor modalities under a shared physics tick and rendering pipeline, and support air-ground embodied intelligence workloads including cooperation, navigation, vision-language tasks, perception, and RL policy training. An extensible asset pipeline and prebuilt binaries are also provided.

Significance. If the integration claims hold, the platform would address a clear gap in domain-segregated simulators by enabling consistent air-ground co-simulation without bridge-induced overhead, potentially accelerating research in cooperative embodied systems. The open release of source and binaries supports reproducibility and extension.

major comments (2)
  1. [Abstract] The central claim that AirSim multirotor dynamics are embedded such that native APIs remain unmodified and strict spatial-temporal consistency is achieved under a single physics tick lacks any supporting implementation details, timing benchmarks, or API-equivalence tests. This makes it impossible to assess whether the zero-modification and zero-overhead guarantees are actually met.
  2. [Abstract] The manuscript provides no validation experiments, performance measurements, or side-by-side comparisons against standalone CARLA and AirSim to substantiate the assertions of photorealism preservation, synchronization fidelity, or workload support. Without such data the soundness of the integration cannot be evaluated.
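The API-equivalence tests the referee asks for could take the shape of a replay-and-diff harness: record a sequence of client calls, run it against the standalone simulator and against the unified build, and report any diverging results. A generic sketch with toy stand-in classes (not CARLA-Air's actual test suite):

```python
def equivalence_report(calls, baseline_api, candidate_api) -> list:
    """Replay (method_name, args) pairs against two API objects and
    return the calls whose results differ."""
    mismatches = []
    for method, args in calls:
        base = getattr(baseline_api, method)(*args)
        cand = getattr(candidate_api, method)(*args)
        if base != cand:
            mismatches.append((method, args, base, cand))
    return mismatches

# Toy stand-ins for a simulator client API.
class StandaloneSim:
    def get_weather(self):
        return "clear"
    def spawn(self, blueprint):
        return f"actor:{blueprint}"

class UnifiedSim(StandaloneSim):  # unified build reuses the same API surface
    pass

calls = [("get_weather", ()), ("spawn", ("vehicle.tesla",))]
assert equivalence_report(calls, StandaloneSim(), UnifiedSim()) == []
```

An empty report over a representative call corpus is the kind of concrete evidence that would let the "zero-modification" guarantee be assessed rather than asserted.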

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing CARLA-Air. The comments correctly identify gaps in implementation details and empirical validation that must be addressed to substantiate the platform's claims. We will perform a major revision incorporating the requested information.

Point-by-point responses
  1. Referee: [Abstract] The central claim that AirSim multirotor dynamics are embedded such that native APIs remain unmodified and strict spatial-temporal consistency is achieved under a single physics tick lacks any supporting implementation details, timing benchmarks, or API-equivalence tests. This makes it impossible to assess whether the zero-modification and zero-overhead guarantees are actually met.

    Authors: We agree that the abstract and current manuscript lack the low-level implementation details, timing data, and equivalence tests needed to evaluate the claims. In the revised manuscript we will add a new technical section describing the embedding of AirSim's multirotor physics into the shared Unreal Engine process, the minimal modifications required to preserve native CARLA and AirSim Python APIs and ROS 2 interfaces, and the single-tick synchronization mechanism. We will also include concrete timing benchmarks (tick duration, sensor latency) and API-equivalence test results demonstrating that unmodified client code from both platforms runs without change. revision: yes

  2. Referee: [Abstract] The manuscript provides no validation experiments, performance measurements, or side-by-side comparisons against standalone CARLA and AirSim to substantiate the assertions of photorealism preservation, synchronization fidelity, or workload support. Without such data the soundness of the integration cannot be evaluated.

    Authors: We acknowledge that the submitted version contains only qualitative descriptions and example workloads without quantitative validation or baseline comparisons. The revised manuscript will add a dedicated evaluation section with side-by-side experiments measuring photorealism (via perceptual image metrics), synchronization fidelity (cross-agent event timing and collision consistency), performance overhead (tick rate and memory usage versus standalone CARLA and AirSim), and end-to-end support for air-ground tasks including cooperation, navigation, vision-language action, and RL policy training. These results will directly substantiate the integration claims. revision: yes
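One of the proposed measurements, checking long-run resource traces for drift (as in the paper's Figure 7, where a 3-hour VRAM trace yields R² = 0.11), reduces to an ordinary least-squares fit of the trace against time: a near-zero R² means no systematic accumulation. A standard-library sketch of that check:

```python
def linear_r2(samples: list) -> float:
    """R^2 of a least-squares line fit of samples against their index.
    A flat-but-noisy trace (no leak) gives R^2 near 0; a steady leak
    drives it toward 1."""
    n = len(samples)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, samples))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in samples)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)

leaky = [1000.0 + 2.0 * i for i in range(100)]                   # steady growth
flat = [1000.0 + (5.0 if i % 2 else -5.0) for i in range(100)]   # oscillation
assert linear_r2(leaky) > 0.99
assert linear_r2(flat) < 0.1
```

Applied to per-tick VRAM or tick-duration samples from joint air-ground runs, this is the statistic behind the stability claim the rebuttal promises to report.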

Circularity Check

0 steps flagged

No significant circularity; software integration claims rest on released code

full rationale

The paper presents a descriptive account of a simulation platform integration with no mathematical derivations, equations, fitted parameters, or predictions. All central claims concern API compatibility, shared physics ticks, and sensor synchronization, which are implementation assertions whose validity is delegated to the released GitHub binaries and source rather than any internal reduction or self-citation chain. No self-definitional loops, uniqueness theorems, or ansatzes appear; the work is self-contained against external benchmarks via code release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper is a software platform description that integrates existing simulators without new theoretical constructs.

pith-pipeline@v0.9.0 · 5581 in / 967 out tokens · 38762 ms · 2026-05-14T22:09:57.479357+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

    cs.RO · 2026-04 · unverdicted · novelty 4.0

    A survey of UAV vision-and-language navigation that establishes a methodological taxonomy, reviews resources and challenges, and proposes a forward-looking research roadmap.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    VISTA 2.0: An open, data-driven simulator for multimodal sensing and policy learning for autonomous vehicles

    Alexander Amini, Tsun-Hsuan Wang, Igor Gilitschenski, Wilko Schwarting, Zhijian Liu, Song Han, Sertac Karaman, and Daniela Rus. VISTA 2.0: An open, data-driven simulator for multimodal sensing and policy learning for autonomous vehicles. In IEEE International Conference on Robotics and Automation (ICRA), pages 4349–4356, 2022

  2. [2]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning (CoRL), pages 1–16, 2017

  3. [3]

    Unreal Engine 4 documentation

    Epic Games. Unreal Engine 4 documentation. https://docs.unrealengine.com/4.26/, 2021

  4. [4]

    RotorS—a modular Gazebo MAV simulator framework

    Fadri Furrer, Michael Burri, Markus Achtelik, and Roland Siegwart. RotorS—a modular Gazebo MAV simulator framework. In Robot Operating System (ROS): The Complete Reference, volume 1, pages 595–625. Springer, 2016

  5. [5]

    FlightGoggles: Photorealistic sensor simulation for perception-driven robotics using photogrammetry and virtual reality

    Winter Guerra, Ezra Tal, Varun Murali, Gilhyun Ryou, and Sertac Karaman. FlightGoggles: Photorealistic sensor simulation for perception-driven robotics using photogrammetry and virtual reality. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 6941–6948, 2019

  6. [6]

    Design and use paradigms for Gazebo, an open-source multi-robot simulator

    Nathan Koenig and Andrew Howard. Design and use paradigms for Gazebo, an open-source multi-robot simulator. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2149–2154, 2004

  7. [7]

    MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning

    Quanyi Li, Zhenghao Peng, Lan Feng, et al. MetaDrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3461–3475, 2023

  8. [8]

    Microscopic traffic simulation using SUMO

    Pablo Alvarez Lopez, Michael Behrisch, Laura Bieker-Walz, et al. Microscopic traffic simulation using SUMO. In IEEE International Conference on Intelligent Transportation Systems (ITSC), pages 2575–2582, 2018

  9. [9]

    Robot operating system 2: Design, architecture, and uses in the wild

    Steve Macenski, Tully Foote, Brian Gerkey, Chris Lalancette, and William Woodall. Robot operating system 2: Design, architecture, and uses in the wild. Science Robotics, 7(66):eabm6074, 2022

  10. [10]

    Isaac Gym: High performance GPU-based physics simulation for robot learning

    Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. In NeurIPS Datasets and Benchmarks Track, 2021

  11. [11]

    Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

    NVIDIA. NVIDIA Isaac Lab: A unified and modular framework for robot learning. arXiv preprint arXiv:2511.04831, 2025

  12. [12]

    Learning to fly—a gym environment with PyBullet physics for reinforcement learning of multi-agent quadcopter control

    Jacopo Panerati, Hehui Zheng, SiQi Zhou, Amanda Prorok, and Angela P. Schoellig. Learning to fly—a gym environment with PyBullet physics for reinforcement learning of multi-agent quadcopter control. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7512–7519, 2021

  13. [13]

    LGSVL Simulator: A high fidelity simulator for autonomous driving

    Guodong Rong, Byung Hyun Shin, Hadi Tabatabaee, et al. LGSVL Simulator: A high fidelity simulator for autonomous driving. In IEEE International Conference on Intelligent Transportation Systems (ITSC), pages 1–6, 2020

  14. [14]

    Habitat: A platform for embodied AI research

    Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, et al. Habitat: A platform for embodied AI research. In IEEE/CVF International Conference on Computer Vision (ICCV), pages 9339–9347, 2019

  15. [15]

    AirSim: High-fidelity visual and physical simulation for autonomous vehicles

    Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, pages 621–635. Springer, 2018

  16. [16]

    Flightmare: A flexible quadrotor simulator

    Yunlong Song, Selim Naji, Elia Kaufmann, Antonio Loquercio, and Davide Scaramuzza. Flightmare: A flexible quadrotor simulator. In Proceedings of the Conference on Robot Learning (CoRL), pages 1–16, 2021

  17. [17]

    TranSimHub: A unified air-ground simulation platform for multi-modal perception and decision-making

    Maonan Wang, Yirong Chen, Yuxin Cai, Aoyu Pang, Yuejiao Xie, Zian Ma, Chengcheng Xu, Kemou Jiang, Ding Wang, Laurent Roullet, Chung Shue Chen, Zhiyong Cui, Yuheng Kan, Michael Lepech, and Man-On Pun. TranSimHub: A unified air-ground simulation platform for multi-modal perception and decision-making. arXiv preprint arXiv:2510.15365, 2025

  18. [18]

    SAPIEN: A SimulAted Part-based Interactive ENvironment

    Fanbo Xiang, Yuzhe Qin, Kaichun Mo, et al. SAPIEN: A SimulAted Part-based Interactive ENvironment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11097–11107, 2020

  19. [19]

    OmniDrones: An efficient and flexible platform for reinforcement learning in drone control

    Botian Xu, Feng Gao, et al. OmniDrones: An efficient and flexible platform for reinforcement learning in drone control. arXiv preprint arXiv:2309.12825, 2023

  20. [20]

    robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

    Yuke Zhu, Josiah Wong, Ajay Mandlekar, Roberto Martín-Martín, et al. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020