FlyMirage: A Fully Automated Generation Pipeline for Diverse and Scalable UAV Flight Data via Generative World Model

Fei Gao; Jinhan Li; Mo Zhu; Qiyi He; Weiqi Ge; Xijie Huang; Xin Zhou; Yijin Wang; Yuze Wu; Zhaoqi Wang

arxiv: 2605.19600 · v1 · pith:7MPL6N4Knew · submitted 2026-05-19 · 💻 cs.RO

Pith reviewed 2026-05-20 04:54 UTC · model grok-4.3

classification 💻 cs.RO

keywords UAV · Vision-Language Navigation · Generative World Model · 3D Gaussian Splatting · Automated Data Pipeline · Flight Trajectory Planning · Embodied Navigation · Aerial Datasets

The pith

FlyMirage automates generation of large-scale photorealistic UAV flight data for aerial vision-language navigation using LLMs and 3D Gaussian Splatting scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlyMirage as a fully automated pipeline that creates aerial VLN datasets at scale. Large language models design diverse environments, while a generative world model converts those designs into high-fidelity 3D Gaussian Splatting scenes. The system then automates scene exploration and semantic acquisition, and applies a dynamically feasible planner to produce UAV trajectories. This setup addresses limits in existing datasets by combining scale, diversity, and realism without heavy reliance on costly real-world captures or simplified simulations. A sympathetic reader would expect the output data to train embodied navigation models more effectively than prior sources.

Core claim

FlyMirage shows that pairing LLMs as environment designers with a generative world model to instantiate designs into 3D Gaussian Splatting scenes, combined with automated exploration, semantic acquisition, and a dynamically feasible planner, produces a large-scale, diverse, and photorealistic aerial VLN dataset containing dynamically feasible UAV flying trajectories.

What carries the argument

The FlyMirage pipeline, which uses LLMs for scene design, a generative world model for 3D Gaussian Splatting instantiation, and automated processes for exploration, semantics, and UAV trajectory planning.

If this is right

Datasets can be produced at scales far beyond manual real-world collection or basic simulations.
Output includes photorealistic visuals from 3DGS rendering and dynamically feasible paths from the planner.
Human effort for scene setup, labeling, and trajectory design drops sharply.
The resulting data directly supports training of next-generation embodied aerial navigation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Generated datasets could serve as a primary training source if transfer to physical UAVs proves reliable.
The pipeline structure might adapt to create data for non-aerial robotic navigation tasks.
Iterative loops that use model errors to prompt new LLM scene designs could refine data quality over time.

Load-bearing premise

The scenes designed by LLMs and created via the generative world model, along with planner outputs, deliver data whose diversity, photorealism, and feasibility are close enough to real UAV conditions to train effective navigation models.

What would settle it

Train an aerial navigation model exclusively on FlyMirage data and measure its success rate on real-world UAV flights against a model trained on equivalent real data; a large performance gap would indicate the generated data falls short.

Figures

Figures reproduced from arXiv: 2605.19600 by Fei Gao, Jinhan Li, Mo Zhu, Qiyi He, Weiqi Ge, Xijie Huang, Xin Zhou, Yijin Wang, Yuze Wu, Zhaoqi Wang.

**Figure 1.** FlyMirage. It uses LLM-designed scene specifications, a generative world model, automated scene annotation, and UAV-feasible trajectory planning to produce scalable, diverse, and photorealistic aerial VLN data.

**Figure 2.** Overall Dataset Creation Pipeline of FlyMirage. It consists of three stages: World Generation, Scene Annotation, and Navigation & Collection.

… world models to synthesize training data, it incurs a prohibitive computational cost, requiring approximately 81,000 NVIDIA L40 GPU hours to generate 240k samples.

**Figure 3.** World Generation. A description guide is paired with a randomly selected scene type to generate a detailed scene description and image, which are then used to create a 3DGS scene with Marble.

We organize common types of scenes into a hierarchical taxonomy of categories and subcategories. To generate a scene description, we first randomly select a subcategory within a broader category, and then prompt GPT-5…
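The two-level sampling step described for World Generation can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the taxonomy entries, the function names `sample_scene_type` and `build_prompt`, and the prompt wording are all hypothetical, and the sketch assumes a category is drawn first and a subcategory second. No language model is called.

```python
import random

# Hypothetical two-level taxonomy; the paper's real categories are not
# listed in this excerpt.
TAXONOMY = {
    "indoor": ["warehouse", "office", "library"],
    "outdoor": ["courtyard", "forest trail", "parking lot"],
}


def sample_scene_type(taxonomy, rng):
    """Draw a category uniformly, then a subcategory within it (assumed order)."""
    category = rng.choice(sorted(taxonomy))
    subcategory = rng.choice(taxonomy[category])
    return category, subcategory


def build_prompt(description_guide, category, subcategory):
    """Combine a description guide with the sampled scene type into one prompt."""
    return (
        f"{description_guide}\n"
        f"Scene category: {category}\n"
        f"Scene subcategory: {subcategory}\n"
        "Write a detailed description of this scene."
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
category, subcategory = sample_scene_type(TAXONOMY, rng)
prompt = build_prompt("Describe layout, objects, and lighting.", category, subcategory)
```

The resulting `prompt` string would then be sent to whichever language model the pipeline uses; that call is deliberately left out here.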

**Figure 4.** Scene Annotation. An iterative algorithm is used to generate accurate bounding boxes for objects in generated scenes. (Boxer [22] is a 3D bounding box estimation algorithm by Meta Reality Labs.)

$$
\mathbf{d}(\psi, \theta) = \begin{bmatrix} \sin\psi \cos\theta \\ \cos\psi \cos\theta \\ \sin\theta \end{bmatrix}, \qquad \mathbf{p}(\psi, \theta) = \mathbf{c} - r_{\mathrm{orb}}\, \mathbf{d}(\psi, \theta).
$$

Each camera position $\mathbf{p}(\psi, \theta)$ is paired with the viewing direction $\mathbf{d}(\psi, \theta)$, orienting the camera to look through the world center $\mathbf{c}$. This …
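The orbit-camera relation quoted with Figure 4 can be checked numerically with a short sketch. The direction and position formulas come from the excerpt; the function names, the yaw/elevation sampling grid in `orbit_poses`, and the example values are assumptions made for illustration only.

```python
import math


def view_direction(psi, theta):
    """Unit viewing direction d(psi, theta) from yaw psi and elevation theta."""
    return (
        math.sin(psi) * math.cos(theta),
        math.cos(psi) * math.cos(theta),
        math.sin(theta),
    )


def orbit_camera(center, r_orb, psi, theta):
    """Camera position p = c - r_orb * d, paired with its viewing direction d."""
    d = view_direction(psi, theta)
    p = tuple(c - r_orb * di for c, di in zip(center, d))
    return p, d


def orbit_poses(center, r_orb, n_yaw, elevations):
    """Sample poses on a yaw/elevation grid (the grid itself is an assumption)."""
    poses = []
    for theta in elevations:
        for k in range(n_yaw):
            psi = 2.0 * math.pi * k / n_yaw
            poses.append(orbit_camera(center, r_orb, psi, theta))
    return poses


center = (1.0, 2.0, 3.0)
p, d = orbit_camera(center, 2.0, 0.0, 0.0)
poses = orbit_poses(center, 2.0, n_yaw=8, elevations=[0.0, math.pi / 6])
```

Because the camera sits at `p` and looks along `d`, moving `r_orb` along the viewing ray lands exactly on the center, which is the "look through the world center" property stated in the text.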

**Figure 5.** Navigation & Collection. An automated pipeline to determine the trajectory goals and deploy EGO-Planner to fly the drone between them.

… region.
• Distance Constraint: The travel distance from the current position to the candidate target point must lie within the range of 2.0 m to 10.0 m.

Candidates that satisfy all conditions are accepted into the target set and used to update the drone position for selecti…
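A rough sketch of the sequential target selection follows. Only the distance constraint is recoverable from this excerpt, so it is the only condition implemented; the other conditions are omitted. Straight-line distance is used as a stand-in for travel distance, and `select_targets`, `satisfies_distance`, and `max_targets` are hypothetical names, not the authors' implementation.

```python
import math

MIN_DIST_M = 2.0   # lower bound from the excerpt
MAX_DIST_M = 10.0  # upper bound from the excerpt


def satisfies_distance(current, candidate):
    """Distance constraint, using Euclidean distance as an assumed proxy."""
    return MIN_DIST_M <= math.dist(current, candidate) <= MAX_DIST_M


def select_targets(start, candidates, max_targets):
    """Accept candidates in order; each accepted target becomes the new position."""
    targets = []
    current = start
    for cand in candidates:
        if len(targets) >= max_targets:
            break
        if satisfies_distance(current, cand):
            targets.append(cand)
            current = cand  # later candidates are measured from here
    return targets


start = (0.0, 0.0, 1.0)
candidates = [
    (1.0, 0.0, 1.0),   # 1 m from start: too close
    (3.0, 0.0, 1.0),   # 3 m from start: accepted
    (4.0, 0.0, 1.0),   # 1 m from the new position: too close
    (20.0, 0.0, 1.0),  # 17 m from the new position: too far
    (9.0, 0.0, 1.0),   # 6 m from the new position: accepted
]
targets = select_targets(start, candidates, max_targets=5)
```

In a fuller version the distance would presumably be measured along a collision-free path rather than a straight line, and the remaining acceptance conditions would be checked alongside it.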

**Figure 6.** Dataset Statistics. Top: distribution of scene categories. Bottom left: object count distribution in generated scenes. Bottom right: trajectory statistics and an example trajectory.

TABLE I: Trajectory Datasets Comparison

| Dataset | N_traj | Action Space | N_scenes | Scene Environment | Traj Generation | Traj Annotation | Kinematics | Scene/Traj Extensibility |
|---|---|---|---|---|---|---|---|---|
| R2R [4] | 7189 | Node-based | 90 | Matterport3D | Sampling from Nodes | Hu… | … | … |

Original abstract

In the field of Vision-Language Navigation (VLN), aerial datasets remain limited in their ability to combine scale, diversity, and realism, often relying on either costly real-world scenes or visually limited simulations. To address these challenges, we introduce FlyMirage, a highly scalable and fully automated data generation pipeline for aerial VLN. Our approach leverages large language models (LLM) as an environment designer to promote scene diversity, paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting (3DGS) scenes. To substantially reduce human labor and ensure the feasibility of flight data, FlyMirage automates scene exploration and semantic information acquisition, and further integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation. Utilizing this toolchain, we generate a large-scale, diverse, and photorealistic aerial VLN dataset, with dynamically feasible flying trajectories, designed to support the development of next-generation embodied navigation models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces FlyMirage, a fully automated pipeline for generating large-scale aerial VLN datasets for UAVs. LLMs design diverse environments, and a generative world model instantiates them as high-fidelity 3D Gaussian Splatting scenes. The pipeline then automates scene exploration and semantic acquisition and applies a dynamically feasible planner to produce UAV trajectories through these photorealistic scenes, addressing the limitations of costly real-world data and visually limited simulations.

Significance. If the generated data demonstrably reduces domain gaps and improves downstream VLN models, the work could enable scalable training of embodied aerial navigation systems with greater diversity and realism than current alternatives. The constructive automation of LLM-based design plus 3DGS instantiation and feasible planning is a practical strength for robotics data pipelines.

major comments (1)

[Abstract] The central claim that the LLM-designed 3DGS scenes and planner outputs yield data with sufficient photorealism, diversity, and dynamic feasibility to support next-generation embodied models (without large sim-to-real gaps) is load-bearing yet unsupported by any quantitative validation, transfer results, or ablation studies in the manuscript.

minor comments (1)

[Pipeline Overview] Clarify the exact generative world model architecture and any constraints on scene complexity or UAV dynamics in the planner description.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the pipeline's practical contributions. We address the major comment below and commit to revisions that strengthen the empirical support for our claims.

Referee: [Abstract] The central claim that the LLM-designed 3DGS scenes and planner outputs yield data with sufficient photorealism, diversity, and dynamic feasibility to support next-generation embodied models (without large sim-to-real gaps) is load-bearing yet unsupported by any quantitative validation, transfer results, or ablation studies in the manuscript.

Authors: We acknowledge that the current manuscript primarily presents the pipeline architecture and qualitative examples of generated scenes and trajectories, without the quantitative benchmarks, downstream transfer results, or component ablations needed to fully substantiate the load-bearing claims. In the revised version we will add: (1) quantitative metrics for photorealism (e.g., FID and perceptual similarity scores against real UAV imagery), diversity (semantic label entropy and embedding variance across scenes), and dynamic feasibility (constraint violation rates and planner success statistics); (2) transfer experiments training VLN models on FlyMirage data and evaluating zero-shot or fine-tuned performance on held-out real or alternative simulated aerial navigation tasks; and (3) ablations isolating the contributions of LLM scene design, 3DGS world modeling, automated exploration, and the dynamically feasible planner. These additions will directly address the concern while preserving the paper's focus on the automated generation toolchain.

revision: yes
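Two of the metrics named in the rebuttal, semantic label entropy and constraint violation rate, could be computed along the following lines. The rebuttal does not define either metric, so both definitions here are assumptions: entropy is taken as Shannon entropy over object labels, and violations are counted on finite-difference velocity and acceleration samples against hypothetical limits `v_max` and `a_max`.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def violation_rate(positions, dt, v_max, a_max):
    """Fraction of finite-difference samples exceeding speed or acceleration limits."""
    # First differences of position approximate velocity.
    vels = [
        tuple((b - a) / dt for a, b in zip(p0, p1))
        for p0, p1 in zip(positions, positions[1:])
    ]
    # First differences of velocity approximate acceleration.
    accs = [
        tuple((b - a) / dt for a, b in zip(v0, v1))
        for v0, v1 in zip(vels, vels[1:])
    ]
    checks = [math.hypot(*v) > v_max for v in vels]
    checks += [math.hypot(*a) > a_max for a in accs]
    return sum(checks) / len(checks) if checks else 0.0


entropy_bits = label_entropy(["chair", "table", "chair", "table"])
trajectory = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
rate = violation_rate(trajectory, dt=1.0, v_max=2.0, a_max=1.0)
```

A higher entropy indicates a more even spread of object labels across scenes, and a violation rate near zero is what a dynamically feasible planner would be expected to produce under the chosen limits.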

Circularity Check

0 steps flagged

Constructive pipeline with no derivation chain or fitted predictions

The paper describes a constructive data-generation pipeline that combines LLMs for scene design, a generative world model for 3DGS instantiation, automated exploration, semantic acquisition, and a dynamically feasible planner. No equations, first-principles derivations, parameter fitting, or predictions are presented that could reduce to the inputs by construction. The central claim is that the resulting dataset supports next-generation models; this is an empirical assertion about the pipeline's output quality rather than a mathematical result that is tautological with its own assumptions. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way within the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the reliability of LLMs for scene design and the fidelity of the generative world model for producing usable 3DGS environments; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: LLMs can serve as effective environment designers that promote scene diversity and feasibility for UAV navigation tasks.
Invoked when stating that LLMs are used as an environment designer to promote scene diversity.
domain assumption: A generative world model can reliably instantiate LLM designs into high-fidelity 3D Gaussian Splatting scenes suitable for flight data generation.
Core premise of the pipeline that pairs the generative world model with 3DGS instantiation.

pith-pipeline@v0.9.0 · 5736 in / 1396 out tokens · 39990 ms · 2026-05-20T04:54:21.461192+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "leverages LLMs as an environment designer paired with a generative world model that instantiates these designs into high-fidelity 3D Gaussian Splatting scenes, with automated scene exploration, semantic acquisition, and a dynamically feasible planner"

IndisputableMonolith/Cost/FunctionalEquation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "integrates a dynamically feasible planner for uncrewed aerial vehicle (UAV) trajectory generation"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 3 internal anchors

[1] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.

[2] G. A. Team, “Gen-1: Scaling embodied foundation models to mastery,” Generalist AI Blog, 2026, https://generalistai.com/blog/apr-02-2026-GEN-1

[3] X. Wang, D. Yang, Y. Liao, W. Zheng, Wenjun Wu, B. Dai, H. Li, and S. Liu, “Uav-flow colosseo: A real-world benchmark for flying-on-a-word uav imitation learning,” 2025. [Online]. Available: https://arxiv.org/abs/2505.15725

[4] P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[5] A. Ku, P. Anderson, R. Patel, E. Ie, and J. Baldridge, “Room-Across-Room: Multilingual vision-and-language navigation with dense spatiotemporal grounding,” in Conference on Empirical Methods for Natural Language Processing (EMNLP), 2020.

[6] J. Krantz, E. Wijmans, A. Majumdar, D. Batra, and S. Lee, “Beyond the nav-graph: Vision and language navigation in continuous environments,” in European Conference on Computer Vision (ECCV), 2020.

[7] B. Miao, R. Wei, Z. Ge, X. Sun, S. Gao, J. Zhu, R. Wang, S. Tang, J. Xiao, R. Tang, and J. Li, “Towards physically executable 3d gaussian for embodied navigation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.21307

[8] Y. Gao, C. Li, Z. You, J. Liu, Z. Li, P. Chen, Q. Chen, Z. Tang, L. Wang, P. Yang, Y. Tang, Y. Tang, S. Liang, S. Zhu, Z. Xiong, Y. Su, X. Ye, J. Li, Y. Ding, D. Wang, Z. Wang, B. Zhao, and X. Li, “Openfly: A comprehensive platform for aerial vision-language navigation,” CoRR, vol. abs/2502.18041, 2025.

[9] P. E. Hart, N. J. Nilsson, and B. Raphael, “A formal basis for the heuristic determination of minimum cost paths,” IEEE Trans. Syst. Sci. Cybern., vol. 4, pp. 100–107, 1968. [Online]. Available: https://api.semanticscholar.org/CorpusID:206799161

[10] World Labs, “Marble models,” https://docs.worldlabs.ai/marble/models, 2026, accessed: 2026-05-09.

[11] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, July 2023. [Online]. Available: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

[12] M. T. I. SpatialVerse Research Team, “Interiorgs: A 3d gaussian splatting dataset of semantically labeled indoor scenes,” https://huggingface.co/datasets/spatialverse/InteriorGS, 2025.

[13] J. Lee, T. Miyanishi, S. Kurita, K. Sakamoto, D. Azuma, Y. Matsuo, and N. Inoue, “Citynav: Language-goal aerial navigation dataset with geographic information,” 2024.

[14] S. Liu, H. Zhang, Y. Qi, P. Wang, Y. Zhang, and Q. Wu, “Aerialvln: Vision-and-language navigation for uavs,” in International Conference on Computer Vision (ICCV), 2023.

[15] Y. Fan, W. Chen, T. Jiang, C. Zhou, Y. Zhang, and X. E. Wang, “Aerial vision-and-dialog navigation,” in Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, Jul. 2023, pp. 3043–3061. [Online]. Available: https://aclanthology.org/2023.findings-acl.190

[16] X. Wang, D. Yang, Z. Wang, H. Kwan, J. Chen, W. Wu, H. Li, Y. Liao, and S. Liu, “Towards realistic uav vision-language navigation: Platform, benchmark, and methodology,” 2024. [Online]. Available: https://arxiv.org/abs/2410.07087

[17] K. Lin, P. Chen, D. Huang, T. H. Li, M. Tan, and C. Gan, “Learning vision-and-language navigation from youtube videos,” arXiv preprint arXiv:2307.11984, 2023.

[18] M. Han, L. Ma, K. Zhumakhanova, E. Radionova, J. Zhang, X. Chang, X. Liang, and I. Laptev, “Roomtour3d: Geometry-aware video-instruction tuning for embodied navigation,” arXiv preprint arXiv:2412.08591, 2024.

[19] Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, C. Callison-Burch, M. Yatskar, A. Kembhavi, and C. Clark, “Holodeck: Language guided generation of 3d embodied ai environments,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024, pp. 16227–16237.

[20] Y. Yang, B. Jia, S. Zhang, and S. Huang, “Sceneweaver: All-in-one 3d scene synthesis with an extensible and self-reflective agent,” in Advances in Neural Information Processing Systems (NeurIPS), 2025.

[21] J. Jang, S. Ye, Z. Lin, J. Xiang, J. Bjorck, Y. Fang, F. Hu, S. Huang, K. Kundalia, Y.-C. Lin et al., “Dreamgen: Unlocking generalization in robot learning through video world models,” arXiv preprint arXiv:2505.12705, 2025.

[22] D. DeTone, T. Shen, F. Zhang, L. Ma, J. Straub, R. Newcombe, and J. Engel, “Boxer: Robust lifting of open-world 2d bounding boxes to 3d,” 2026.

[23] V. Ye, R. Li, J. Kerr, M. Turkulainen, B. Yi, Z. Pan, O. Seiskari, J. Ye, J. Hu, M. Tancik, and A. Kanazawa, “gsplat: An open-source library for gaussian splatting,” Journal of Machine Learning Research, vol. 26, no. 34, pp. 1–17, 2025.

[24] X. Zhou, Z. Wang, H. Ye, C. Xu, and F. Gao, “Ego-planner: An esdf-free gradient-based local planner for quadrotors,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 478–485, 2021.

[25] X. Song, W. Chen, Y. Liu, W. Chen, G. Li, and L. Lin, “Towards long-horizon vision-language navigation: Platform, benchmark and method,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.

[26] X. Liu, Y. Liu, H. Qiu, Y. Qirong, and Z. Lian, “Indooruav: Benchmarking vision-language uav navigation in continuous indoor environments,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 28, 2026, pp. 23864–23872.

[27] Qwen Team, “Qwen3.5: Towards native multimodal agents,” February

[28] [Online]. Available: https://qwen.ai/blog?id=qwen3.5

[29] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra, “Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,” in Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track,...

[30] Y. Wu, M. Zhu, X. Li, Y. Du, Y. Fan, W. Li, Z. Han, X. Zhou, and F. Gao, “Vla-an: An efficient and onboard vision-language-action framework for aerial navigation in complex environments,” 2025. [Online]. Available: https://arxiv.org/abs/2512.15258

[31] X. Huang, W. Gai, T. Wu, C. Wang, Z. Liu, X. Zhou, Y. Wu, and F. Gao, “Navdreamer: Video models as zero-shot 3d navigators,” arXiv preprint arXiv:2602.09765, 2026.
