Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
AeroVerse: UA V-agent benchmark suite for simulating, pre-training, finetuning, and evaluating aerospace embodied world models
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4polarities
background 2representative citing papers
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.
A universal LLM-to-drone interface is implemented via the Model Context Protocol (MCP) and Mavlink, demonstrated with real UAV flight control and simulated flights using live map data.
AirVista-II integrates agent-based task identification and scheduling, multimodal perception, and scenario-tailored keyframe extraction to deliver high-quality zero-shot semantic understanding for embodied UAVs in dynamic environments.
citing papers explorer
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.
-
A Universal Large Language Model -- Drone Command and Control Interface
A universal LLM-to-drone interface is implemented via the Model Context Protocol (MCP) and Mavlink, demonstrated with real UAV flight control and simulated flights using live map data.
-
AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding
AirVista-II integrates agent-based task identification and scheduling, multimodal perception, and scenario-tailored keyframe extraction to deliver high-quality zero-shot semantic understanding for embodied UAVs in dynamic environments.