AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
arXiv preprint arXiv:2410.09604 , year=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
polarities
background 2representative citing papers
SpatialUAV releases a new multi-task benchmark for low-altitude UAV spatial intelligence and demonstrates that existing VLMs exhibit clear weaknesses in cross-view association and geometric reasoning.
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
NormAct shows MLLMs reach explicit goals in 67.3% of cases but comply with hidden norms in only 26.4%, with NormPerceptor raising task success from 24.2% to 46.7%.
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
citing papers explorer
-
AirGroundBench: Probing Spatial Intelligence in Multimodal Large Models under Heterogeneous Multi-View Embodied Collaboration
AirGroundBench is a new diagnostic benchmark exposing that MLLMs handle basic spatial perception but struggle with cross-view alignment, transformation reasoning, and embodied navigation under heterogeneous air-ground views.
-
SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks
SpatialWorld is a new multi-simulator benchmark showing top multimodal agents achieve under 18% success on interactive spatial tasks requiring active exploration and long-horizon planning.
-
How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
Large multimodal models display emerging but limited spatial action capabilities in goal-oriented urban 3D navigation, remaining far from human-level performance with errors diverging rapidly after critical decision points.
-
NormAct: A Benchmark for Hidden Social Norm Compliance in Embodied Planning
NormAct shows MLLMs reach explicit goals in 67.3% of cases but comply with hidden norms in only 26.4%, with NormPerceptor raising task success from 24.2% to 46.7%.
-
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.