RefDrone: A challenging benchmark for referring expression comprehension in drone scenes

Zhichao Sun, Yepeng Liu, Zhiling Su, Huachao Zhu, Yuliang Gu, Yuda Zou, Zelong Liu, Gui-Song Xia, Bo Du, Yongchao Xu · 2025 · arXiv 2502.00392

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · dataset 1

citation-polarity summary

background 3

representative citing papers

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue

cs.RO · 2026-05-02 · unverdicted · novelty 7.0

ESARBench is the first unified benchmark for MLLM-driven UAV agents that must explore, locate clues, and decide on victim positions in photorealistic simulated SAR environments.

Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

Language-guided semantic cues from MLLM visual pipelines, steered by text embeddings, refine object semantics and boost grounding accuracy against occlusion and small objects.

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

cs.CV · 2026-04-02 · conditional · novelty 6.0

UAV-Track VLA modifies the π0.5 VLA architecture with temporal compression and dual-branch decoding to reach 61.76% success and 269.65 average frames in long-distance pedestrian tracking on a new 890K-frame UAV dataset, while cutting inference latency by 33.4%.

Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models

cs.RO · 2026-04-09 · unverdicted · novelty 4.0

This survey organizes aerial vision-language navigation methods into five architectural categories, critically reviews evaluation infrastructure, and synthesizes seven open problems for LLM/VLM integration.

citing papers explorer

ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue · cs.RO · 2026-05-02 · unverdicted · none · ref 53
Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues · cs.CV · 2026-04-27 · unverdicted · none · ref 19
UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models · cs.CV · 2026-04-02 · conditional · none · ref 19
Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models · cs.RO · 2026-04-09 · unverdicted · none · ref 116
