SpatialUAV: Benchmarking Spatial Intelligence for Low-Altitude UAV Perception, Collaboration, and Motion

Haoyu Zhang; Kun Wang; Liqiang Nie; Meng Liu; Qianlong Xiang; Yaowei Wang

arxiv: 2606.27876 · v1 · pith:KZVSRESMnew · submitted 2026-06-26 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 04:29 UTC · model grok-4.3

keywords: SpatialUAV, UAV benchmark, spatial intelligence, vision-language models, low-altitude UAV, perception, collaboration, motion understanding

The pith

SpatialUAV benchmark shows vision-language models remain far from human performance on low-altitude UAV spatial tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpatialUAV, a benchmark built from real low-altitude UAV footage that contains 4,331 instances spread across 14 task types. These tasks test semantic discrimination, spatial relations, aerial-aerial and aerial-ground collaboration, and motion understanding through a unified question-answer format that supports seven input configurations and nine answer formats. Evaluation of representative vision-language models finds large gaps relative to human performance, especially in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding. The results supply concrete targets for improving spatial capabilities in UAV systems.

Core claim

SpatialUAV organizes real low-altitude UAV data into 4,331 validated instances and 14 fine-grained tasks that together require 3D spatial inference, multi-view collaboration, scene dynamics, and varied output formats; when representative vision-language models are tested on this collection they fall well short of human accuracy with the largest shortfalls appearing in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding.

What carries the argument

The argument rests on the SpatialUAV benchmark itself. It supplies a single visual-input–question–answer schema together with detector-assisted region labeling, depth supervision, metadata rules, and multi-turn human validation, all intended to produce reliable test cases across seven input configurations and nine answer formats.
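
To make the unified schema more concrete, here is a minimal Python sketch of what a single benchmark record might look like. The field names, enum members, task name, and example values are assumptions for illustration only; the paper states that there are seven input configurations and nine answer formats, and the abstract names five of those formats, which are the only ones listed here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class AnswerFormat(Enum):
    # Only the five formats named in the abstract; the benchmark defines nine.
    OPTION_LABEL = "option_label"
    REGION_ID = "region_id"
    GEOMETRIC_VALUE = "geometric_value"
    CROSS_VIEW_CORRESPONDENCE = "cross_view_correspondence"
    FREE_FORM_MOTION = "free_form_motion"


@dataclass
class Instance:
    """One visual-input / question / answer record (hypothetical layout)."""
    task_type: str             # one of the 14 fine-grained task types
    input_config: str          # one of the seven input configurations
    visual_inputs: list[str]   # image or frame paths
    question: str
    answer_format: AnswerFormat
    answer: Any
    metadata: dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        # Basic sanity checks; a real loader would check far more.
        if not self.visual_inputs:
            raise ValueError("an instance needs at least one visual input")
        if not self.question.strip():
            raise ValueError("question must be non-empty")


example = Instance(
    task_type="aerial_aerial_object_matching",  # hypothetical identifier
    input_config="two_view",                    # hypothetical identifier
    visual_inputs=["uav_a.jpg", "uav_b.jpg"],
    question="Which region in the second view shows the object marked R1?",
    answer_format=AnswerFormat.REGION_ID,
    answer="R3",
)
example.validate()
```

Keeping the answer format as an explicit field is one way a single loader could route each record to a format-specific scorer.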

If this is right

Models need targeted advances in cross-view association and geometric reasoning to approach human capability on UAV collaboration tasks.
Benchmarks for UAV perception must support multiple answer formats including region identifiers, geometric values, and free-form motion descriptions.
Diverse input configurations such as multi-view and temporal sequences expose limitations that single-image tests miss.
The identified bottlenecks supply empirical targets for training and architecture improvements in low-altitude UAV systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark's emphasis on real metadata and depth supervision could be reused to generate synthetic training data for UAV-specific models.
Similar curation pipelines might expose comparable gaps when applied to ground-robot or satellite imagery tasks.
If models improve on the listed bottlenecks, they may still require separate testing on safety-critical edge cases such as low-light or high-wind conditions.

Load-bearing premise

The 4,331 curated instances and 14 task types form a representative and unbiased sample of the spatial intelligence challenges that arise in real low-altitude UAV operations.

What would settle it

A controlled follow-up study on a fresh set of low-altitude UAV images in which the same models reach human-level accuracy on cross-view association and geometric-reasoning tasks would directly contradict the reported performance gaps.

Figures

Figures reproduced from arXiv: 2606.27876 by Haoyu Zhang, Kun Wang, Liqiang Nie, Meng Liu, Qianlong Xiang, Yaowei Wang.

**Figure 2.** Overall construction pipeline of SpatialUAV. In the task synthesis step, each instance is constructed by organizing task-specific visual inputs, designing …

**Figure 3.** Task distribution of SpatialUAV. The inner ring shows the major …

**Figure 4.** Answer-format distribution of SpatialUAV. The histogram reports the …

**Figure 5.** Macro-average performance across SpatialUAV reasoning groups.

**Figure 6.** Qualitative cases on representative SpatialUAV tasks. The examples cover aerial–aerial camera transformation, aerial–aerial object matching, and …

**Figure 7.** Answer-format ablation across four representative tasks. Each panel reports one model, with Orig. and MC denoting the original structured-answer …

Original abstract

Spatial intelligence is essential for low-altitude unmanned aerial vehicle (UAV) perception, collaboration, and navigation. However, existing UAV benchmarks often emphasize image-level recognition, single-view understanding, or narrow answer formats, leaving 3D spatial inference, multi-view collaboration, scene dynamics, and diverse task formulations insufficiently evaluated. To address these gaps, we introduce SpatialUAV, a real low-altitude UAV benchmark comprising 4,331 curated instances across 14 fine-grained task types, covering semantic discrimination, spatial relation, aerial–aerial collaboration, aerial–ground collaboration, and motion understanding. SpatialUAV organizes all samples into a unified visual-input–question–answer schema, while supporting seven input configurations and nine answer formats, including option labels, region identifiers, geometric values, cross-view correspondences, and free-form motion descriptions. To ensure reliable and grounded evaluation, our data construction pipeline integrates detector-assisted regions, depth supervision, metadata-derived rules, extensive manual annotation, blind filtering, and multi-turn human validation, together with task-specific metrics for heterogeneous outputs. Evaluating representative vision-language models across three categories, we show that current models remain far from human-level performance, with pronounced bottlenecks in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding. These results offer empirical guidance for advancing low-altitude UAV spatial intelligence. Code and data are available at https://github.com/Hyu-Zhang/SpatialUAV.
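
The abstract mentions task-specific metrics for heterogeneous outputs without spelling them out on this page. The sketch below shows one plausible way to score three kinds of answers and then macro-average over task groups, in the spirit of the macro-average reported in Figure 5. The metric choices (exact match, box IoU, a relative-error tolerance), the 10% tolerance, and every function name are assumptions for illustration, not the authors' scoring code.

```python
from collections import defaultdict


def exact_match(pred: str, gold: str) -> float:
    """Case-insensitive exact match, e.g. for option labels."""
    return float(pred.strip().lower() == gold.strip().lower())


def box_iou(a, b) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def within_tolerance(pred: float, gold: float, rel_tol: float = 0.1) -> float:
    """Assumed rule: correct if the relative error is at most rel_tol."""
    if gold == 0:
        return float(pred == 0)
    return float(abs(pred - gold) / abs(gold) <= rel_tol)


# Hypothetical mapping from answer kind to scorer.
SCORERS = {"option": exact_match, "box": box_iou, "value": within_tolerance}


def macro_average(records):
    """records: iterable of (group, answer_kind, pred, gold) tuples.

    Returns (mean of per-group means, dict of per-group means).
    """
    per_group = defaultdict(list)
    for group, kind, pred, gold in records:
        per_group[group].append(SCORERS[kind](pred, gold))
    group_means = {g: sum(s) / len(s) for g, s in per_group.items()}
    return sum(group_means.values()) / len(group_means), group_means
```

Macro-averaging gives each group equal weight regardless of how many instances it contains, so a large group cannot mask weak performance on a small one.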

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SpatialUAV adds a new benchmark with solid construction steps for UAV spatial tasks and documents model gaps, though representativeness of the instances remains a minor open question.

This paper's main point is a new benchmark dataset for low-altitude UAV spatial intelligence. It covers 4,331 instances across 14 task types that include cross-view collaboration, aerial-ground interaction, and motion understanding, with support for seven input configurations and nine answer formats.

The construction pipeline stands out as careful. Detector-assisted regions, depth supervision, metadata rules, blind filtering, and multi-turn human validation give the data more grounding than many quick benchmark releases. Releasing code and data is also a plus.

The evaluations on vision-language models show clear shortfalls in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding. These results line up with known VLM weaknesses but apply them to UAV-specific scenarios, which is useful.

The soft spot is modest. The claim that the instances form a representative sample of real UAV challenges rests on the pipeline, but the initial detector-assisted selection and any post-filtering effects could still tilt the difficulty distribution. The paper does not appear to quantify that risk in detail.

This work is for researchers working on aerial vision, multi-view reasoning, or robotics applications who need concrete tasks to measure progress. It is not aimed at broader theory.

It deserves peer review. The dataset fills a documented gap in prior UAV benchmarks and the evaluation setup is concrete enough to warrant referee input.

Referee Report

2 major / 2 minor

Summary. The paper introduces SpatialUAV, a benchmark for low-altitude UAV spatial intelligence comprising 4,331 curated instances across 14 task types (semantic discrimination, spatial relation, aerial-aerial and aerial-ground collaboration, motion understanding). Samples follow a unified visual-input–question–answer schema supporting seven input configurations and nine answer formats. Data construction uses a multi-stage pipeline (detector-assisted regions, depth supervision, metadata rules, manual annotation, blind filtering, multi-turn human validation) with task-specific metrics. Evaluation of representative vision-language models across three categories shows large gaps versus human performance, with bottlenecks in cross-view association, structured grounding, geometric reasoning, and temporal viewpoint understanding. Code and data are released publicly.

Significance. If the results hold, SpatialUAV supplies a much-needed resource that moves beyond image-level recognition and single-view tasks to emphasize 3D spatial inference, multi-view collaboration, scene dynamics, and heterogeneous output formats. The multi-control construction pipeline (detector assistance, depth supervision, blind filtering, human validation) directly mitigates selection bias and annotation artifacts, lending credibility to the reported model gaps. Public release of code and data supports reproducibility. The work identifies concrete, actionable bottlenecks that can guide future model development for UAV perception and navigation.

major comments (2)

[Data construction pipeline] Data construction pipeline (abstract and §3): the description of blind filtering and post-hoc rules is high-level; quantitative statistics on rejection rates at each filtering stage and their effect on the final distribution of the 4,331 instances are needed to confirm that the reported model gaps are not artifacts of the curation process.
[Evaluation protocol] Evaluation protocol (§4): human performance baselines for the nine heterogeneous answer formats (region identifiers, geometric values, free-form motion descriptions) are referenced but the exact protocol, number of annotators, and inter-annotator agreement are not detailed; without these the magnitude of the claimed bottlenecks cannot be fully assessed.
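
The per-stage statistics requested in the first major comment amount to a simple funnel table. Here is a minimal sketch, assuming each candidate instance records the name of the stage that rejected it (or `None` if it survived), and assuming the stages run sequentially in the order the paper lists them. The stage identifiers mirror the pipeline steps named in the paper, but the record layout and function name are hypothetical.

```python
from collections import Counter

# Stage names follow the pipeline steps listed in the paper; treating them
# as a strict sequence of filters is an assumption of this sketch.
STAGES = [
    "detector_regions",
    "depth_supervision",
    "metadata_rules",
    "manual_annotation",
    "blind_filtering",
    "human_validation",
]


def funnel(rejected_at):
    """rejected_at: one entry per candidate, a stage name or None if kept.

    Returns (rows, kept) where each row is
    (stage, candidates_entering, rejected, rejection_rate).
    """
    counts = Counter(s for s in rejected_at if s is not None)
    remaining = len(rejected_at)
    rows = []
    for stage in STAGES:
        rejected = counts.get(stage, 0)
        rate = rejected / remaining if remaining else 0.0
        rows.append((stage, remaining, rejected, rate))
        remaining -= rejected
    return rows, remaining
```

Running the same tally separately per task type and per answer format would show whether any filter removes one category disproportionately, which is the distributional effect the comment asks about.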

minor comments (2)

[Abstract] Abstract: the phrase 'pronounced bottlenecks' is used without a quantitative threshold; a brief statement of the performance gap (e.g., accuracy or score difference) would make the claim more precise.
[Task taxonomy] Task taxonomy: the distinction between the 14 fine-grained task types and the five high-level categories could be clarified with an explicit mapping table to avoid reader confusion when interpreting results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, constructive feedback, and recommendation for minor revision. We address each major comment below with clarifications and commitments to strengthen the manuscript.

Referee: [Data construction pipeline] Data construction pipeline (abstract and §3): the description of blind filtering and post-hoc rules is high-level; quantitative statistics on rejection rates at each filtering stage and their effect on the final distribution of the 4,331 instances are needed to confirm that the reported model gaps are not artifacts of the curation process.

Authors: We agree that quantitative details on the filtering stages would increase transparency and help rule out curation artifacts. The current manuscript describes the pipeline at a high level but does not report per-stage rejection counts or distributional shifts. In the revised version we will add a new table (or subsection in §3) listing the number of instances rejected after each step (detector-assisted region proposal, depth supervision, metadata rules, manual annotation, blind filtering, and multi-turn validation) together with a brief analysis of how these filters affected the final balance across the 14 task types and nine answer formats.

Revision: yes

Referee: [Evaluation protocol] Evaluation protocol (§4): human performance baselines for the nine heterogeneous answer formats (region identifiers, geometric values, free-form motion descriptions) are referenced but the exact protocol, number of annotators, and inter-annotator agreement are not detailed; without these the magnitude of the claimed bottlenecks cannot be fully assessed.

Authors: We acknowledge that the human baseline protocol is described only at a summary level. The revised manuscript will expand §4 (and the supplementary material) to specify: (i) the exact instructions and interface given to human annotators for each of the nine answer formats, (ii) the number of independent annotators per sample (minimum three), and (iii) inter-annotator agreement statistics (e.g., Cohen’s κ for categorical formats and normalized edit distance or IoU for geometric/region formats). These additions will allow readers to better calibrate the reported model–human gaps.

Revision: yes
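
For the categorical answer formats, the agreement statistic mentioned in the response can be computed in a few lines. The sketch below implements Cohen's κ for two annotators and, because the response promises at least three annotators per sample, averages it over all annotator pairs. The pairwise averaging and the function names are assumptions of this sketch; Fleiss' κ is a common alternative for three or more annotators.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all pairs of annotators.

    annotations: list of label lists, one per annotator (at least two).
    """
    pairs = list(combinations(annotations, 2))
    if not pairs:
        raise ValueError("need at least two annotators")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

For example, two annotators who agree on three of four items with labels `["A", "B", "A", "C"]` and `["A", "B", "B", "C"]` have an observed agreement of 0.75 and a chance agreement of 5/16, giving κ = 7/11.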

Circularity Check

0 steps flagged

No significant circularity

This is an empirical benchmark paper with no mathematical derivation, equations, fitted parameters, or predictions. The central claim rests on a new dataset of 4,331 instances across 14 tasks, constructed via detector-assisted regions, depth supervision, metadata rules, blind filtering, and multi-turn human validation. These steps are independent quality controls, not reductions to author-defined quantities or self-citations. Model evaluations use standard metrics on the new benchmark, with no load-bearing self-citation chains or ansatzes. The paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is a new curated dataset whose quality rests on standard computer-vision assumptions about annotation reliability rather than new mathematical axioms or invented physical entities.

axioms (1)

Domain assumption: Multi-turn human validation combined with detector-assisted and depth-supervised rules produces accurate ground-truth labels for spatial and motion tasks.
Invoked in the data construction pipeline description; no independent verification of label accuracy is supplied beyond the pipeline itself.


[1] [1]

Long-short match for lost control in uav multi-object tracking,

Z. Zou, M. Ye, L. Ji, L. Zhou, S. Tang, Y . Gan, and S. Li, “Long-short match for lost control in uav multi-object tracking,”IEEE Transactions on Multimedia, vol. 28, pp. 786–800, 2026

2026

[2] [2]

Fre-stformer: A frequency-based spatio-temporal transformer for uav human action recognition,

T. Xiang, X. Xia, J. Yuan, and Z. Tu, “Fre-stformer: A frequency-based spatio-temporal transformer for uav human action recognition,”IEEE Transactions on Multimedia, pp. 1–13, 2026

2026

[3] [3]

Mode-track: Robust multi-object tracking with motion decoupling in uav videos,

Z. Song, Y . Li, S. Zhou, W. Tang, and L. Wang, “Mode-track: Robust multi-object tracking with motion decoupling in uav videos,”IEEE Transactions on Multimedia, pp. 1–11, 2026

2026

[4] [4]

A survey of robotic language grounding: tradeoffs between symbols and embeddings,

V . Cohen, J. X. Liu, R. Mooney, S. Tellex, and D. Watkins, “A survey of robotic language grounding: tradeoffs between symbols and embeddings,” inProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024, pp. 7999–8009. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 10

2024

[5] [5]

Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai,

T. Wang, X. Mao, C. Zhu, R. Xu, R. Lyu, P. Li, X. Chen, W. Zhang, K. Chen, T. Xueet al., “Embodiedscan: A holistic multi-modal 3d perception suite towards embodied ai,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 757–19 767

2024

[6] [6]

Egothink: Evaluating first-person perspective thinking capability of vision-language models,

S. Cheng, Z. Guo, J. Wu, K. Fang, P. Li, H. Liu, and Y . Liu, “Egothink: Evaluating first-person perspective thinking capability of vision-language models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 291–14 302

2024

[7] [7]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 455–14 465

2024

[8] [8]

Thinking in space: How multimodal large language models see, remember, and recall spaces,

J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 632–10 643

2025

[9] [9]

Urbanvideo-bench: Benchmarking vision-language models on embodied intelligence with video data in urban spaces,

B. Zhao, J. Fang, Z. Dai, Z. Wang, J. Zha, W. Zhang, C. Gao, Y . Wang, J. Cui, X. Chenet al., “Urbanvideo-bench: Benchmarking vision-language models on embodied intelligence with video data in urban spaces,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 32 400– 32 423

2025

[10] [10]

Cityeqa: A hierarchical llm agent on embodied question answering benchmark in city space,

Y . Zhao, K. Xu, Z. Zhu, Y . Hu, Z. Zheng, Y . Chen, Y . Ji, C. Gao, Y . Li, and J. Huang, “Cityeqa: A hierarchical llm agent on embodied question answering benchmark in city space,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 12 465–12 480

2025

[11] N. Jiang, K. Wang, X. Peng, X. Yu, Q. Wang, J. Xing, G. Li, G. Guo, Q. Ye, J. Jiao, J. Zhao, and Z. Han, “Anti-uav: A large-scale benchmark for vision-based uav tracking,” IEEE Transactions on Multimedia, vol. 25, pp. 486–500, 2023.

[12] W. Zhang, Z. Zhou, X. Zeng, L. Xuchen, J. Fang, C. Gao, J. Cui, Y. Li, X. Chen, and X.-P. Zhang, “Open3d-vqa: A benchmark for embodied spatial concept reasoning with multimodal large language model in open space,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 12 784–12 791.

[13] L. Zhang, Y. Zhang, H. Li, H. Fu, Y. Tang, H. Ye, L. Chen, X. Liang, X. Hao, and W. Ding, “Is your vlm sky-ready? a comprehensive spatial intelligence benchmark for uav navigation,” arXiv preprint arXiv:2511.13269, 2025.

[14] D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 129–19 139.

[15] X. Ma, S. Yong, Z. Zheng, Q. Li, Y. Liang, S.-C. Zhu, and S. Huang, “Sqa3d: Situated question answering in 3d scenes,” in International Conference on Learning Representations, 2023.

[16] Y. Hong, J. Liu, H. Yin, M. Li, L. Guibas, F.-F. Li, J. Wu, and Y. Choi, “ESI-Bench: Towards embodied spatial intelligence that closes the perception–action loop,” arXiv preprint arXiv:2605.18746, 2026.

[17] J. Zha, Y. Fan, T. Zhang, G. Chen, Y. Chen, C. Gao, and X. Chen, “Aircopbench: A benchmark for multi-drone collaborative embodied perception and reasoning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 2, 2026, pp. 1507–1515.

[18] S. Dai, Z. Ma, Z. Luo, X. Yang, Y. Huang, W. Zhang, C. Chen, Z. Guo, W. Xu, Y. Sun, and M. Sun, “MM-UAVBench: How well do multimodal large language models see, think, and plan in low-altitude uav scenarios?” arXiv preprint arXiv:2512.23219, 2025.

[19] Y. Zhan and Y. Yuan, “Uavbench and uavit-1m: Benchmarking and enhancing mllms for low-altitude uav vision-language understanding,” arXiv preprint arXiv:2603.14336, 2026.

[20] D. Liu, J. Feng, D. Li, Y. Zheng, G. Li, W. Dong, and G. Shi, “Are vlms lost between sky and space? LinkS2Bench for uav-satellite dynamic cross-view spatial intelligence,” arXiv preprint arXiv:2604.02020, 2026.

[21] S. Yang, J. Yang, P. Huang, E. L. Brown II, Z. Yang, Y. Yu, S. Tong, Z. Zheng, Y. Xu, M. Wang et al., “Cambrian-s: Towards spatial supersensing in video,” in The Fourteenth International Conference on Learning Representations, 2026.

[22] J. Lin, R. Xu, S. Zhu, S. Yang, P. Cao, Y. Ran, M. Hu, C. Zhu, Y. Xie, Y. Long, W. Hu, D. Lin, T. Wang, and J. Pang, “Mmsi-video-bench: A holistic benchmark for video-based spatial intelligence,” arXiv preprint arXiv:2512.10863, 2025.

[23] C. Gao, B. Zhao, W. Zhang, J. Mao, J. Zhang, Z. Zheng, F. Man, J. Fang, Z. Zhou, J. Cui, X. Chen, and Y. Li, “Embodiedcity: A benchmark platform for embodied agent in real-world city environment,” arXiv preprint arXiv:2410.09604, 2024.

[24] Z. Shao, G. Cheng, J. Ma, Z. Wang, J. Wang, and D. Li, “Real-time and accurate uav pedestrian detection for social distancing monitoring in covid-19 pandemic,” IEEE Transactions on Multimedia, vol. 24, pp. 2069–2083, 2022.

[25] M. A. Ferrag, A. Lakas, and M. Debbah, “UAVBench: An open benchmark dataset for autonomous and agentic ai uav systems via llm-generated flight scenarios,” arXiv preprint arXiv:2511.11252, 2025.

[26] M. Guo, M. Wu, J. He, S. Li, H. Li, and C. Tao, “Bedi: A comprehensive benchmark for evaluating embodied agents on uavs,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 232, pp. 910–936, 2026.

[27] A. Dutta, S. Das, J. Nielsen, R. Chakraborty, and M. Shah, “Multiview aerial visual recognition (mavrec): Can multi-view improve aerial visual perception?” arXiv preprint arXiv:2312.04548, 2023.

[28] B. Zhao, R. Tang, M. Jia, Z. Wang, F. Man, X. Zhang, Y. Shang, W. Zhang, W. Wu, C. Gao, X. Chen, and Y. Li, “Airscape: An aerial generative world model with motion controllability,” arXiv preprint arXiv:2507.08885, 2025.

[29] Z. Zheng, Y. Wei, and Y. Yang, “University-1652: A multi-view multi-source benchmark for drone-based geo-localization,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1395–1403.

[30] W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, “Metric3d: Towards zero-shot metric 3d prediction from a single image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9043–9053.

[31] S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge et al., “Qwen3-vl technical report,” arXiv preprint arXiv:2511.21631, 2025.

[32] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao et al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025.

[33] Z. Cai, R. Wang, C. Gu, F. Pu, J. Xu, Y. Wang, W. Yin, Z. Yang, C. Wei, T. Zhou et al., “Scaling spatial intelligence with multimodal foundation models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 7879–7890.

[34] D. Wu, F. Liu, Y.-H. Hung, and Y. Duan, “Spatial-mllm: Boosting mllm capabilities in visual-based spatial intelligence,” Advances in Neural Information Processing Systems, vol. 38, pp. 13 569–13 597, 2026.

[35] R. Yang, Z. Zhu, Y. Li, J. Huang, S. Yan, S. Zhou, Z. Liu, X. Li, S. Li, W. Wang et al., “Visual spatial tuning,” arXiv preprint arXiv:2511.05491, 2025.

[36] H. Zhang, M. Liu, Z. Li, H. Wen, W. Guan, Y. Wang, and L. Nie, “Spatial understanding from videos: Structured prompts meet simulation data,” Advances in Neural Information Processing Systems, vol. 38, pp. 103 202–103 229, 2026.

[37] W. Guan, H. Zhang, M. Liu, Q. Xiang, Y. Wang, and L. Nie, “Spaceera++: A unified framework towards 3d spatial reasoning in video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.

[38] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.