AeroVerse-SatAgent: UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience

Chen Gao; Fanglong Yao; Kun Fu; Peng Hu; Wenyi Zhang; Xian Sun; Youzhi Liu; Zhengqiu Zhu

arxiv: 2606.31467 · v1 · pith:WBQDL4XWnew · submitted 2026-06-30 · 💻 cs.CV


Wenyi Zhang, Fanglong Yao, Youzhi Liu, Peng Hu, Zhengqiu Zhu, Chen Gao, Xian Sun, Kun Fu

Pith reviewed 2026-07-01 06:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords: UAV · satellite imagery · spatial reasoning · multi-view fusion · 3D reconstruction · visual pathways · embodied intelligence


The pith

SatAgent combines UAV and satellite views with explicit 3D geometry to enable more accurate spatial reasoning in urban settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SatAgent as a collaborative system that processes images from both UAV and satellite perspectives to overcome limitations of single-view perception and purely semantic models. It draws on the dual visual pathway idea to build explicit three-dimensional representations and align features across views in a shared coordinate system. A new dataset of 130,000 multi-view samples supports training and evaluation on geometric relationship tasks. The approach yields higher accuracy than existing models, especially on complex spatial queries. This matters because reliable aerial spatial understanding supports autonomous operations in environments with occlusions and changing viewpoints.

Core claim

SatAgent jointly leverages satellite and UAV perspectives through a Geometric-Aware 3D Reconstruction Encoder that lifts 2D UAV features into explicit 3D spatial representations, a multi-view topology-semantic alignment module that integrates cross-view features in a unified BEV coordinate system, and a multi-view consistency loss that encourages viewpoint-invariant representations, resulting in improved performance on spatial reasoning tasks.

What carries the argument

The Geometric-Aware 3D Reconstruction Encoder that converts 2D UAV features into explicit 3D spatial representations, together with multi-view alignment across satellite and UAV inputs.
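The lifting step can be illustrated with a minimal sketch. It assumes a pinhole camera with known intrinsics, a per-pixel metric depth, and a camera frame whose x axis points right and z axis points forward; it ignores camera extrinsics, which a tilted UAV camera would need. The function names, grid origin, and cell size are hypothetical, and hard binning stands in for whatever projection the paper's encoder actually uses.

```python
from typing import Dict, List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Cell = Tuple[int, int]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Back-project pixel (u, v) with metric depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_bev_cell(point: Point3D, cell_size: float,
                grid_origin: Tuple[float, float]) -> Cell:
    """Map a 3D point to a (row, col) cell on the x-z plane (assumed ground plane)."""
    x, _, z = point
    col = int((x - grid_origin[0]) // cell_size)
    row = int((z - grid_origin[1]) // cell_size)
    return (row, col)


def pool_features_to_bev(
    pixels: Sequence[Tuple[float, float, float, Sequence[float]]],
    fx: float, fy: float, cx: float, cy: float,
    cell_size: float = 1.0,
    grid_origin: Tuple[float, float] = (-8.0, 0.0),
) -> Dict[Cell, List[float]]:
    """Average the feature vectors of all pixels that land in the same BEV cell.

    Each entry of `pixels` is (u, v, depth, feature_vector).
    """
    sums: Dict[Cell, List[float]] = {}
    counts: Dict[Cell, int] = {}
    for u, v, depth, feat in pixels:
        cell = to_bev_cell(unproject(u, v, depth, fx, fy, cx, cy),
                           cell_size, grid_origin)
        if cell not in sums:
            sums[cell] = [0.0] * len(feat)
            counts[cell] = 0
        sums[cell] = [s + f for s, f in zip(sums[cell], feat)]
        counts[cell] += 1
    return {cell: [s / counts[cell] for s in vec] for cell, vec in sums.items()}
```

The figure captions on this page mention UniDepth depth estimation and an EWA Gaussian soft projection; the hard per-cell averaging here is a simpler substitute chosen only to show how 2D features end up in a shared top-down grid.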

If this is right

The model achieves 25.91 percent higher accuracy than general foundation models and 11.69 percent higher than specialized spatial reasoning models across tasks.
Accuracy is particularly strong on complex geometric relationship reasoning.
The multi-view consistency loss produces features that remain stable under changes in viewpoint and scale.
Joint UAV-satellite processing reduces errors from occlusions and perspective distortions compared with single-view methods.
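To make the consistency idea concrete, here is one possible form such a loss could take. The abstract does not specify the formulation, so this is an assumption rather than the authors' loss: it penalizes the cosine distance between paired UAV and satellite embeddings of the same location and adds it to the task loss with a hypothetical weight.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def consistency_loss(uav_feats: Sequence[Sequence[float]],
                     sat_feats: Sequence[Sequence[float]]) -> float:
    """Mean (1 - cosine similarity) over paired embeddings.

    The value is 0 when every UAV embedding points in the same direction
    as its satellite counterpart, and grows as the views disagree.
    """
    if len(uav_feats) != len(sat_feats) or not uav_feats:
        raise ValueError("need the same non-zero number of UAV and satellite embeddings")
    dists = [1.0 - cosine_similarity(u, s) for u, s in zip(uav_feats, sat_feats)]
    return sum(dists) / len(dists)


def total_loss(task_loss: float,
               uav_feats: Sequence[Sequence[float]],
               sat_feats: Sequence[Sequence[float]],
               weight: float = 0.1) -> float:
    """Task loss plus the weighted consistency term (weight is a placeholder)."""
    return task_loss + weight * consistency_loss(uav_feats, sat_feats)
```

A contrastive variant with negatives and a temperature would be an equally plausible reading; which one the paper uses, and with what weight, is exactly what the referee report asks the authors to state.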

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same encoder and alignment structure could be adapted to other pairs of sensors, such as ground cameras paired with aerial views.
Real-time deployment on UAVs would require checking whether the 3D reconstruction step runs at acceptable speeds on embedded hardware.
Extending the dataset to rural or indoor scenes would test whether the performance pattern persists outside the original urban focus.

Load-bearing premise

The reported gains depend on the SatAgent-SR130K dataset capturing representative urban conditions and the improvements arising specifically from the new 3D encoder and alignment modules rather than from dataset choices or tuning.

What would settle it

Evaluating the model on an independent dataset with different city layouts, camera angles, or scale ranges and observing no accuracy advantage over baseline models would indicate the central claim does not hold.
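One way to decide whether an accuracy advantage is present on such an independent set is a paired bootstrap over per-example correctness. The sketch below is a generic statistical check, not something the paper reports; the function name and defaults are hypothetical, and it assumes both models were scored on the same examples.

```python
import random
from typing import Sequence, Tuple


def accuracy_gap_ci(correct_a: Sequence[bool],
                    correct_b: Sequence[bool],
                    n_resamples: int = 2000,
                    alpha: float = 0.05,
                    seed: int = 0) -> Tuple[float, float, float]:
    """Percentile-bootstrap interval for accuracy(A) - accuracy(B).

    Returns (observed_gap, lower_bound, upper_bound). If the interval
    contains 0, the data do not show a clear advantage for either model.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty result lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-example difference: +1 if only A is right, -1 if only B is right.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    gaps = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        gaps.append(sum(resample) / n)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

Resampling the paired differences, rather than each model's results separately, keeps the comparison tied to the same scenes, which matters when some scenes are hard for every model.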

Figures

Figures reproduced from arXiv: 2606.31467 by Chen Gao, Fanglong Yao, Kun Fu, Peng Hu, Wenyi Zhang, Xian Sun, Youzhi Liu, Zhengqiu Zhu.

**Figure 2.** (a) SatAgent jointly leverages the satellite perspective (global semantic priors) and UAV perspective (local geometric depth) via functionally …

**Figure 3.** Overall architecture of the SatAgent. The model consists of three core modules working in concert: (1) Dual-Channel Collaborative Encoder: Inspired by the human dual visual pathways, it models complementary spatial information through a semantic prototype pathway and a geometric feature injection pathway, and implements bidirectional information feedback via bidirectional cross-stream gating; (2) Geometric…

**Figure 4.** Data flow of the Geometric-Aware 3D Reconstruction Encoder. Three inputs converge in EWA Gaussian soft projection: the Depth-Geometry Branch …

**Figure 5.** Data flow of the Dynamic k-NN Graph Construction and Propagation …

**Figure 6.** Overview of the SatAgent-SR130K dataset: (a) Dataset samples from eight spatial reasoning categories; (b) vocabulary distribution; (c) length distribution …

**Figure 7.** Hyperparameter sensitivity analysis. (a) Effect of the number of k …

**Figure 8.** Complete encoding pipeline visualization of the geometry-aware 3D reconstruction encoder: (a) per-pixel depth estimation by UniDepth; (b) 3D …

**Figure 9.** Qualitative comparison of SatAgent and GPT-5.4 on 3D spatial reasoning tasks; the third column shows failure cases.

**Figure 10.** Pearson correlation matrix of Token F1 scores across the eight …

Original abstract

With the rapid advancement of aerospace embodied intelligence, enabling Unmanned Aerial Vehicles (UAVs) to autonomously understand and reason about complex environments has become increasingly important. However, existing UAV-based spatial reasoning approaches face critical limitations: single-view perception renders them vulnerable to occlusions and perspective distortions, while most VLMs lack explicit geometric modeling, relying on semantic cues and yielding inconsistent reasoning under viewpoint and scale variations. To address these challenges, we propose SatAgent, a UAV-Satellite collaborative spatial reasoning model inspired by the dual-pathway mechanism of the human visual system. By jointly leveraging satellite and UAV perspectives, SatAgent enables robust, accurate reasoning in complex urban environments. We first introduce a Geometric-Aware 3D Reconstruction Encoder that elevates 2D UAV features into explicit 3D spatial representations. Next, we design a multi-view topology-semantic alignment module integrating cross-view features within a unified BEV coordinate system. We further introduce a multi-view consistency loss encouraging viewpoint-invariant representations. Finally, we construct SatAgent-SR130K, the first large-scale UAV-Satellite collaborative multi-view spatial reasoning dataset. Experiments show SatAgent outperforms state-of-the-art general-purpose foundation models and specialized spatial reasoning models by 25.91% and 11.69%, respectively, across diverse tasks, achieving particularly high accuracy in complex geometric relationship reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SatAgent brings a new UAV-satellite fusion model and what it claims is the first large-scale collaborative dataset, but the headline gains lack ablations that separate the architecture from dataset and training choices.


The paper introduces SatAgent, which adds a Geometric-Aware 3D Reconstruction Encoder to lift UAV features into 3D, a multi-view topology-semantic alignment module in BEV space, and a consistency loss. It also releases SatAgent-SR130K as the first large-scale UAV-satellite spatial reasoning dataset. These pieces target a concrete gap: single-view drone perception struggles with occlusions and scale changes, and most VLMs skip explicit geometry.

The approach is straightforward and the motivation from the dual visual pathway is clear enough. Combining overhead satellite context with UAV detail makes sense for urban scenes, and the BEV alignment plus consistency term are reasonable ways to enforce cross-view agreement.

The soft spot is exactly the one in the stress-test note. The abstract states 25.91% and 11.69% gains over general and specialized models, yet supplies no ablations that keep the dataset fixed and remove the 3D encoder or alignment module. No dataset statistics, error bars, or split details appear either. Without those controls it is impossible to tell whether the numbers come from the proposed modules or from how the new tasks and scenes were curated. Generalization beyond the collected urban views therefore stays untested.

This is for people working on aerial embodied AI or multi-view spatial reasoning who need a starting dataset and fusion baseline. A reader could borrow the encoder and alignment ideas even if the numbers require re-checking.

It deserves peer review so referees can request the missing ablations and data details; the practical setting is narrow but the engineering question is well-posed.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SatAgent, a UAV-satellite collaborative spatial reasoning model inspired by the dual visual pathway theory. It introduces a Geometric-Aware 3D Reconstruction Encoder to lift 2D UAV features into explicit 3D representations, a multi-view topology-semantic alignment module operating in a unified BEV coordinate system, and a multi-view consistency loss to promote viewpoint-invariant features. The authors also release the SatAgent-SR130K dataset for UAV-satellite multi-view spatial reasoning. The central claim is that SatAgent outperforms general-purpose foundation models by 25.91% and specialized spatial reasoning models by 11.69% across tasks, with strongest results on complex geometric relationship reasoning.

Significance. If the performance margins can be attributed to the proposed 3D encoder and cross-view alignment rather than dataset construction or training choices, the work would advance UAV embodied intelligence by providing an explicit geometric pathway that mitigates single-view occlusions and scale inconsistencies. The new collaborative dataset is a concrete enabling contribution for the community.

major comments (2)

[Experiments] Experiments section: The headline gains of 25.91% and 11.69% are reported without ablation experiments that train the same backbone on SatAgent-SR130K while removing the Geometric-Aware 3D Reconstruction Encoder or the BEV alignment module. Without such controls it is impossible to determine whether the margins arise from the dual-pathway architecture or from the new dataset and task formulation.
[Method] Method section (multi-view consistency loss): The loss is described only at a high level as encouraging viewpoint-invariant representations; its exact form (contrastive, regression, or otherwise), temperature, and relative weighting to the primary task loss are not specified, preventing assessment of whether it is load-bearing for the geometric reasoning accuracy.
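The controls requested above can be laid out as a small configuration grid. This is a sketch of how such an ablation plan might be enumerated, not the authors' experimental setup; the component keys are hypothetical labels for the three modules named in the abstract, and every configuration is assumed to train the same backbone on the same SatAgent-SR130K split.

```python
from itertools import product
from typing import Dict, List, Sequence

# Hypothetical switch names for the three proposed components.
COMPONENTS = ("geo3d_encoder", "bev_alignment", "consistency_loss")


def ablation_grid(components: Sequence[str] = COMPONENTS) -> List[Dict[str, bool]]:
    """Every on/off combination of the components; the full model comes first."""
    return [dict(zip(components, flags))
            for flags in product((True, False), repeat=len(components))]


def leave_one_out(components: Sequence[str] = COMPONENTS) -> List[Dict[str, bool]]:
    """The cheaper plan: disable exactly one component per run."""
    return [{name: name != dropped for name in components}
            for dropped in components]


if __name__ == "__main__":
    for config in leave_one_out():
        disabled = [name for name, enabled in config.items() if not enabled]
        print("train without:", ", ".join(disabled))
```

The leave-one-out plan needs three extra training runs and answers the referee's question directly; the full grid of eight runs also exposes interactions, for example whether the alignment module only helps when the 3D encoder is present.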

minor comments (1)

[Abstract] Abstract: Performance figures are stated without error bars, number of runs, or basic dataset statistics (train/val/test sizes, scene diversity), which is standard practice for empirical claims in computer vision.
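Reporting of the kind asked for here can be as simple as a mean and sample standard deviation over independent training runs. The sketch below shows that summary; the helper names are hypothetical and the accuracy values in the usage example are invented placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float, int]:
    """Mean, sample standard deviation, and run count for repeated runs."""
    if len(accuracies) < 2:
        raise ValueError("at least two runs are needed to estimate spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies), len(accuracies)


def format_result(accuracies: Sequence[float]) -> str:
    """Render runs as 'mean ± std (n=runs)' with two decimal places."""
    mean, std, n = summarize_runs(accuracies)
    return f"{mean:.2f} ± {std:.2f} (n={n})"


if __name__ == "__main__":
    # Placeholder accuracies from three hypothetical seeds.
    print(format_result([70.0, 72.0, 74.0]))  # prints "72.00 ± 2.00 (n=3)"
```

With only a handful of seeds the standard deviation is itself a rough estimate, so stating the number of runs alongside it is part of the report rather than an optional extra.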

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and outline the revisions we will make.


Referee: [Experiments] Experiments section: The headline gains of 25.91% and 11.69% are reported without ablation experiments that train the same backbone on SatAgent-SR130K while removing the Geometric-Aware 3D Reconstruction Encoder or the BEV alignment module. Without such controls it is impossible to determine whether the margins arise from the dual-pathway architecture or from the new dataset and task formulation.

Authors: We agree that the current experiments do not include the requested ablations on the same backbone and dataset. In the revised manuscript we will add these controls: training the identical backbone on SatAgent-SR130K after removing the Geometric-Aware 3D Reconstruction Encoder and, separately, after removing the BEV alignment module. The new results will be reported to isolate the contribution of each proposed component.

revision: yes

Referee: [Method] Method section (multi-view consistency loss): The loss is described only at a high level as encouraging viewpoint-invariant representations; its exact form (contrastive, regression, or otherwise), temperature, and relative weighting to the primary task loss are not specified, preventing assessment of whether it is load-bearing for the geometric reasoning accuracy.

Authors: We acknowledge that the multi-view consistency loss is described at an insufficient level of detail. In the revision we will specify the exact loss formulation (including whether it is contrastive or regression-based), the temperature hyper-parameter if used, and the scalar weight applied relative to the primary task loss. revision: yes

Circularity Check

0 steps flagged

No circularity: new architecture and dataset presented without self-referential reductions


The paper constructs SatAgent from scratch with a Geometric-Aware 3D Reconstruction Encoder, multi-view topology-semantic alignment module, and multi-view consistency loss, then evaluates on the newly introduced SatAgent-SR130K dataset. No equations, loss terms, or performance metrics are shown to reduce by construction to fitted parameters from the same data or to prior self-citations. The derivation chain consists of independent design choices and empirical reporting rather than any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, no explicit free parameters, axioms, or invented entities; all such elements are unknown.

pith-pipeline@v0.9.1-grok · 5797 in / 1127 out tokens · 21142 ms · 2026-07-01T06:16:33.336464+00:00 · methodology



D. Zheng, S. Huang, and L. Wang, “Video-3D LLM: Learning position- aware video representation for 3D scene understanding,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[60]

BEVFormer: Learning bird’s-eye-view representation from LiDAR- camera via spatiotemporal transformers,

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai, “BEVFormer: Learning bird’s-eye-view representation from LiDAR- camera via spatiotemporal transformers,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, 2024

2024
[61]

RSVQA: Visual question answering for remote sensing data,

S. Lobry, D. Marcos, J. Murray, and D. Tuia, “RSVQA: Visual question answering for remote sensing data,” inIEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, 2020, pp. 8555–8566

2020
[62]

Rsvg: Exploring data and models for visual grounding on remote sensing data,

Y . Zhan, Z. Yuan, B. Xiong, J. Su, Q. Wang, J. Gui, R. Wang, K. Wang, and X. X. Zhu, “Rsvg: Exploring data and models for visual grounding on remote sensing data,” inISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, 2023, pp. 89–101

2023
[63]

EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering,

K. Wang, R. Wang, J. Zhao, J. Guo, B. Dang, Y . Zhang, and S. Xiang, “EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 6258–6266

2024
[64]

STAR: A first- ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery,

Y . Li, J. Luo, Y . Zhang, Y . Tan, J.-G. Yu, and S. Bai, “STAR: A first- ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[65]

RemoteCLIP: A vision language foundation model for remote JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 sensing,

F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou, “RemoteCLIP: A vision language foundation model for remote JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

2021
[66]

GeoChat: Grounded large vision-language model for remote sensing,

K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, “GeoChat: Grounded large vision-language model for remote sensing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 831–27 840

2024
[67]

SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model,

Y . Zhan, B. Xiong, and Z. Yuan, “SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model,”arXiv preprint arXiv:2401.09712, 2024

work page arXiv 2024
[68]

University-1652: A multi-view multi- source benchmark for drone-based geo-localization,

Z. Zheng, Y . Wei, and Y . Yang, “University-1652: A multi-view multi- source benchmark for drone-based geo-localization,” inProceedings of the ACM International Conference on Multimedia, 2020, pp. 1395– 1403

2020
[69]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning (ICML), 2021, pp. 8748–8763

2021
[70]

Unidepth: Universal monocular metric depth estimation,

L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Seg `u, S. Li, L. Van Gool, and F. Yu, “Unidepth: Universal monocular metric depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 106–10 116

2024
[71]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141

2018
[72]

BEVFusion: Multi-task multi-sensor fusion with unified BEV repre- sentation,

Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, “BEVFusion: Multi-task multi-sensor fusion with unified BEV repre- sentation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5795–5805

2023
[73]

Ewa volume splatting,

M. Zwicker, H. Pfister, J. van Baar, and M. Gross, “Ewa volume splatting,” inProceedings of the 28th annual conference on Computer graphics and interactive techniques (SIGGRAPH), 2001, pp. 29–38

2001
[74]

Spatial transformer networks,

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” inAdvances in Neural Information Processing Systems (NeurIPS), 2015

2015
[75]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008

2017
[76]

Graph attention networks,

P. Veli ˇckovi´c, G. Cucurull, A. Casanova, A. Romero, P. Li `o, and Y . Bengio, “Graph attention networks,” inInternational Conference on Learning Representations (ICLR), 2018

2018
[77]

Gated Multimodal Units for Information Fusion

J. Arevalo, T. Solorio, M. Montes-y G ´omez, and F. A. Gonz ´alez, “Gated multimodal units for information fusion,”arXiv preprint arXiv:1702.01992, 2017. [Online]. Available: https://arxiv.org/abs/ 1702.01992

work page internal anchor Pith review Pith/arXiv arXiv 2017
[78]

Qlora: Efficient finetuning of quantized llms,

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” inAdvances in Neural Infor- mation Processing Systems (NeurIPS), 2023

2023
[79]

Representation Learning with Contrastive Predictive Coding

A. van den Oord, Y . Li, and O. Vinyals, “Representation learning with contrastive predictive coding,”arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[80]

Barlow twins: Self-supervised learning via redundancy reduction,

J. Zbontar, L. Jing, I. Misra, Y . LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” inInternational Conference on Machine Learning (ICML), 2021, pp. 12 310–12 320

2021

Showing first 80 references.

[1] [1]

Aeroverse-review: Comprehensive survey on aerial em- bodied vision-and-language navigation,

F. Yao, Y . Liu, W. Zhang, Z. Zhu, C. Li, N. Liu, P. Hu, Y . Yue, K. Wei, X. Heet al., “Aeroverse-review: Comprehensive survey on aerial em- bodied vision-and-language navigation,”The Innovation Informatics, vol. 1, no. 1, p. 100015, 2025

2025

[2] [2]

Aeroverse: Uav-agent benchmark suite for simulating, pre-training, finetuning, and evaluating aerospace embodied foundation models,

F. Yao, Y . Yue, Y . Liu, Z. Wang, L. Jin, B. Zhao, J. Zhao, X. Sun, and K. Fu, “Aeroverse: Uav-agent benchmark suite for simulating, pre-training, finetuning, and evaluating aerospace embodied foundation models,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, pp. 1–18, 2026

2026

[3] [3]

Navagent: Multi- scale urban street view fusion for uav embodied vision-and-language navigation,

Y . Liu, F. Yao, Y . Yue, G. Xu, X. Sun, and K. Fu, “Navagent: Multi- scale urban street view fusion for uav embodied vision-and-language navigation,”arXiv preprint arXiv:2411.08579, 2024

work page arXiv 2024

[4] [4]

Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,

J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” inPro- ceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910

2017

[5] [5]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,

P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. S ¨underhauf, I. Reid, S. Gould, and A. Van Den Hengel, “Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3674–3683

2018

[6] [6]

Matterport3D: Learning from RGB-D Data in Indoor Environments

A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y . Zhang, “Matterport3d: Learning from rgb-d data in indoor environments,”arXiv preprint arXiv:1709.06158, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[7] [7]

Learning to Navigate in Complex Environments

P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. J. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuogluet al., “Learning to navigate in complex environments,”arXiv preprint arXiv:1611.03673, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[8] [8]

AI2-THOR: An Interactive 3D Environment for Visual AI

E. Kolve, R. Mottaghi, W. Han, E. VanderBilt, L. Weihs, A. Herrasti, M. Deitke, K. Ehsani, D. Gordon, Y . Zhuet al., “Ai2-thor: An inter- active 3d environment for visual ai,”arXiv preprint arXiv:1712.05474, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

Knowledge-based embodied question answering,

S. Tan, M. Ge, D. Guo, H. Liu, and F. Sun, “Knowledge-based embodied question answering,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 10, pp. 11 948–11 960, 2023

2023

[10] [10]

Depth and video segmentation based visual attention for embodied question answering,

H. Luo, G. Lin, Y . Yao, F. Liu, Z. Liu, and Z. Tang, “Depth and video segmentation based visual attention for embodied question answering,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 6807–6819, 2023

2023

[11] [11]

3D Gaussian splatting for real-time radiance field rendering,

B. Kerbl, G. Kopanas, T. Leimk ¨uhler, and G. Drettakis, “3D Gaussian splatting for real-time radiance field rendering,”ACM Transactions on Graphics, vol. 42, no. 4, 2023

2023

[12] [12]

NeRF: Representing scenes as neural radiance fields for view synthesis,

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoor- thi, and R. Ng, “NeRF: Representing scenes as neural radiance fields for view synthesis,” inComputer Vision – ECCV 2020, ser. Lecture Notes in Computer Science, vol. 12346. Springer, 2020, pp. 405– 421

2020

[13] [13]

Unmanned aerial vehicle-neural radiance field (UA V-NeRF): Learning multiview drone three-dimensional reconstruction with neural radiance field,

L. Li, Y . Zhang, Z. Jiang, Z. Wang, L. Zhang, and H. Gao, “Unmanned aerial vehicle-neural radiance field (UA V-NeRF): Learning multiview drone three-dimensional reconstruction with neural radiance field,” Remote Sensing, vol. 16, no. 22, p. 4168, 2024

2024

[14] [14]

Depth anything V2,

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything V2,” inAdvances in Neural Information Processing Systems (NeurIPS), 2024

2024

[15] [15]

SkyScenes: A synthetic dataset for aerial scene understand- ing,

S. Khose, A. Pal, A. Agarwal, Deepanshi, J. Hoffman, and P. Chat- topadhyay, “SkyScenes: A synthetic dataset for aerial scene understand- ing,” inComputer Vision – ECCV 2024, ser. Lecture Notes in Computer Science, vol. 15137. Springer, 2024, pp. 19–35

2024

[16] [16]

Semantic-driven autonomous visual navigation for unmanned aerial vehicles,

P. Yue, J. Xin, Y . Zhang, Y . Lu, and M. Shan, “Semantic-driven autonomous visual navigation for unmanned aerial vehicles,”IEEE Transactions on Industrial Electronics, vol. 71, no. 11, pp. 14 853– 14 863, 2024

2024

[17] [17]

Vision-based navigation techniques for unmanned aerial vehicles: Review and challenges,

M. Y . Arafat, M. M. Alam, and S. Moh, “Vision-based navigation techniques for unmanned aerial vehicles: Review and challenges,” Drones, vol. 7, no. 2, p. 89, 2023

2023

[18] [18]

Learning perception- aware agile flight in cluttered environments,

Y . Song, K. Shi, R. Penicka, and D. Scaramuzza, “Learning perception- aware agile flight in cluttered environments,” inProceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 1989–1995

2023

[19] [19]

Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models,

M. Jia, Z. Qi, S. Zhang, W. Zhang, X. Yu, J. He, H. Wang, and L. Yi, “Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models,”arXiv preprint arXiv:2506.03135, 2025

work page arXiv 2025

[20] [20]

Visual spatial reasoning,

F. Liu, G. Emerson, and N. Collier, “Visual spatial reasoning,”Trans- actions of the Association for Computational Linguistics, vol. 11, pp. 635–651, 2023

2023

[21] [21]

Mind the gap: Benchmarking spatial reasoning in vision-language models,

I. Stogiannidis, S. McDonagh, and S. A. Tsaftaris, “Mind the gap: Benchmarking spatial reasoning in vision-language models,”arXiv preprint arXiv:2503.19707, 2025

work page arXiv 2025

[22] [22]

Spatialrgpt: Grounded spatial reasoning in vision-language models,

A.-C. Cheng, H. Yin, Y . Fu, Q. Guo, R. Yang, J. Kautz, X. Wang, and S. Liu, “Spatialrgpt: Grounded spatial reasoning in vision-language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 135 062–135 093, 2024

2024

[23] [23]

Accurate 3-dof camera geo-localization via ground-to-satellite image matching,

Y . Shi, X. Yu, L. Liu, D. Campbell, P. Koniusz, and H. Li, “Accurate 3-dof camera geo-localization via ground-to-satellite image matching,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 2682–2697, 2022

2022

[24] [24]

arXiv preprint arXiv:2503.01773 , year=

S. Chen, T. Zhu, R. Zhou, J. Zhang, S. Gao, J. C. Niebles, M. Geva, J. He, J. Wu, and M. Li, “Why is spatial reasoning hard for vlms? an attention mechanism perspective on focus areas,”arXiv preprint arXiv:2503.01773, 2025

work page arXiv 2025

[25] [25]

Sphere: Unveiling spatial blind spots in vision- language models through hierarchical evaluation,

W. Zhang, W. E. Ng, L. Ma, Y . Wang, J. Zhao, A. Koenecke, B. Li, and W. Wanglu, “Sphere: Unveiling spatial blind spots in vision- language models through hierarchical evaluation,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 11 591–11 609

2025

[26] [26]

Is a picture worth a thousand words? delving into spatial reasoning for vision language models,

J. Wang, Y . Ming, Z. Shi, V . Vineet, X. Wang, S. Li, and N. Joshi, “Is a picture worth a thousand words? delving into spatial reasoning for vision language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 75 392–75 421, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 17

2024

[27] [27]

A neurological dissociation between perceiving objects and grasping them,

M. A. Goodale, A. D. Milner, L. S. Jakobson, and D. P. Carey, “A neurological dissociation between perceiving objects and grasping them,”Nature, vol. 349, no. 6305, pp. 154–156, 1991

1991

[28] [28]

Separate visual pathways for perception and action,

M. A. Goodale and A. D. Milner, “Separate visual pathways for perception and action,”Trends in Neurosciences, vol. 15, no. 1, pp. 20–25, 1992

1992

[29] [29]

Two cortical visual systems,

L. G. Ungerleider, “Two cortical visual systems,”Analysis of visual behavior, vol. 549, pp. chapter–18, 1982

1982

[30] [30]

Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching,

M. Chu, Z. Zheng, W. Ji, T. Wang, and T.-S. Chua, “Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 213–231

2024

[31] [31]

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. [Online]. Available: https://arxiv.org/abs/2412.14171

work page internal anchor Pith review Pith/arXiv arXiv 2025

[32] [32]

ViewSpatial-Bench: Evaluating multi-perspective spatial localization in vision-language models,

R. Li, S. Li, L. Kong, X. Yang, and J. Liang, “ViewSpatial-Bench: Evaluating multi-perspective spatial localization in vision-language models,”arXiv preprint arXiv:2505.21500, 2025. [Online]. Available: https://arxiv.org/abs/2505.21500

work page arXiv 2025

[33] [33]

MM- Spatial: Exploring 3D spatial understanding in multimodal LLMs,

E. Daxberger, N. Wenzel, D. Griffiths, H. Gang, J. Lazarow, G. Kohavi, K. Kang, M. Eichner, Y . Yang, A. Dehghan, and P. Grasch, “MM- Spatial: Exploring 3D spatial understanding in multimodal LLMs,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 7395–7408

2025

[34] [34]

MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

S. Yang, R. Xu, Y . Xie, S. Yang, M. Li, J. Lin, C. Zhu, X. Chen, H. Duan, X. Yueet al., “MMSI-Bench: A benchmark for multi-image spatial intelligence,” inInternational Conference on Learning Representations, 2026. [Online]. Available: https: //arxiv.org/abs/2505.23764

work page internal anchor Pith review Pith/arXiv arXiv 2026

[35] [35]

Urbench: A comprehensive benchmark for eval- uating large multimodal models in multi-view urban scenarios,

B. Zhou, H. Yang, D. Chen, J. Ye, T. Bai, J. Yu, S. Zhang, D. Lin, C. He, and W. Li, “Urbench: A comprehensive benchmark for eval- uating large multimodal models in multi-view urban scenarios,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2025

2025

[36] [36]

All-angles bench: Can mllms answer different perspective questions well?

C.-H. Yeh, C. Wang, S. Tong, T.-Y . Cheng, Y . Zhai, Y . Chen, S. Gao, and Y . Ma, “All-angles bench: Can mllms answer different perspective questions well?” 2025

2025

[37] [37]

AirCopBench: A benchmark for multi-drone collaborative embodied perception and reasoning,

J. Zha, Y . Fan, T. Zhang, G. Chen, Y . Chen, C. Gao, and X. Chen, “AirCopBench: A benchmark for multi-drone collaborative embodied perception and reasoning,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 2, 2026, pp. 1507–1515

2026

[38] [38]

CityCube: Benchmarking cross-view spatial reasoning on vision-language models in urban environments,

H. Xu, Y . Hu, Z. Zhu, C. Gao, Z. Wang, J. Rao, W. Lu, W. Li, Q. Yin, and Y . Li, “CityCube: Benchmarking cross-view spatial reasoning on vision-language models in urban environments,” arXiv preprint arXiv:2601.14339, 2026. [Online]. Available: https: //arxiv.org/abs/2601.14339

work page arXiv 2026

[39] [39]

Open3dvqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space,

W. Zhanget al., “Open3dvqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space,” in Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM), 2025

2025

[40] [40]

Uavreason: A unified, large-scale benchmark for multimodal aerial scene reasoning and generation,

o. Li, “Uavreason: A unified, large-scale benchmark for multimodal aerial scene reasoning and generation,” 2026

2026

[41] [41]

Are VLMs lost between sky and space? LinkS 2Bench for UA V- satellite dynamic cross-view spatial intelligence,

D. Liu, Y . Zheng, J. Feng, G. Li, G. Shi, D. Li, and W. Dong, “Are VLMs lost between sky and space? LinkS 2Bench for UA V- satellite dynamic cross-view spatial intelligence,”arXiv preprint arXiv:2604.02020, 2026. [Online]. Available: https://arxiv.org/abs/ 2604.02020

work page arXiv 2026

[42] [42]

Spatialladder: Progressive train- ing for spatial reasoning in vision-language models,

H. Li, D. Li, Z. Wang, Y . Yan, H. Wu, W. Zhang, Y . Shen, W. Lu, J. Xiao, and Y . Zhuang, “Spatialladder: Progressive train- ing for spatial reasoning in vision-language models,”arXiv preprint arXiv:2510.08531, 2025

work page arXiv 2025

[43] [43]

Spatial- vilt: Enhancing visual spatial reasoning through multi-task learning,

C. M. Islam, O. Mamo, S. J. Chacko, X. Liu, and W. Yu, “Spatial- vilt: Enhancing visual spatial reasoning through multi-task learning,” inInternational Symposium on Visual Computing. Springer, 2025, pp. 47–58

2025

[44] [44]

Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to spatial reasoning,

Y . Tang, A. Qu, Z. Wang, D. Zhuang, Z. Wu, W. Ma, S. Wang, Y . Zheng, Z. Zhao, and J. Zhao, “Sparkle: Mastering basic spatial capabilities in vision language models elicits generalization to spatial reasoning,”Findings of the Association for Computational Linguistics: EMNLP, vol. 2025, pp. 4083–4103, 2025

2025

[45] [45]

Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models,

Y .-H. Liao, R. Mahmood, S. Fidler, and D. Acuna, “Reasoning paths with reference objects elicit quantitative spatial reasoning in large vision-language models,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 17 028– 17 047

2024

[46] [46]

3D- LLM: Injecting the 3D world into large language models,

Y . Hong, H. Zhen, P. Chen, S. Zheng, Y . Du, Z. Chen, and C. Gan, “3D- LLM: Injecting the 3D world into large language models,” inAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023

[47] [47]

LLaV A-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness,

C. Zhu, T. Wang, W. Zhang, J. Pang, and X. Liu, “LLaV A-3D: A simple yet effective pathway to empowering LMMs with 3D-awareness,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025

2025

[48] [48]

ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning,

Q. Gu, A. Kuwajerwala, S. Morin, K. M. Jatavallabhula, B. Sen, A. Agarwal, C. Rivera, W. Paul, K. Ellis, R. Chellappa, C. Gan, C. M. de Melo, J. B. Tenenbaum, A. Torralba, F. Shkurti, and L. Paull, “ConceptGraphs: Open-vocabulary 3D scene graphs for perception and planning,” inProceedings of the IEEE International Conference on Robotics and Automation (IC...

2024

[49] [49]

Kosmos-2: Grounding multimodal large language models to the world,

Z. Peng, W. Wang, L. Dong, Y . Hao, S. Huang, S. Ma, and F. Wei, “Kosmos-2: Grounding multimodal large language models to the world,” inInternational Conference on Learning Representations (ICLR), 2024

2024

[50] [50]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and R. Zhao, “Shikra: Unleashing multimodal LLM’s referential dialogue magic,” arXiv preprint arXiv:2306.15195, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[51] [51]

Ferret: Refer and ground anything anywhere at any granularity,

H. You, H. Zhang, Z. Gan, X. Du, B. Zhang, Z. Wang, L. Cao, S.-F. Chang, and Y . Yang, “Ferret: Refer and ground anything anywhere at any granularity,” inInternational Conference on Learning Representa- tions (ICLR), 2024

2024

[52] [52]

RegionGPT: Towards region understanding vision language model,

Q. Guo, S. De Mello, H. Yin, W. Byeon, K. C. Cheung, Y . Yu, P. Luo, and S. Liu, “RegionGPT: Towards region understanding vision language model,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13 796– 13 806

2024

[53] [53]

Osprey: Pixel understanding with visual instruction tuning,

Y . Yuan, W. Li, J. Liu, D. Tang, X. Luo, C. Qin, L. Zhang, and J. Zhu, “Osprey: Pixel understanding with visual instruction tuning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 28 202–28 211

2024

[54] [54]

Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,

B. Chen, Z. Xu, S. Kirmani, B. Ichter, D. Sadigh, L. Guibas, and F. Xia, “Spatialvlm: Endowing vision-language models with spatial reasoning capabilities,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 455–14 465

2024

[55] [55]

Ucdnet: Multi-uav collaborative 3-d object detection network by reliable feature mapping,

P. Tian, Z. Wang, P. Chenget al., “Ucdnet: Multi-uav collaborative 3-d object detection network by reliable feature mapping,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–16, 2024

2024

[56] [56]

Drones help drones: A col- laborative framework for multi-drone object trajectory prediction and beyond,

Z. Wang, P. Cheng, M. Chenet al., “Drones help drones: A col- laborative framework for multi-drone object trajectory prediction and beyond,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 64 604–64 628

2024

[57] [57]

MANTIS: Interleaved multi-image instruction tuning,

D. Jiang, X. He, H. Zeng, C. Wei, M. W. Ku, Q. Liu, and W. Chen, “MANTIS: Interleaved multi-image instruction tuning,”Transactions on Machine Learning Research (TMLR), 2024

2024

[58] [58]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

F. Li, R. Zhang, H. Zhang, Y . Zhang, B. Li, W. Li, Z. Ma, and C. Li, “LLaV A-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models,”arXiv preprint arXiv:2407.07895, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[59] [59]

Video-3D LLM: Learning position- aware video representation for 3D scene understanding,

D. Zheng, S. Huang, and L. Wang, “Video-3D LLM: Learning position- aware video representation for 3D scene understanding,” inProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025

[60] [60]

BEVFormer: Learning bird’s-eye-view representation from LiDAR- camera via spatiotemporal transformers,

Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y . Qiao, and J. Dai, “BEVFormer: Learning bird’s-eye-view representation from LiDAR- camera via spatiotemporal transformers,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, 2024

2024

[61] [61]

RSVQA: Visual question answering for remote sensing data,

S. Lobry, D. Marcos, J. Murray, and D. Tuia, “RSVQA: Visual question answering for remote sensing data,” inIEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 12, 2020, pp. 8555–8566

2020

[62] [62]

Rsvg: Exploring data and models for visual grounding on remote sensing data,

Y . Zhan, Z. Yuan, B. Xiong, J. Su, Q. Wang, J. Gui, R. Wang, K. Wang, and X. X. Zhu, “Rsvg: Exploring data and models for visual grounding on remote sensing data,” inISPRS Journal of Photogrammetry and Remote Sensing, vol. 196, 2023, pp. 89–101

2023

[63] [63]

EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering,

K. Wang, R. Wang, J. Zhao, J. Guo, B. Dang, Y . Zhang, and S. Xiang, “EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 6258–6266

2024

[64] [64]

STAR: A first- ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery,

Y . Li, J. Luo, Y . Zhang, Y . Tan, J.-G. Yu, and S. Bai, “STAR: A first- ever dataset and a large-scale benchmark for scene graph generation in large-size satellite imagery,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024

[65] [65]

RemoteCLIP: A vision language foundation model for remote JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 sensing,

F. Liu, D. Chen, Z. Guan, X. Zhou, J. Zhu, Q. Ye, L. Fu, and J. Zhou, “RemoteCLIP: A vision language foundation model for remote JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 sensing,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–16, 2024

2021

[66] [66]

GeoChat: Grounded large vision-language model for remote sensing,

K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, “GeoChat: Grounded large vision-language model for remote sensing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27 831–27 840

2024

[67] [67]

SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model,

Y . Zhan, B. Xiong, and Z. Yuan, “SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model,”arXiv preprint arXiv:2401.09712, 2024

work page arXiv 2024

[68] [68]

University-1652: A multi-view multi- source benchmark for drone-based geo-localization,

Z. Zheng, Y . Wei, and Y . Yang, “University-1652: A multi-view multi- source benchmark for drone-based geo-localization,” inProceedings of the ACM International Conference on Multimedia, 2020, pp. 1395– 1403

2020

[69] [69]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational Conference on Machine Learning (ICML), 2021, pp. 8748–8763

2021

[70] [70]

Unidepth: Universal monocular metric depth estimation,

L. Piccinelli, Y .-H. Yang, C. Sakaridis, M. Seg `u, S. Li, L. Van Gool, and F. Yu, “Unidepth: Universal monocular metric depth estimation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 10 106–10 116

2024

[71] [71]

Squeeze-and-excitation networks,

J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141

2018

[72] [72]

BEVFusion: Multi-task multi-sensor fusion with unified BEV representation,

Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, “BEVFusion: Multi-task multi-sensor fusion with unified BEV representation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5795–5805

2023

[73] [73]

EWA volume splatting,

M. Zwicker, H. Pfister, J. van Baar, and M. Gross, “EWA volume splatting,” in Proceedings of the 28th annual conference on Computer graphics and interactive techniques (SIGGRAPH), 2001, pp. 29–38

2001

[74] [74]

Spatial transformer networks,

M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial transformer networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015

2015

[75] [75]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008

2017

[76] [76]

Graph attention networks,

P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” in International Conference on Learning Representations (ICLR), 2018

2018

[77] [77]

Gated Multimodal Units for Information Fusion

J. Arevalo, T. Solorio, M. Montes-y-Gómez, and F. A. González, “Gated multimodal units for information fusion,” arXiv preprint arXiv:1702.01992, 2017. [Online]. Available: https://arxiv.org/abs/1702.01992

2017

[78] [78]

QLoRA: Efficient finetuning of quantized LLMs,

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “QLoRA: Efficient finetuning of quantized LLMs,” in Advances in Neural Information Processing Systems (NeurIPS), 2023

2023

[79] [79]

Representation Learning with Contrastive Predictive Coding

A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018

2018

[80] [80]

Barlow twins: Self-supervised learning via redundancy reduction,

J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in International Conference on Machine Learning (ICML), 2021, pp. 12 310–12 320

2021