VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing Natural Scenes

Cheng Li; Di Wang; Duanchu Wang; Jing Huang; Junjie Yang; Lin Cui; Zhouyi Wu; Zihang Cheng

arXiv: 2605.25784 · v1 · pith:7FFA2K6S · submitted 2026-05-25 · cs.CV · cs.MM


Jing Huang, Duanchu Wang, Junjie Yang, Zihang Cheng, Cheng Li, Lin Cui, Zhouyi Wu, Di Wang

Pith reviewed 2026-06-29 22:29 UTC · model grok-4.3

classification: cs.CV · cs.MM

keywords: multimodal large language models · remote sensing · canopy height models · geospatial reasoning · benchmark · 3D perception · semantic disambiguation · height cues


The pith

Current MLLMs read canopy height data but rarely apply it to resolve spectral ambiguities in natural remote sensing scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VertiCue-Bench, a diagnostic set of 1,534 instances across 17 tasks, to test whether multimodal large language models can use explicit vertical structure from canopy height models to disambiguate remote sensing scenes where different land covers share similar textures. Evaluations of 14 general and remote-sensing MLLMs show they achieve some competence at extracting height information yet largely fail to integrate it with image data for correct semantic judgments, and sometimes perform worse than RGB-only versions. This dissociation indicates that models treat height as an isolated signal rather than a constraint that narrows possible interpretations of the scene. A sympathetic reader would care because many geospatial tasks require exactly this step from geometry to meaning, such as distinguishing forest types or urban features that look alike from above.

Core claim

VertiCue-Bench reveals a perception-reasoning dissociation: models exhibit emerging competence in reading raw CHM height cues, yet largely fail to translate geometric perception into reliable semantic reasoning and often underperform RGB-only baselines when joint constraints are required.

What carries the argument

VertiCue-Bench, a benchmark of 1,534 curated instances across 17 tasks that separate low-level height perception from ambiguity-aware semantic reasoning.

If this is right

Models that pass isolated height-reading tests will still produce unreliable answers on tasks requiring joint image-and-height constraints.
Adding canopy height data can degrade performance relative to image-only input when the model lacks a mechanism to treat height as a disambiguating constraint.
Progress on geospatial MLLMs will require explicit training or architectures that enforce consistency between geometric measurements and semantic categories.
Existing remote-sensing benchmarks that omit vertical structure will continue to overestimate model capability on natural scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future benchmarks could add controlled noise to the height maps to measure how sensitive reasoning remains when the geometric cue is imperfect.
The observed dissociation suggests that training objectives focused only on captioning or question answering may not encourage the model to treat height as evidence that rules out certain interpretations.
Architectures that maintain separate streams for appearance and structure until a late fusion stage might close the gap observed here.
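The first extension above, perturbing the height maps, can be sketched in a few lines. This is an illustration only and not something the paper describes: the function name `add_height_noise`, the Gaussian noise model, and the nested-list layout of the height grid (values in metres) are all assumptions.

```python
import random

def add_height_noise(chm, sigma_m, seed=0):
    """Return a copy of a canopy height grid with Gaussian noise added.

    chm      -- list of rows, each a list of heights in metres (assumed layout)
    sigma_m  -- standard deviation of the noise, in metres
    Heights are clamped at 0 because a canopy height cannot be negative.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    return [[max(0.0, h + rng.gauss(0.0, sigma_m)) for h in row] for row in chm]

# Hypothetical 2x3 tile: low vegetation on the left, tall canopy on the right.
chm = [[0.5, 1.0, 18.0],
       [0.0, 1.5, 21.0]]
noisy = add_height_noise(chm, sigma_m=2.0)
```

Sweeping `sigma_m` over a few values and re-running the reasoning tasks on each noisy copy would show how quickly accuracy degrades as the geometric cue becomes less reliable.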

Load-bearing premise

The 1,534 instances and 17 tasks were designed and selected so that success or failure can be attributed specifically to the ability to combine height with appearance rather than to other factors in data curation or prompting.

What would settle it

Run the same 14 models on a fresh collection of scenes where height differences are the only reliable cue for the correct label. If accuracy with RGB+CHM input stays near or below the RGB-only baseline, the dissociation claim holds; a clear gain from adding CHM would count against it.
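The comparison behind this test is simple to state in code. The sketch below assumes a hypothetical record format (one dictionary per model answer, with `task`, `condition`, `prediction`, and `label` keys) and the condition names `"rgb"` and `"rgb+chm"`; none of these come from the paper.

```python
from collections import defaultdict

def accuracy_by_task(records, condition):
    """Per-task accuracy for one input condition, e.g. 'rgb' or 'rgb+chm'."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        if r["condition"] != condition:
            continue
        totals[r["task"]] += 1
        hits[r["task"]] += int(r["prediction"] == r["label"])
    return {task: hits[task] / totals[task] for task in totals}

def height_gain(records):
    """Joint (RGB+CHM) accuracy minus RGB-only accuracy, per task."""
    rgb = accuracy_by_task(records, "rgb")
    joint = accuracy_by_task(records, "rgb+chm")
    return {task: joint[task] - rgb[task] for task in rgb if task in joint}

# Toy records; the task name and class labels are made up.
records = [
    {"task": "T1", "condition": "rgb", "prediction": "shrub", "label": "forest"},
    {"task": "T1", "condition": "rgb", "prediction": "forest", "label": "forest"},
    {"task": "T1", "condition": "rgb+chm", "prediction": "shrub", "label": "forest"},
    {"task": "T1", "condition": "rgb+chm", "prediction": "shrub", "label": "forest"},
]
gain = height_gain(records)
```

A gain at or below zero on scenes where height is the only reliable cue would be the pattern the dissociation claim predicts.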

Figures

Figures reproduced from arXiv: 2605.25784 by Cheng Li, Di Wang, Duanchu Wang, Jing Huang, Junjie Yang, Lin Cui, Zhouyi Wu, Zihang Cheng.

**Figure 1.** Hierarchical capability taxonomy of VertiCue-Bench. The benchmark is structured into two capability domains, Perception and Reasoning, spanning five levels from low-level CHM reading and RGB–CHM matching to ambiguity-aware semantic–geometric reasoning, with representative examples shown for each leaf task.

L4 (Cross-modal Matching, CMM) aims to evaluate whether the model can establish stable and reliable c…

**Figure 2.** Overview of the construction pipeline of VertiCue-Bench.

**Figure 3.** Overview of the dataset statistics. (a) Sample counts and answer distribution for each task. (b) Semantic…

**Figure 4.** Examples of VertiCue-Bench evaluations.

**Figure 5.** Perception–Reasoning Capability Comparison.

**Figure 6.** Results on CHM counterfactual testing. Flip Success Rate (FSR) and Placebo Stability Rate (PSR) is the…

**Figure 7.** Error Decomposition.

RemoteReasoner attains a relatively high FSR while maintaining a PSR close to its baseline, indicating improved sensitivity to CHM interventions without sacrificing robustness to irrelevant edits. Nevertheless, its overall accuracy still lags behind the strongest general-purpose models. Overall, the key challenge lies not in resisting placebo edits, but in consistently responding to eff…
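The captions name two counterfactual metrics, Flip Success Rate (FSR) and Placebo Stability Rate (PSR), but this page does not reproduce their definitions. The sketch below therefore rests on an assumed reading: FSR as the share of effective CHM edits after which the model's answer moves to the new correct target, and PSR as the share of irrelevant (placebo) edits after which the answer stays unchanged. The case format and field names are hypothetical.

```python
def flip_success_rate(cases):
    """Assumed FSR: fraction of effective edits where the answer reaches the new target."""
    effective = [c for c in cases if c["edit"] == "effective"]
    if not effective:
        return float("nan")
    flipped = sum(c["answer_after"] == c["target_after"] for c in effective)
    return flipped / len(effective)

def placebo_stability_rate(cases):
    """Assumed PSR: fraction of placebo edits where the answer does not change."""
    placebo = [c for c in cases if c["edit"] == "placebo"]
    if not placebo:
        return float("nan")
    stable = sum(c["answer_after"] == c["answer_before"] for c in placebo)
    return stable / len(placebo)

# Toy cases with made-up answers.
cases = [
    {"edit": "effective", "answer_before": "forest", "answer_after": "shrub", "target_after": "shrub"},
    {"edit": "effective", "answer_before": "forest", "answer_after": "forest", "target_after": "shrub"},
    {"edit": "placebo", "answer_before": "forest", "answer_after": "forest"},
    {"edit": "placebo", "answer_before": "shrub", "answer_after": "shrub"},
]
fsr = flip_success_rate(cases)
psr = placebo_stability_rate(cases)
```

Under this reading, a model that ignores the height input entirely would score a high PSR and a low FSR, which is why the two rates are only informative together.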

Original abstract

Multimodal Large Language Models (MLLMs) have recently shown promising progress in geospatial reasoning. However, existing remote sensing benchmarks remain largely 2D-centric, evaluating models primarily on optical appearance. In natural environments, this paradigm breaks down due to severe spectral confusion, where ecologically distinct regions share similar textures but differ fundamentally in vertical structure. In such cases, explicit 3D structural data, such as Canopy Height Models (CHMs), become essential geometric evidence for semantic disambiguation. Yet, it remains unclear whether current MLLMs can genuinely leverage vertical cues to resolve appearance-level ambiguity. To address this gap, we introduce VertiCue-Bench, the first diagnostic benchmark for CHM-grounded geospatial reasoning. VertiCue-Bench comprises 1,534 carefully curated instances across 17 tasks, explicitly disentangling low-level height perception from ambiguity-aware semantic reasoning. Evaluations on 14 state-of-the-art general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, reveal a striking perception-reasoning dissociation. While models exhibit emerging competence in reading raw CHM height cues, they largely fail to translate geometric perception into reliable semantic reasoning, often underperforming RGB-only baselines when joint constraints are required. Overall, VertiCue-Bench exposes a critical geometry-to-semantics gap in natural scene understanding, offering actionable insights for advancing geospatial MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper builds a benchmark to check if MLLMs actually apply CHM height data to resolve spectral ambiguities in remote sensing scenes, and reports that perception works better than the downstream reasoning step.


The main takeaway is that VertiCue-Bench tries to isolate whether models can read canopy height from CHM inputs and then use that geometry to disambiguate scenes where RGB textures overlap. They run 14 models across 17 tasks and 1,534 instances with modality swaps, and the pattern they describe is that raw height reading shows up but joint RGB+CHM reasoning often drops below RGB-only performance.

What is new is the explicit split between low-level height perception tasks and the ambiguity-aware semantic tasks, plus the counterfactual tests. That framing is not in the earlier remote-sensing MLLM benchmarks referenced in the abstract, and it targets a real practical issue in natural scenes where vertical structure should matter.

The paper does a reasonable job naming the geometry-to-semantics gap and showing that current models are not closing it. The idea of holding out the height cue and measuring whether it helps is straightforward and worth testing.

The soft spot is that the abstract gives almost no information on how the 1,534 instances were selected or validated to guarantee that height is the only reliable disambiguator. If RGB texture already correlates with height in the chosen scenes, or if other cues suffice, then the reported underperformance on joint inputs could come from fusion issues rather than a clean failure to use geometry. The stress-test concern about curation therefore stands until the full paper shows explicit checks for unique disambiguation cases. The abstract also reports no error bars or significance numbers, so the size of the dissociation is hard to judge.

This is for people working on multimodal models for geospatial or remote-sensing applications. A reader who needs a diagnostic for height cue usage would find the task split useful even if the numbers need more scrutiny.

It should go to peer review. The benchmark direction is worth referee time, but the construction details and statistical support will need to be checked in the full manuscript.

Referee Report

1 major / 2 minor

Summary. The paper introduces VertiCue-Bench, a diagnostic benchmark with 1,534 instances across 17 tasks that explicitly disentangles low-level height perception (via Canopy Height Models) from ambiguity-aware semantic reasoning in remote sensing natural scenes. Evaluations of 14 general and remote-sensing-specialized MLLMs, combined with counterfactual modality testing, claim to reveal a perception-reasoning dissociation: models show emerging competence in reading raw CHM cues but largely fail to translate this into reliable semantic reasoning and often underperform RGB-only baselines when joint constraints are required.

Significance. If the dissociation holds after proper validation of the benchmark tasks, the work would be significant for exposing a geometry-to-semantics gap in current MLLMs for geospatial reasoning. It provides a new diagnostic benchmark that could guide improvements in integrating 3D structural data with semantic understanding, particularly in natural scenes with spectral confusion.

major comments (1)

[Abstract / VertiCue-Bench construction] The central claim of a perception-reasoning dissociation rests on the 1,534 instances across 17 tasks being constructed such that low-level CHM reading is separable from ambiguity-aware semantic use, and that joint RGB+CHM inputs require the height cue for correct reasoning. The abstract states the instances were 'carefully curated to explicitly disentangle' these capabilities, but provides no details on validation of the disentanglement, confirmation of unique disambiguation cases, or checks against confounds such as RGB texture already correlating with height or alternative non-height cues sufficing. Without such validation, underperformance on joint inputs could reflect fusion or prompting artifacts rather than a true gap (see also the weakest assumption in the provided reader's take).

minor comments (2)

[Abstract] The abstract reports evaluations on 14 models but does not mention error bars, statistical significance testing, or variance across runs, which are needed to assess whether the reported dissociation is robust.
[Abstract] Details on task construction, data selection criteria, and how the 17 tasks were designed to isolate height perception from semantic reasoning are absent from the abstract and would be required for reproducibility.
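One common way to supply the error bars requested in the first minor comment is a paired bootstrap over benchmark instances. The sketch below is a generic illustration, not the authors' procedure; the function name, the 0/1 correctness lists, and the toy values are assumptions.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for accuracy(b) - accuracy(a) on the same instances.

    correct_a, correct_b -- equal-length lists of 0/1 correctness flags,
    e.g. RGB-only (a) and RGB+CHM (b) answers to the same questions.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("both conditions must cover the same non-empty instance set")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample instances with replacement, keeping the a/b pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Toy correctness flags for eight instances (made up).
rgb_only = [1, 1, 0, 1, 0, 1, 1, 0]
joint = [1, 0, 0, 1, 0, 0, 1, 0]
lower, upper = paired_bootstrap_ci(rgb_only, joint)
```

An interval that lies entirely below zero would support the claim that adding CHM hurts on that task; an interval straddling zero would mean the reported drop cannot be distinguished from sampling noise.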

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for identifying the need for greater transparency on benchmark validation. We address the major comment below and will incorporate clarifications in the revision.


Referee: [Abstract / VertiCue-Bench construction] The central claim of a perception-reasoning dissociation rests on the 1,534 instances across 17 tasks being constructed such that low-level CHM reading is separable from ambiguity-aware semantic use, and that joint RGB+CHM inputs require the height cue for correct reasoning. The abstract states the instances were 'carefully curated to explicitly disentangle' these capabilities, but provides no details on validation of the disentanglement, confirmation of unique disambiguation cases, or checks against confounds such as RGB texture already correlating with height or alternative non-height cues sufficing. Without such validation, underperformance on joint inputs could reflect fusion or prompting artifacts rather than a true gap (see also the weakest assumption in the provided reader's take).

Authors: We agree that the abstract is necessarily concise and does not itself contain the validation details. Section 3.2 and Appendix A of the manuscript describe the curation pipeline: candidate scenes were first filtered for spectral confusion where RGB-only performance is near chance; expert annotators then verified that CHM supplies the sole disambiguating cue (with inter-annotator agreement reported); and each task includes explicit RGB-only, CHM-only, and joint conditions to isolate the contribution of height. We acknowledge, however, that quantitative checks for residual RGB-height correlations and exhaustive tests ruling out alternative non-height cues (e.g., shadow or texture gradients) are only summarized rather than tabulated. We will add a dedicated subsection and supplementary table reporting these confound analyses in the revised manuscript.

revision: partial
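The rebuttal mentions inter-annotator agreement without saying how it is measured. Cohen's kappa is one standard choice for two annotators, and the sketch below assumes it purely for illustration; the yes/no judgments ("does CHM supply the sole disambiguating cue?") are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on four candidate scenes.
annotator_1 = ["yes", "yes", "no", "no"]
annotator_2 = ["yes", "no", "no", "no"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain match rate when one answer dominates.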

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct measurements


The paper introduces VertiCue-Bench as an empirical diagnostic dataset and reports model performance on it. No derivations, equations, fitted parameters, or predictions exist that could reduce to inputs by construction. The dissociation claim is measured from held-out model outputs under different modality conditions rather than being defined in terms of the benchmark itself. Curation is presented as a methodological choice, not a self-referential definition or self-citation load-bearing step. No self-citations or ansatzes are invoked in the provided text to support core results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the curated benchmark instances validly isolate height perception from semantic reasoning and that the 14 evaluated models are representative; no free parameters, axioms, or invented entities are introduced beyond standard MLLM evaluation practices.



Reference graph

Works this paper leans on

42 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad S

Kartik Kuckreja, Muhammad S. Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad S. Khan. Geochat: Grounded large vision-language model for remote sensing.The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[4]

Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9), 2024

Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci, and Farid Melgani. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9), 2024

2024
[5]

Vhm: Versatile and honest vision language model for remote sensing image analysis.Proceedings of the AAAI Conference on Artificial Intelligence, 39(6):6381–6388, Apr

Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, and Conghui He. Vhm: Versatile and honest vision language model for remote sensing image analysis.Proceedings of the AAAI Conference on Artificial Intelligence, 39(6):6381–6388, Apr. 2025

2025
[6]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

2025
[7]

Geopixel: Pixel grounding large multimodal model in remote sensing

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing. InForty-second International Conference on Machine Learning, 2025

2025
[8]

Towards faithful reasoning in remote sensing: A perceptually- grounded geospatial chain-of-thought for vision-language models

Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. Towards faithful reasoning in remote sensing: A perceptually- grounded geospatial chain-of-thought for vision-language models. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[9]

Re- motereasoner: Towards unifying geospatial reasoning workflow.Proceedings of the AAAI Conference on Artificial Intelligence, 40(14):11883–11891, Mar

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Re- motereasoner: Towards unifying geospatial reasoning workflow.Proceedings of the AAAI Conference on Artificial Intelligence, 40(14):11883–11891, Mar. 2026

2026
[10]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

2025
[11]

Hrvqa: A visual question answering benchmark for high-resolution aerial images.ISPRS Journal of Photogrammetry and Remote Sensing, 214:65–81, 2024

Kun Li, George V osselman, and Michael Ying Yang. Hrvqa: A visual question answering benchmark for high-resolution aerial images.ISPRS Journal of Photogrammetry and Remote Sensing, 214:65–81, 2024

2024
[12]

Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering

Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 5481–5489, 2024. 12 VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity ...

2024
[13]

Rsvlm-qa: A benchmark dataset for remote sensing vision language model-based question answering

Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, and Mukesh Prasad. Rsvlm-qa: A benchmark dataset for remote sensing vision language model-based question answering. InProceedings of the 33rd ACM International Conference on Multimedia, pages 12905–12911, 2025

2025
[14]

Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding.Advances in Neural Information Processing Systems, 37:3229–3242, 2024

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding.Advances in Neural Information Processing Systems, 37:3229–3242, 2024

2024
[15]

Geobench-vlm: Benchmarking vision-language models for geospatial tasks

Muhammad Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan. Geobench-vlm: Benchmarking vision-language models for geospatial tasks. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7132–7142, 2025

2025
[16]

CHOICE: Benchmarking the remote sensing capabilities of large vision-language models

Xiao An, Jiaxing Sun, Zihan Gui, and Wei He. CHOICE: Benchmarking the remote sensing capabilities of large vision-language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

2025
[17]

Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336, 2025

2025
[18]

VLRS-Bench: A Vision-Language Reasoning Benchmark for Remote Sensing

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, and Bo Du. Vlrs-bench: A vision-language reasoning benchmark for remote sensing.arXiv preprint arXiv:2602.07045, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Geo3dvqa: Evaluating vision-language models for 3d geospatial reasoning from aerial imagery

Mai Tsujimoto, Junjue Wang, Weihao Xuan, and Naoto Yokoya. Geo3dvqa: Evaluating vision-language models for 3d geospatial reasoning from aerial imagery. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4994–5003, 2026

2026
[20]

A survey of image classification methods and techniques for improving classification performance.International journal of Remote sensing, 28(5):823–870, 2007

Dengsheng Lu and Qihao Weng. A survey of image classification methods and techniques for improving classification performance.International journal of Remote sensing, 28(5):823–870, 2007

2007
[21]

Lidar data fusion to improve forest attribute estimates: A review.Current Forestry Reports, 10(4):281–297, 2024

Mattia Balestra, Suzanne Marselis, Temuulen Tsagaan Sankey, Carlos Cabo, Xinlian Liang, Martin Mokroš, Xi Peng, Arunima Singh, Krzysztof Stere´nczak, Cedric Vega, et al. Lidar data fusion to improve forest attribute estimates: A review.Current Forestry Reports, 10(4):281–297, 2024

2024
[22]

Fine classification of urban tree species based on uav-based rgb imagery and lidar data.Forests, 15(2):390, 2024

Jingru Wu, Qixia Man, Xinming Yang, Pinliang Dong, Xiaotong Ma, Chunhui Liu, and Changyin Han. Fine classification of urban tree species based on uav-based rgb imagery and lidar data.Forests, 15(2):390, 2024

2024
[23]

A deep-learning-based tree species classification for natural secondary forests using unmanned aerial vehicle hyperspectral images and lidar.Ecological Indicators, 159:111608, 2024

Ye Ma, Yuting Zhao, Jungho Im, Yinghui Zhao, and Zhen Zhen. A deep-learning-based tree species classification for natural secondary forests using unmanned aerial vehicle hyperspectral images and lidar.Ecological Indicators, 159:111608, 2024

2024
[24]

Mapping urban tree species by integrating canopy height model with multi-temporal sentinel-2 data.Remote Sensing, 17(5):790, 2025

Yang Yao, Xiaoke Wang, Haiming Qin, Weimin Wang, and Weiqi Zhou. Mapping urban tree species by integrating canopy height model with multi-temporal sentinel-2 data.Remote Sensing, 17(5):790, 2025

2025
[25]

Object-based tree species classification using airborne hyperspectral images and lidar data.Forests, 11(1):32, 2019

Yanshuang Wu and Xiaoli Zhang. Object-based tree species classification using airborne hyperspectral images and lidar data.Forests, 11(1):32, 2019

2019
[26]

Remote sensing vision-language foundation models without annotations via ground remote alignment

Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl V ondrick, Bharath Hariharan, and Kavita Bala. Remote sensing vision-language foundation models without annotations via ground remote alignment. InICLR, 2024

2024
[27]

Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

2024
[28]

Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, pages 1–1, 2024

Zilun Zhang, Tiancheng Zhao, Yulong Guo, and Jianwei Yin. Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, pages 1–1, 2024

2024
[29]

Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model.ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77, 2025

Yang Zhan, Zhitong Xiong, and Yuan Yuan. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model.ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77, 2025

2025
[30]

Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. InComputer Vision – ECCV 2024, pages 440–457, Cham,

2024
[31]

Springer Nature Switzerland
[32]

Geollava-8k: scaling remote-sensing multimodal large language models to 8k resolution.arXiv preprint arXiv:2505.21375, 2025

Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, et al. Geollava-8k: scaling remote-sensing multimodal large language models to 8k resolution.arXiv preprint arXiv:2505.21375, 2025. 13 VertiCue-Bench: Diagnosing Whether MLLMs Use Height Cues to Resolve 2D Ambiguity in Remote Sensing...

work page arXiv 2025
[33]

Terramind: Large-scale generative multimodality for earth observation

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7383–7394, 2025

2025
[34]

Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution remote sensing imagery.arXiv preprint arXiv:2602.14201, 2026

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, et al. Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution remote sensing imagery.arXiv preprint arXiv:2602.14201, 2026

work page arXiv 2026
[35]

Show me what and where has changed? question answering and grounding for remote sensing change detection,

Ke Li, Fuyu Dong, Di Wang, Shaofeng Li, Quan Wang, Xinbo Gao, and Tat-Seng Chua. Show me what and where has changed? question answering and grounding for remote sensing change detection.arXiv preprint arXiv:2410.23828, 2024

work page arXiv 2024
[36]

Can large multimodal models understand agricultural scenes? benchmarking with agromind.arXiv preprint arXiv:2505.12207, 2025

Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind.arXiv preprint arXiv:2505.12207, 2025

work page arXiv 2025
[37]

Canopy height model and naip imagery pairs across conus.Scientific Data, 12(1):322, 2025

Brady W Allred, Sarah E McCord, and Scott L Morford. Canopy height model and naip imagery pairs across conus.Scientific Data, 12(1):322, 2025

2025
[38]

YiFan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, and Rong Jin. MME-realworld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? InThe Thirteenth International Conference on Learning Representations, 2025

2025
[39]

Openai gpt-5 system card, 2025

Aaditya Singh, Adam Fry, Adam Perelman, et al. Openai gpt-5 system card, 2025

2025
[40]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

2026
[41]

Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

Weiyun Wang, Zhangwei Gao, Lixin Gu, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025

2025
[42]

Mistralai/mistral-small-3.1-24b-instruct-2503 · hugging face

Team MistralAI. Mistralai/mistral-small-3.1-24b-instruct-2503 · hugging face
[43]

Phi-4 technical report, 2024

Marah Abdin, Jyoti Aneja, Harkirat Behl, et al. Phi-4 technical report, 2024

2024
[44]

Gemma 3 technical report, 2025

Gemma Team, Aishwarya Kamath, Johan Ferret, et al. Gemma 3 technical report, 2025. 14

2025

[1] [1]

Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad S

Kartik Kuckreja, Muhammad S. Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad S. Khan. Geochat: Grounded large vision-language model for remote sensing.The IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[2] [4]

Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9), 2024

Yakoub Bazi, Laila Bashmal, Mohamad Mahmoud Al Rahhal, Riccardo Ricci, and Farid Melgani. Rs-llava: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9), 2024

2024

[3] [5]

Vhm: Versatile and honest vision language model for remote sensing image analysis.Proceedings of the AAAI Conference on Artificial Intelligence, 39(6):6381–6388, Apr

Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, and Conghui He. Vhm: Versatile and honest vision language model for remote sensing image analysis.Proceedings of the AAAI Conference on Artificial Intelligence, 39(6):6381–6388, Apr. 2025

2025

[4] [6]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

2025

[5] [7]

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing. In Forty-second International Conference on Machine Learning, 2025.

[6] [8]

Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models. In The Fourteenth International Conference on Learning Representations, 2026.

[7] [9]

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, and Pai Peng. Remotereasoner: Towards unifying geospatial reasoning workflow. Proceedings of the AAAI Conference on Artificial Intelligence, 40(14):11883–11891, Mar. 2026.

[8] [10]

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark. ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025.

[9] [11]

Kun Li, George Vosselman, and Michael Ying Yang. Hrvqa: A visual question answering benchmark for high-resolution aerial images. ISPRS Journal of Photogrammetry and Remote Sensing, 214:65–81, 2024.

[10] [12]

Junjue Wang, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. Earthvqa: Towards queryable earth via relational reasoning-based remote sensing visual question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5481–5489, 2024.

[11] [13]

Xing Zi, Jinghao Xiao, Yunxiao Shi, Xian Tao, Jun Li, Ali Braytee, and Mukesh Prasad. Rsvlm-qa: A benchmark dataset for remote sensing vision language model-based question answering. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 12905–12911, 2025.

[12] [14]

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. Advances in Neural Information Processing Systems, 37:3229–3242, 2024.

[13] [15]

Muhammad Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Kartik Kuckreja, Fahad Shahbaz Khan, Paolo Fraccaro, Alexandre Lacoste, and Salman Khan. Geobench-vlm: Benchmarking vision-language models for geospatial tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7132–7142, 2025.

[14] [16]

Xiao An, Jiaxing Sun, Zihan Gui, and Wei He. CHOICE: Benchmarking the remote sensing capabilities of large vision-language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025.

[15] [17]

Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336, 2025.

[16] [18]

Zhiming Luo, Di Wang, Haonan Guo, Jing Zhang, and Bo Du. Vlrs-bench: A vision-language reasoning benchmark for remote sensing. arXiv preprint arXiv:2602.07045, 2026.

[17] [19]

Mai Tsujimoto, Junjue Wang, Weihao Xuan, and Naoto Yokoya. Geo3dvqa: Evaluating vision-language models for 3d geospatial reasoning from aerial imagery. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4994–5003, 2026.

[18] [20]

Dengsheng Lu and Qihao Weng. A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, 28(5):823–870, 2007.

[19] [21]

Mattia Balestra, Suzanne Marselis, Temuulen Tsagaan Sankey, Carlos Cabo, Xinlian Liang, Martin Mokroš, Xi Peng, Arunima Singh, Krzysztof Stereńczak, Cedric Vega, et al. Lidar data fusion to improve forest attribute estimates: A review. Current Forestry Reports, 10(4):281–297, 2024.

[20] [22]

Jingru Wu, Qixia Man, Xinming Yang, Pinliang Dong, Xiaotong Ma, Chunhui Liu, and Changyin Han. Fine classification of urban tree species based on uav-based rgb imagery and lidar data. Forests, 15(2):390, 2024.

[21] [23]

Ye Ma, Yuting Zhao, Jungho Im, Yinghui Zhao, and Zhen Zhen. A deep-learning-based tree species classification for natural secondary forests using unmanned aerial vehicle hyperspectral images and lidar. Ecological Indicators, 159:111608, 2024.

[22] [24]

Yang Yao, Xiaoke Wang, Haiming Qin, Weimin Wang, and Weiqi Zhou. Mapping urban tree species by integrating canopy height model with multi-temporal sentinel-2 data. Remote Sensing, 17(5):790, 2025.

[23] [25]

Yanshuang Wu and Xiaoli Zhang. Object-based tree species classification using airborne hyperspectral images and lidar data. Forests, 11(1):32, 2019.

[24] [26]

Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, and Kavita Bala. Remote sensing vision-language foundation models without annotations via ground remote alignment. In ICLR, 2024.

[25] [27]

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.

[26] [28]

Zilun Zhang, Tiancheng Zhao, Yulong Guo, and Jianwei Yin. Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, pages 1–1, 2024.

[27] [29]

Yang Zhan, Zhitong Xiong, and Yuan Yuan. Skyeyegpt: Unifying remote sensing vision-language tasks via instruction tuning with large language model. ISPRS Journal of Photogrammetry and Remote Sensing, 221:64–77, 2025.

[28] [30]

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In Computer Vision – ECCV 2024, pages 440–457, Cham, 2024. Springer Nature Switzerland.

[30] [32]

Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, et al. Geollava-8k: scaling remote-sensing multimodal large language models to 8k resolution. arXiv preprint arXiv:2505.21375, 2025.

[31] [33]

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7383–7394, 2025.

[32] [34]

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, et al. Geoeyes: On-demand visual focusing for evidence-grounded understanding of ultra-high-resolution remote sensing imagery. arXiv preprint arXiv:2602.14201, 2026.

[33] [35]

Ke Li, Fuyu Dong, Di Wang, Shaofeng Li, Quan Wang, Xinbo Gao, and Tat-Seng Chua. Show me what and where has changed? question answering and grounding for remote sensing change detection. arXiv preprint arXiv:2410.23828, 2024.

[34] [36]

Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025.

[35] [37]

Brady W Allred, Sarah E McCord, and Scott L Morford. Canopy height model and naip imagery pairs across conus. Scientific Data, 12(1):322, 2025.

[36] [38]

YiFan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, and Rong Jin. MME-realworld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? In The Thirteenth International Conference on Learning Representations, 2025.

[37] [39]

Aaditya Singh, Adam Fry, Adam Perelman, et al. Openai gpt-5 system card, 2025.

[38] [40]

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026.

[39] [41]

Weiyun Wang, Zhangwei Gao, Lixin Gu, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency, 2025.

[40] [42]

Team MistralAI. Mistralai/mistral-small-3.1-24b-instruct-2503 · hugging face.

[41] [43]

Marah Abdin, Jyoti Aneja, Harkirat Behl, et al. Phi-4 technical report, 2024.

[42] [44]

Gemma Team, Aishwarya Kamath, Johan Ferret, et al. Gemma 3 technical report, 2025.