
arxiv: 2605.17949 · v1 · pith:6WIHTUU3new · submitted 2026-05-18 · 💻 cs.CV

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Pith reviewed 2026-05-20 12:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords remote sensing · vision-language models · encoder-free architecture · image grounding · spatial reasoning · multimodal fusion · visual evidence · patch tokens

The pith

SkyNative feeds raw image patches directly into a language model for remote sensing reasoning to reduce reliance on language priors and retain spatial details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Remote sensing vision-language models typically compress images via pretrained encoders before language-model reasoning, which can discard fine local evidence needed for spatial tasks. SkyNative removes the visual encoder entirely and instead injects raw image patches as tokens into the language model space. A modality-aware decoupling step uses separate parameters for visual and text inputs inside one autoregressive backbone to keep the two modalities aligned. The authors evaluate this setup on standard remote sensing benchmarks plus a new visual reliance test that gradually degrades images or adds contradictory text prompts. If the approach holds, answers would track actual image content more closely even in very large remote sensing scenes.

Core claim

By adopting an encoder-free design that represents images as raw patch tokens inside the language-model token space and reconciling them through a modality-aware decoupling mechanism with modality-specific parameters, SkyNative preserves fine-grained spatial evidence that pretrained visual encoders tend to compress away, producing stronger image-grounded perception and greater robustness to prompt-induced language priors across remote sensing understanding and large-format spatial reasoning tasks.

What carries the argument

The modality-aware decoupling mechanism that inserts modality-specific parameters into a single autoregressive backbone to align raw visual patch tokens with textual tokens without a separate visual encoder.
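
One way to picture modality-specific parameters inside a single backbone is a layer that picks its weights per token from a modality tag while the interleaved sequence order stays untouched. The sketch below is a minimal pure-Python illustration under stated assumptions: the names `DecoupledLinear`, `VISION`, and `TEXT` are hypothetical, and the use of a plain linear map is an illustrative choice, since this page does not say which sublayers the paper decouples.

```python
from typing import List, Sequence

VISION, TEXT = "vision", "text"  # hypothetical modality tags

Matrix = List[List[float]]


def matvec(weight: Matrix, vec: Sequence[float]) -> List[float]:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


class DecoupledLinear:
    """Linear layer holding one weight matrix per modality (illustrative only)."""

    def __init__(self, vision_weight: Matrix, text_weight: Matrix) -> None:
        self.weights = {VISION: vision_weight, TEXT: text_weight}

    def forward(
        self, tokens: Sequence[Sequence[float]], modalities: Sequence[str]
    ) -> List[List[float]]:
        if len(tokens) != len(modalities):
            raise ValueError("each token needs exactly one modality tag")
        # Sequence order is preserved; only the parameters differ per token.
        return [matvec(self.weights[m], t) for t, m in zip(tokens, modalities)]


layer = DecoupledLinear(
    vision_weight=[[2.0, 0.0], [0.0, 2.0]],  # toy weights: scale by 2
    text_weight=[[1.0, 0.0], [0.0, 1.0]],    # toy weights: identity
)
mixed = layer.forward([[1.0, 3.0], [1.0, 3.0]], [VISION, TEXT])
print(mixed)  # [[2.0, 6.0], [1.0, 3.0]]
```

The same input vector comes out differently depending on its tag, which is the whole point of the routing: the two modalities share one sequence and one backbone but not one set of weights.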

If this is right

  • Remote sensing models could process ultra-high-resolution imagery without early loss of local object boundaries or textures.
  • Reasoning outputs would shift more when the actual image content changes and less when surrounding text is altered.
  • Training pipelines could skip separate large-scale visual pretraining stages for remote sensing data.
  • Spatial reasoning tasks that require precise localization would become more reliable under varying prompt conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same raw-patch approach might reduce language bias in other high-resolution imagery domains such as medical or aerial photography.
  • End-to-end training from pixels to answers could shorten the current two-stage vision-language pipeline.
  • Scaling the patch token count to even larger images would test whether the decoupling mechanism continues to prevent modality interference.
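
The scaling concern in the last bullet can be made concrete with simple arithmetic. The sketch below counts patch tokens for a square scene and the number of token pairs dense self-attention would score. The patch size of 16 and the helper names are assumptions for illustration; the paper's actual patch size is not stated on this page.

```python
import math


def patch_token_count(height: int, width: int, patch_size: int) -> int:
    """Number of patch tokens when the image is padded up to a multiple of patch_size."""
    if min(height, width, patch_size) <= 0:
        raise ValueError("all sizes must be positive")
    return math.ceil(height / patch_size) * math.ceil(width / patch_size)


def attention_pairs(num_tokens: int) -> int:
    """Upper bound on token pairs scored by self-attention; causal masking roughly halves it."""
    return num_tokens * num_tokens


for side in (1024, 4096, 8192):
    n = patch_token_count(side, side, patch_size=16)  # patch_size=16 is an assumption
    print(side, n, attention_pairs(n))
```

Under these assumptions an 8192 × 8192 scene yields 262,144 patch tokens, so the token count grows with image area and the attention cost with its square.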

Load-bearing premise

Raw image patches can be integrated directly into the language model token space while still retaining the fine spatial details that pretrained visual encoders normally lose.
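
A small sketch helps separate what this premise does and does not cover. Cutting an image into non-overlapping patches is exactly invertible, so any loss of spatial detail would have to come from the learned layers applied afterwards. The code below assumes a single-channel image and uses hypothetical helper names `patchify` and `unpatchify`; it is not the paper's embedding layer.

```python
from typing import List

Image = List[List[int]]  # single-channel image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[int]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + dy][left + dx] for dy in range(patch) for dx in range(patch)]
            )
    return tokens


def unpatchify(tokens: List[List[int]], height: int, width: int, patch: int) -> Image:
    """Inverse of patchify: place each flattened tile back at its grid position."""
    image = [[0] * width for _ in range(height)]
    per_row = width // patch
    for index, token in enumerate(tokens):
        top, left = (index // per_row) * patch, (index % per_row) * patch
        for offset, value in enumerate(token):
            image[top + offset // patch][left + offset % patch] = value
    return image


image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
print(tokens[0])  # [0, 1, 4, 5]
print(unpatchify(tokens, 4, 4, patch=2) == image)  # True
```

The round trip shows that raw patch tokens carry every pixel. Whether that information survives the projection into the language-model token space is a separate, empirical question.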

What would settle it

The claim predicts a specific pattern on the visual reliance benchmark: SkyNative accuracy would drop sharply under progressive image degradation yet remain stable under misleading textual prompts, while encoder-based models show the opposite pattern. Failing to observe that pattern would count against the claim.
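
The two probes described here can be sketched as a tiny evaluation harness. Everything in it is an assumption made for illustration: the `Model` callable interface, block averaging as the degradation, and the two toy stand-in models, which are not SkyNative or any real baseline and produce no real results.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[float]]
Sample = Tuple[Image, str, str]      # (image, question, gold answer)
Model = Callable[[Image, str], str]  # hypothetical interface, not the paper's API


def block_average(image: Image, block: int) -> Image:
    """Degrade an image by replacing each block x block tile with its mean."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, block):
        for left in range(0, width, block):
            cells = [(y, x) for y in range(top, min(top + block, height))
                     for x in range(left, min(left + block, width))]
            mean = sum(image[y][x] for y, x in cells) / len(cells)
            for y, x in cells:
                out[y][x] = mean
    return out


def accuracy(model: Model, samples: Sequence[Sample], block: int = 1, prefix: str = "") -> float:
    """Fraction of correct answers after degrading images and prepending a text prefix."""
    hits = sum(model(block_average(img, block), prefix + question) == answer
               for img, question, answer in samples)
    return hits / len(samples)


def reliance_profile(model: Model, samples: Sequence[Sample],
                     blocks: Sequence[int], misleading_prefix: str) -> Dict[str, object]:
    """Accuracy on clean inputs, under each degradation level, and under a misleading prefix."""
    return {
        "clean": accuracy(model, samples),
        "degraded": {b: accuracy(model, samples, block=b) for b in blocks},
        "misled": accuracy(model, samples, prefix=misleading_prefix),
    }


def pixel_reader(image: Image, prompt: str) -> str:
    """Toy stand-in that only looks at pixels."""
    return "yes" if max(max(row) for row in image) > 0.5 else "no"


def prompt_follower(image: Image, prompt: str) -> str:
    """Toy stand-in that only reads the prompt."""
    return "no" if "there is nothing" in prompt else "yes"


bright = [[0.0] * 4 for _ in range(4)]
bright[1][2] = 1.0  # one small bright object
samples = [(bright, "Is there a bright object?", "yes")]
prefix = "Note: there is nothing in this image. "
for name, model in (("pixel_reader", pixel_reader), ("prompt_follower", prompt_follower)):
    print(name, reliance_profile(model, samples, blocks=(2, 4), misleading_prefix=prefix))
```

The toy `pixel_reader` fails once the object is averaged away but ignores the misleading prefix, while `prompt_follower` does the reverse. These are the two extremes the benchmark is meant to tell apart.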

Figures

Figures reproduced from arXiv: 2605.17949 by Bo Yang, Haoran Liu, Jiaqi Liu, Jiasen Hu, Jiashun Zhu, Lang Sun, Ronghao Fu, Weijie Zhang, Weipeng Zhang, Xiao Yang, Xu Na, Zhiwen Lin, Zhuoran Duan.

Figure 1: Visual dependence analysis and structural comparison. (a) Comparison between encoder…
Figure 2: Overview of SkyNative framework.
  Adjacent text, §3.1 Vision-Language Encoding Layer: To preserve fine-grained textures and spatial structures in RS imagery, SkyNative removes the standalone pretrained visual encoder and directly constructs visual tokens from native image patches. Given an input RS image $I \in \mathbb{R}^{H \times W \times 3}$, we employ a lightweight convolutional patch embedding layer consisting of two convolutional layers with a G…
Figure 3: Comparison of active perception paradigms and performance.
  Adjacent text, Linguistic Misleading Test: We further assessed the model's robustness against linguistic misinformation to examine its behavior in the presence of textual distractors
Figure 4: Qualitative results of SkyNative on complex remote sensing imagery.
Figure 5: RSME-Bench example 1.
  Adjacent text, B.3 Prompt Template: To generate high-quality adversarial contexts, we feed raw imagery, manually curated questions, and options into the GPT-4o-mini API to synthesize targeted misleading prompts. The specific prompt template is shown below:
Figure 6: RSME-Bench example 2.
  Adjacent text, C Additional Experiment Results: In this section, we present a more granular breakdown of the quantitative results across four comprehensive large-format remote sensing benchmarks: XLRS-Bench, MME-RW-RS, LRS-VQA, and RSHR-Bench
Original abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents SkyNative, an encoder-free multimodal framework for remote sensing visual evidence reasoning. It directly feeds raw image patches as tokens into a unified autoregressive language-model backbone, reconciled only by a modality-aware decoupling mechanism that employs modality-specific parameters. The authors introduce a visual reliance benchmark that applies progressive visual degradation and misleading textual prompts to test whether model outputs are grounded in image evidence rather than language priors. Experiments on standard remote sensing understanding tasks and large-format spatial reasoning evaluations are reported to show stronger image-grounded perception and greater robustness to prompt-induced priors than encoder-based baselines.

Significance. If the central claims are substantiated, the work would be significant for remote sensing vision-language modeling: it offers a concrete alternative to the dominant pretrained-encoder pipeline and supplies a diagnostic benchmark for visual grounding. Demonstrating that native patch tokens can retain fine-grained spatial evidence without encoder compression would be a useful data point for high-resolution imagery applications. The benchmark itself could become a reusable tool for the community.

major comments (2)
  1. [§3.2] §3.2 (Modality-aware decoupling): The claim that raw patch tokens are reconciled 'without the compression losses of pretrained visual encoders' is load-bearing for the central thesis, yet the section provides no quantitative measure (e.g., mutual information, reconstruction PSNR, or token-level entropy) of information retained after the learned projection, positional encoding, and modality-specific parameter layers. Any such reconciliation necessarily introduces a learned transformation whose bottleneck is unquantified.
  2. [§5.1] §5.1 and Table 3 (Visual reliance benchmark results): The reported robustness gains under misleading prompts are presented without ablation isolating the contribution of the native patch representation from the modality-specific parameters or from training data differences. Without these controls it is unclear whether the observed improvements are attributable to the encoder-free design or to other modeling choices.
minor comments (2)
  1. [Figure 4] Figure 4 caption and axis labels use inconsistent terminology ('native tokens' vs. 'raw patches'); standardize notation across text and figures.
  2. [§3.1] The description of the autoregressive backbone in §3.1 does not specify whether the modality-specific parameters are frozen or jointly optimized; clarify the training protocol.
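
Two of the retention measures named in major comment 1, reconstruction PSNR and token-level entropy, are straightforward to compute once a reconstruction or a discretized token stream exists. The sketch below shows the standard definitions in pure Python. Treating tokens as discrete symbols assumes the continuous embeddings have been quantized first, which is an assumption of this example and not something the report specifies.

```python
import math
from collections import Counter
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


def shannon_entropy(symbols: Sequence[int]) -> float:
    """Empirical Shannon entropy in bits of a sequence of discrete symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A higher PSNR for patches reconstructed from the native tokens than from a pretrained encoder's features, on the same imagery, would be one kind of evidence for the retention claim. Entropy alone is weaker evidence, since a noisy representation can have high entropy without carrying more image information.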

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our encoder-free approach. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Modality-aware decoupling): The claim that raw patch tokens are reconciled 'without the compression losses of pretrained visual encoders' is load-bearing for the central thesis, yet the section provides no quantitative measure (e.g., mutual information, reconstruction PSNR, or token-level entropy) of information retained after the learned projection, positional encoding, and modality-specific parameter layers. Any such reconciliation necessarily introduces a learned transformation whose bottleneck is unquantified.

    Authors: We agree that a quantitative characterization of retained information would strengthen the central claim. The modality-aware decoupling does apply learned projections and modality-specific parameters to align raw patches with the language-model token space. In the revised manuscript we will add an analysis in §3.2 that reports token-level entropy before and after the projection layers and compares reconstruction fidelity (via a lightweight decoder) against a standard pretrained visual encoder on the same remote-sensing patches. revision: yes

  2. Referee: [§5.1] §5.1 and Table 3 (Visual reliance benchmark results): The reported robustness gains under misleading prompts are presented without ablation isolating the contribution of the native patch representation from the modality-specific parameters or from training data differences. Without these controls it is unclear whether the observed improvements are attributable to the encoder-free design or to other modeling choices.

    Authors: We concur that isolating the source of the robustness gains is necessary. The native patch representation and the modality-specific parameters are tightly coupled in the encoder-free design; however, we will add an ablation in the revised §5.1 that trains a controlled variant using the same backbone and data but with modality-specific parameters disabled (replaced by shared parameters). We will also clarify that all compared models were trained on the same remote-sensing corpora and report the effect of this control on the visual-reliance scores. revision: yes
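
The control discussed in the second response can be laid out as a small factorial grid. The sketch below goes one step beyond what the authors commit to: they propose only the shared-parameter variant, whereas this example also crosses it with the encoder choice, so that each factor's average contribution can be read off. The `Variant` fields and function names are hypothetical, and no scores are supplied because none are reported on this page.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List


@dataclass(frozen=True)
class Variant:
    """One cell of a hypothetical ablation grid; field names are illustrative."""
    native_patches: bool    # raw patch tokens vs. a pretrained visual encoder
    decoupled_params: bool  # modality-specific vs. shared parameters


def ablation_grid() -> List[Variant]:
    """All four combinations, assumed to share the same backbone and training data."""
    return [Variant(p, d) for p, d in product((False, True), repeat=2)]


def main_effects(scores: Dict[Variant, float]) -> Dict[str, float]:
    """Average score change from switching each factor on, averaged over the other factor."""
    def effect(field: str) -> float:
        on = [s for v, s in scores.items() if getattr(v, field)]
        off = [s for v, s in scores.items() if not getattr(v, field)]
        return sum(on) / len(on) - sum(off) / len(off)

    return {
        "native_patches": effect("native_patches"),
        "decoupled_params": effect("decoupled_params"),
    }
```

Feeding `main_effects` the visual-reliance score of each trained variant would show whether the robustness gain tracks the patch representation, the modality-specific parameters, or both.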

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and context describe an encoder-free architecture with a modality-aware decoupling mechanism and a new visual reliance benchmark. Performance claims are tied to evaluations on standard remote sensing tasks and large-format spatial reasoning, without any equations, fitted parameters renamed as predictions, or self-citations that reduce the central results to inputs by construction. The derivation remains self-contained against external benchmarks, consistent with a normal non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that raw patch tokens can be directly reconciled with text tokens via modality-specific parameters without external pretraining; no explicit free parameters, axioms, or invented entities are named in the abstract, but the modality-aware decoupling mechanism functions as an ad-hoc architectural choice.

pith-pipeline@v0.9.0 · 5748 in / 1254 out tokens · 30701 ms · 2026-05-20T12:51:44.909336+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
