SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Bo Yang; Haoran Liu; Jiaqi Liu; Jiasen Hu; Jiashun Zhu; Lang Sun; Ronghao Fu; Weijie Zhang; Weipeng Zhang; Xiao Yang

arxiv: 2605.17949 · v1 · pith:6WIHTUU3new · submitted 2026-05-18 · 💻 cs.CV

SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

Xiao Yang , Ronghao Fu , Zhiwen Lin , Zhuoran Duan , Jiashun Zhu , Jiasen Hu , Lang Sun , Weipeng Zhang

show 5 more authors

Jiaqi Liu Xu Na Haoran Liu Weijie Zhang Bo Yang

This is my paper

Pith reviewed 2026-05-20 12:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords remote sensingvision-language modelsencoder-free architectureimage groundingspatial reasoningmultimodal fusionvisual evidencepatch tokens

0 comments

The pith

SkyNative feeds raw image patches directly into a language model for remote sensing reasoning to reduce reliance on language priors and retain spatial details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Remote sensing vision-language models typically compress images via pretrained encoders before language-model reasoning, which can discard fine local evidence needed for spatial tasks. SkyNative removes the visual encoder entirely and instead injects raw image patches as tokens into the language model space. A modality-aware decoupling step uses separate parameters for visual and text inputs inside one autoregressive backbone to keep the two modalities aligned. The authors evaluate this setup on standard remote sensing benchmarks plus a new visual reliance test that gradually degrades images or adds contradictory text prompts. If the approach holds, answers would track actual image content more closely even in very large remote sensing scenes.

Core claim

By adopting an encoder-free design that represents images as raw patch tokens inside the language-model token space and reconciling them through a modality-aware decoupling mechanism with modality-specific parameters, SkyNative preserves fine-grained spatial evidence that pretrained visual encoders tend to compress away, producing stronger image-grounded perception and greater robustness to prompt-induced language priors across remote sensing understanding and large-format spatial reasoning tasks.

What carries the argument

The modality-aware decoupling mechanism that inserts modality-specific parameters into a single autoregressive backbone to align raw visual patch tokens with textual tokens without a separate visual encoder.

If this is right

Remote sensing models could process ultra-high-resolution imagery without early loss of local object boundaries or textures.
Reasoning outputs would shift more when the actual image content changes and less when surrounding text is altered.
Training pipelines could skip separate large-scale visual pretraining stages for remote sensing data.
Spatial reasoning tasks that require precise localization would become more reliable under varying prompt conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same raw-patch approach might reduce language bias in other high-resolution imagery domains such as medical or aerial photography.
End-to-end training from pixels to answers could shorten the current two-stage vision-language pipeline.
Scaling the patch token count to even larger images would test whether the decoupling mechanism continues to prevent modality interference.

Load-bearing premise

Raw image patches can be integrated directly into the language model token space while still retaining the fine spatial details that pretrained visual encoders normally lose.

What would settle it

On the visual reliance benchmark, SkyNative accuracy would drop sharply with progressive image degradation yet remain stable under misleading textual prompts, while encoder-based models show the opposite pattern.

Figures

Figures reproduced from arXiv: 2605.17949 by Bo Yang, Haoran Liu, Jiaqi Liu, Jiasen Hu, Jiashun Zhu, Lang Sun, Ronghao Fu, Weijie Zhang, Weipeng Zhang, Xiao Yang, Xu Na, Zhiwen Lin, Zhuoran Duan.

**Figure 2.** Figure 2: Overview of SkyNative framework. 3.1 Vision-Language Encoding Layer To preserve fine-grained textures and spatial structures in RS imagery, SkyNative removes the standalone pretrained visual encoder and directly constructs visual tokens from native image patches. Given an input RS image I ∈ R H×W×3 , we employ a lightweight convolutional patch embedding layer consisting of two convolutional layers with a G… view at source ↗

**Figure 3.** Figure 3: Comparison of active perception paradigms and performance. Linguistic Misleading Test. We further assessed the model’s robustness against linguistic misinformation to examine its behavior in the presence of textual distractors [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results of SkyNative on complex remote sensing imagery. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: RSME-Bench example 1. B.3 Prompt Template To generate high-quality adversarial contexts, we feed raw imagery, manually curated questions, and options into the GPT-4o-mini API to synthesize targeted misleading prompts. The specific prompt template is shown below: 15 [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: RSME-Bench example 2. C Additional Experiment Results In this section, we present a more granular breakdown of the quantitative results across four comprehensive large-format remote sensing benchmarks: XLRS-Bench, MME-RW-RS, LRS-VQA, and RSHR-Bench [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

read the original abstract

Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SkyNative skips the visual encoder for remote sensing VLMs by feeding raw patches with modality-specific reconciliation and adds a visual reliance benchmark, but the preservation claim needs more checks.

read the letter

Hi, the main point is that SkyNative drops pretrained visual encoders and feeds raw image patches directly as tokens into a unified autoregressive model for remote sensing, using a modality-aware decoupling step with dedicated parameters to handle the visual and text mix. They also introduce a benchmark that degrades the image progressively and adds misleading prompts to measure how much the model actually grounds answers in visual evidence rather than language priors. This targets a practical issue in high-resolution satellite imagery where early compression can hurt fine spatial reasoning. The benchmark itself is a useful diagnostic that could transfer to other multimodal setups, and the reported gains in image-grounded perception and robustness are worth examining if the experiments control for training differences. The paper does a reasonable job framing the compression problem and offering a concrete alternative architecture. On the soft spots, the central advantage still hinges on whether the patch-to-token reconciliation and parameter sharing avoid their own information bottlenecks, and the abstract does not quantify retained spatial detail or show ablations that isolate this effect. The stress-test concern about untested assumptions in the native path holds up based on what's described, so the robustness claims would need stronger supporting evidence in the full results. This is aimed at researchers working on vision-language models for earth observation who want alternatives to standard encoder pipelines or better tools to test visual grounding. A reader interested in practical robustness for spatial tasks would get value from the benchmark and the modeling direction. It deserves a serious referee to review the methods, ablations, and full evaluation details.

Referee Report

2 major / 2 minor

Summary. The paper presents SkyNative, an encoder-free multimodal framework for remote sensing visual evidence reasoning. It directly feeds raw image patches as tokens into a unified autoregressive language-model backbone, reconciled only by a modality-aware decoupling mechanism that employs modality-specific parameters. The authors introduce a visual reliance benchmark that applies progressive visual degradation and misleading textual prompts to test whether model outputs are grounded in image evidence rather than language priors. Experiments on standard remote sensing understanding tasks and large-format spatial reasoning evaluations are reported to show stronger image-grounded perception and greater robustness to prompt-induced priors than encoder-based baselines.

Significance. If the central claims are substantiated, the work would be significant for remote sensing vision-language modeling: it offers a concrete alternative to the dominant pretrained-encoder pipeline and supplies a diagnostic benchmark for visual grounding. Demonstrating that native patch tokens can retain fine-grained spatial evidence without encoder compression would be a useful data point for high-resolution imagery applications. The benchmark itself could become a reusable tool for the community.

major comments (2)

[§3.2] §3.2 (Modality-aware decoupling): The claim that raw patch tokens are reconciled 'without the compression losses of pretrained visual encoders' is load-bearing for the central thesis, yet the section provides no quantitative measure (e.g., mutual information, reconstruction PSNR, or token-level entropy) of information retained after the learned projection, positional encoding, and modality-specific parameter layers. Any such reconciliation necessarily introduces a learned transformation whose bottleneck is unquantified.
[§5.1] §5.1 and Table 3 (Visual reliance benchmark results): The reported robustness gains under misleading prompts are presented without ablation isolating the contribution of the native patch representation from the modality-specific parameters or from training data differences. Without these controls it is unclear whether the observed improvements are attributable to the encoder-free design or to other modeling choices.

minor comments (2)

[Figure 4] Figure 4 caption and axis labels use inconsistent terminology ('native tokens' vs. 'raw patches'); standardize notation across text and figures.
[§3.1] The description of the autoregressive backbone in §3.1 does not specify whether the modality-specific parameters are frozen or jointly optimized; clarify the training protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our encoder-free approach. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses

Referee: [§3.2] §3.2 (Modality-aware decoupling): The claim that raw patch tokens are reconciled 'without the compression losses of pretrained visual encoders' is load-bearing for the central thesis, yet the section provides no quantitative measure (e.g., mutual information, reconstruction PSNR, or token-level entropy) of information retained after the learned projection, positional encoding, and modality-specific parameter layers. Any such reconciliation necessarily introduces a learned transformation whose bottleneck is unquantified.

Authors: We agree that a quantitative characterization of retained information would strengthen the central claim. The modality-aware decoupling does apply learned projections and modality-specific parameters to align raw patches with the language-model token space. In the revised manuscript we will add an analysis in §3.2 that reports token-level entropy before and after the projection layers and compares reconstruction fidelity (via a lightweight decoder) against a standard pretrained visual encoder on the same remote-sensing patches. revision: yes
Referee: [§5.1] §5.1 and Table 3 (Visual reliance benchmark results): The reported robustness gains under misleading prompts are presented without ablation isolating the contribution of the native patch representation from the modality-specific parameters or from training data differences. Without these controls it is unclear whether the observed improvements are attributable to the encoder-free design or to other modeling choices.

Authors: We concur that isolating the source of the robustness gains is necessary. The native patch representation and the modality-specific parameters are tightly coupled in the encoder-free design; however, we will add an ablation in the revised §5.1 that trains a controlled variant using the same backbone and data but with modality-specific parameters disabled (replaced by shared parameters). We will also clarify that all compared models were trained on the same remote-sensing corpora and report the effect of this control on the visual-reliance scores. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The provided abstract and context describe an encoder-free architecture with a modality-aware decoupling mechanism and a new visual reliance benchmark. Performance claims are tied to evaluations on standard remote sensing tasks and large-format spatial reasoning, without any equations, fitted parameters renamed as predictions, or self-citations that reduce the central results to inputs by construction. The derivation remains self-contained against external benchmarks, consistent with a normal non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that raw patch tokens can be directly reconciled with text tokens via modality-specific parameters without external pretraining; no explicit free parameters, axioms, or invented entities are named in the abstract, but the modality-aware decoupling mechanism functions as an ad-hoc architectural choice.

pith-pipeline@v0.9.0 · 5748 in / 1254 out tokens · 30701 ms · 2026-05-20T12:51:44.909336+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

SkyNative adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. ... modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages

[1]

Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

work page 2024
[2]

Foundation models in remote sensing: Evolving from unimodality to multimodality.IEEE Geoscience and Remote Sensing Magazine, 2026

Danfeng Hong, Chenyu Li, Xuyang Li, Gustau Camps-Valls, and Jocelyn Chanussot. Foundation models in remote sensing: Evolving from unimodality to multimodality.IEEE Geoscience and Remote Sensing Magazine, 2026

work page 2026
[3]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021
[4]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

work page 2023
[5]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

work page 2025
[6]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27831–27840, 2024

work page 2024
[7]

O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley

Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: The illusion of visual understanding, 2026

work page 2026
[8]

The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20758–20769, 2025

work page 2025
[9]

Vhm: Versatile and honest vision language model for remote sensing image analysis

Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6381–6388, 2025

work page 2025
[10]

Geopixel: Pixel grounding large multimodal model in remote sensing

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing. In International Conference on Machine Learning, pages 54095–54111. PMLR, 2025

work page 2025
[11]

Geollava-8k: Scaling remote-sensing mul- timodal large language models to 8k resolution

Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Shan Boqi, Long Lan, Yulin Wang, et al. Geollava-8k: Scaling remote-sensing mul- timodal large language models to 8k resolution. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025
[12]

When large vision-language model meets large remote sensing imagery: Coarse- to-fine text-guided token pruning

Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, and Yansheng Li. When large vision-language model meets large remote sensing imagery: Coarse- to-fine text-guided token pruning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9206–9217, 2025

work page 2025
[13]

Zoomearth: Active perception for ultra-high-resolution geospatial vision-language tasks, 2025

Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, and Xiangyong Cao. Zoomearth: Active perception for ultra-high-resolution geospatial vision-language tasks, 2025

work page 2025
[14]

To see or to please: Uncovering visual sycophancy and split beliefs in vlms, 2026

Rui Hong and Shuxue Quan. To see or to please: Uncovering visual sycophancy and split beliefs in vlms, 2026. 10

work page 2026
[15]

Vision language models are biased

An V o, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026
[16]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025
[17]

Do images speak louder than words? investigating the effect of textual misinformation in VLMs

Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu, and Ray Mooney. Do images speak louder than words? investigating the effect of textual misinformation in VLMs. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Pape...

work page 2026
[18]

Context-vqa: Towards context-aware and purposeful visual question answering

Nandita Naik, Christopher Potts, and Elisa Kreiss. Context-vqa: Towards context-aware and purposeful visual question answering. In2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2813–2817. IEEE, 2023

work page 2023
[19]

CHOICE: Benchmarking the remote sensing capabilities of large vision-language models

Xiao An, Jiaxing Sun, Zihan Gui, and Wei He. CHOICE: Benchmarking the remote sensing capabilities of large vision-language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

work page 2025
[20]

Rshallu: Dual-mode hallucination evaluation for remote-sensing multimodal large language models with domain-tailored mitigation, 2026

Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song, Mingliang Zhou, and Weijia Jia. Rshallu: Dual-mode hallucination evaluation for remote-sensing multimodal large language models with domain-tailored mitigation, 2026

work page 2026
[21]

A benchmark for ultra-high-resolution remote sensing mllms, 2025

Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, and Yang Gao. A benchmark for ultra-high-resolution remote sensing mllms, 2025

work page 2025
[22]

Omniearth: A benchmark for evaluating vision-language models in geospatial tasks, 2026

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, and Bo Yang. Omniearth: A benchmark for evaluating vision-language models in geospatial tasks, 2026

work page 2026
[23]

Object detection in aerial images: A large-scale benchmark and challenges.IEEE transactions on pattern analysis and machine intelligence, 44(11):7778–7796, 2021

Jian Ding, Nan Xue, Gui-Song Xia, Xiang Bai, Wen Yang, Michael Ying Yang, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, et al. Object detection in aerial images: A large-scale benchmark and challenges.IEEE transactions on pattern analysis and machine intelligence, 44(11):7778–7796, 2021

work page 2021
[24]

Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020

Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020

work page 2020
[25]

Gpt-4o mini: advancing cost-efficient intelligence, 2024

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024

work page 2024
[26]

Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

work page 2025
[27]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

work page 2025
[28]

Claude opus 4 & claude sonnet 4 system card, 2025

Anthropic. Claude opus 4 & claude sonnet 4 system card, 2025

work page 2025
[29]

Evev2: Improved baselines for encoder-free vision-language models

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025. 11

work page 2025
[30]

Qwen2.5-vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

work page 2025
[31]

Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

work page 2023
[32]

V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuan-Jing Huang, and Zhongyu Wei. V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 37...

work page 2025
[33]

Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding, 2024

Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, and Yansheng Li. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding, 2024

work page 2024
[34]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shah- baz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

work page 2025
[35]

Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336, 2025

work page 2025
[36]

YiFan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, and Rong Jin. MME-realworld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[37]

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

work page 2024
[38]

Breen: bridge data-efficient encoder-free multimodal learning with learnable queries

Tianle Li, Yongming Rao, Winston Hu, and Yu Cheng. Breen: bridge data-efficient encoder-free multimodal learning with learnable queries. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5384–5395, 2026

work page 2026
[39]

Multimodal learning with next-token prediction for large multimodal models.Nature, pages 1–7, 2026

Xinlong Wang, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Xiaosong Zhang, Zhengx- iong Luo, Quan Sun, Zhen Li, Yuqi Wang, et al. Multimodal learning with next-token prediction for large multimodal models.Nature, pages 1–7, 2026

work page 2026
[40]

Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

work page 2017
[41]

Exploring models and data for remote sensing image caption generation.IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017

Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. Exploring models and data for remote sensing image caption generation.IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017

work page 2017
[42]

Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval

Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang, and Xian Sun. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 60:1–19, 2021

work page 2021
[43]

Deep semantic understanding of high resolution remote sensing image

Bo Qu, Xuelong Li, Dacheng Tao, and Xiaoqiang Lu. Deep semantic understanding of high resolution remote sensing image. In2016 International conference on computer, information and telecommunication systems (Cits), pages 1–5. IEEE, 2016. 12

work page 2016
[44]

Junyao Ge, Xu Zhang, Yang Zheng, Kaitai Guo, and Jimin Liang. Rsteller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models.ISPRS Journal of Photogrammetry and Remote Sensing, 226:146–163, 2025

work page 2025
[45]

Rsvqa: Visual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555– 8566, 2020

Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555– 8566, 2020

work page 2020
[46]

Floodnet: A high resolution aerial imagery dataset for post flood scene understanding, 2020

Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding, 2020

work page 2020
[47]

Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models, 2026

Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models, 2026

work page 2026
[48]

Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

work page 2017
[49]

Structural high-resolution satellite image indexing

Gui-Song Xia, Wen Yang, Julie Delon, Yann Gousseau, Hong Sun, and Henri Maître. Structural high-resolution satellite image indexing. InISPRS TC VII Symposium-100 Years ISPRS, volume 38, pages 298–303, 2010

work page 2010
[50]

Dota: A large-scale dataset for object detection in aerial images

Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983, 2018

work page 2018
[51]

Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection.IEEE Transactions on Geoscience and Remote Sensing, 57(8):5535–5548, 2019

work page 2019
[52]

Gong Cheng, Junwei Han, Peicheng Zhou, and Lei Guo. Multi-class geospatial object detection and geographic image classification based on collection of part detectors.ISPRS Journal of Photogrammetry and Remote Sensing, 98:119–132, 2014

work page 2014
[53]

Visdrone-det2021: The vision meets drone object detection challenge results

Yaru Cao, Zhijian He, Lujia Wang, Wenguan Wang, Yixuan Yuan, Dingwen Zhang, Jinglin Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. Visdrone-det2021: The vision meets drone object detection challenge results. InProceedings of the IEEE/CVF International confer- ence on computer vision, pages 2847–2854, 2021

work page 2021
[54]

Vrsbench: A versatile vision-language bench- mark dataset for remote sensing image understanding.Advances in Neural Information Pro- cessing Systems, 37:3229–3242, 2024

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language bench- mark dataset for remote sensing image understanding.Advances in Neural Information Pro- cessing Systems, 37:3229–3242, 2024

work page 2024
[55]

mirage effect

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, and Bo Yang. Omniearth: A benchmark for evaluating vision-language models in geospatial tasks.arXiv preprint arXiv:2603.09471, 2026. 13 A Detailed Experimental Setup A.1 Train Datasets Our training pipeline consists of three distinct stages, leveraging a comprehensive mixture of remo...

work page arXiv 2026

[1] [1]

Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

Xiang Li, Congcong Wen, Yuan Hu, Zhenghang Yuan, and Xiao Xiang Zhu. Vision-language models in remote sensing: Current progress and future trends.IEEE Geoscience and Remote Sensing Magazine, 12(2):32–66, 2024

work page 2024

[2] [2]

Foundation models in remote sensing: Evolving from unimodality to multimodality.IEEE Geoscience and Remote Sensing Magazine, 2026

Danfeng Hong, Chenyu Li, Xuyang Li, Gustau Camps-Valls, and Jocelyn Chanussot. Foundation models in remote sensing: Evolving from unimodality to multimodality.IEEE Geoscience and Remote Sensing Magazine, 2026

work page 2026

[3] [3]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[4] [4]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

work page 2023

[5] [5]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

work page 2025

[6] [6]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 27831–27840, 2024

work page 2024

[7] [7]

O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley

Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage: The illusion of visual understanding, 2026

work page 2026

[8] [8]

The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer

Weixian Lei, Jiacong Wang, Haochen Wang, Xiangtai Li, Jun Hao Liew, Jiashi Feng, and Zilong Huang. The scalability of simplicity: Empirical analysis of vision-language learning with a single transformer. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20758–20769, 2025

work page 2025

[9] [9]

Vhm: Versatile and honest vision language model for remote sensing image analysis

Chao Pang, Xingxing Weng, Jiang Wu, Jiayu Li, Yi Liu, Jiaxing Sun, Weijia Li, Shuai Wang, Litong Feng, Gui-Song Xia, et al. Vhm: Versatile and honest vision language model for remote sensing image analysis. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 6381–6388, 2025

work page 2025

[10] [10]

Geopixel: Pixel grounding large multimodal model in remote sensing

Akashah Shabbir, Mohammed Zumri, Mohammed Bennamoun, Fahad Shahbaz Khan, and Salman Khan. Geopixel: Pixel grounding large multimodal model in remote sensing. In International Conference on Machine Learning, pages 54095–54111. PMLR, 2025

work page 2025

[11] [11]

Geollava-8k: Scaling remote-sensing mul- timodal large language models to 8k resolution

Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Shan Boqi, Long Lan, Yulin Wang, et al. Geollava-8k: Scaling remote-sensing mul- timodal large language models to 8k resolution. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[12] [12]

When large vision-language model meets large remote sensing imagery: Coarse- to-fine text-guided token pruning

Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, and Yansheng Li. When large vision-language model meets large remote sensing imagery: Coarse- to-fine text-guided token pruning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 9206–9217, 2025

work page 2025

[13] [13]

Zoomearth: Active perception for ultra-high-resolution geospatial vision-language tasks, 2025

Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, and Xiangyong Cao. Zoomearth: Active perception for ultra-high-resolution geospatial vision-language tasks, 2025

work page 2025

[14] [14]

To see or to please: Uncovering visual sycophancy and split beliefs in vlms, 2026

Rui Hong and Shuxue Quan. To see or to please: Uncovering visual sycophancy and split beliefs in vlms, 2026. 10

work page 2026

[15] [15]

Vision language models are biased

An V o, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, and Daeyoung Kim. Vision language models are biased. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[16] [16]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qiang- long Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems, 43(2):1–55, 2025

work page 2025

[17] [17]

Do images speak louder than words? investigating the effect of textual misinformation in VLMs

Chi Zhang, Wenxuan Ding, Jiale Liu, Mingrui Wu, Qingyun Wu, and Ray Mooney. Do images speak louder than words? investigating the effect of textual misinformation in VLMs. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Pape...

work page 2026

[18] [18]

Context-vqa: Towards context-aware and purposeful visual question answering

Nandita Naik, Christopher Potts, and Elisa Kreiss. Context-vqa: Towards context-aware and purposeful visual question answering. In2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 2813–2817. IEEE, 2023

work page 2023

[19] [19]

CHOICE: Benchmarking the remote sensing capabilities of large vision-language models

Xiao An, Jiaxing Sun, Zihan Gui, and Wei He. CHOICE: Benchmarking the remote sensing capabilities of large vision-language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

work page 2025

[20] [20]

Rshallu: Dual-mode hallucination evaluation for remote-sensing multimodal large language models with domain-tailored mitigation, 2026

Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song, Mingliang Zhou, and Weijia Jia. Rshallu: Dual-mode hallucination evaluation for remote-sensing multimodal large language models with domain-tailored mitigation, 2026

work page 2026

[21] [21]

A benchmark for ultra-high-resolution remote sensing mllms, 2025

Yunkai Dang, Meiyi Zhu, Donghao Wang, Yizhuo Zhang, Jiacheng Yang, Qi Fan, Yuekun Yang, Wenbin Li, Feng Miao, and Yang Gao. A benchmark for ultra-high-resolution remote sensing mllms, 2025

work page 2025

[22] [22]

Omniearth: A benchmark for evaluating vision-language models in geospatial tasks, 2026

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, and Bo Yang. Omniearth: A benchmark for evaluating vision-language models in geospatial tasks, 2026

work page 2026

[23] [23]

Object detection in aerial images: A large-scale benchmark and challenges.IEEE transactions on pattern analysis and machine intelligence, 44(11):7778–7796, 2021

Jian Ding, Nan Xue, Gui-Song Xia, Xiang Bai, Wen Yang, Michael Ying Yang, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, et al. Object detection in aerial images: A large-scale benchmark and challenges.IEEE transactions on pattern analysis and machine intelligence, 44(11):7778–7796, 2021

work page 2021

[24] [24]

Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020

Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark.ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020

work page 2020

[25] [25]

Gpt-4o mini: advancing cost-efficient intelligence, 2024

OpenAI. Gpt-4o mini: advancing cost-efficient intelligence, 2024

work page 2024

[26] [26]

Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

Ailin Deng, Tri Cao, Zhirui Chen, and Bryan Hooi. Words or vision: Do vision-language models have blind faith in text? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 3867–3876, 2025

work page 2025

[27] [27]

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025

work page 2025

[28] [28]

Claude opus 4 & claude sonnet 4 system card, 2025

Anthropic. Claude opus 4 & claude sonnet 4 system card, 2025

work page 2025

[29] [29]

Evev2: Improved baselines for encoder-free vision-language models

Haiwen Diao, Xiaotong Li, Yufeng Cui, Yueze Wang, Haoge Deng, Ting Pan, Wenxuan Wang, Huchuan Lu, and Xinlong Wang. Evev2: Improved baselines for encoder-free vision-language models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21014–21025, 2025. 11

work page 2025

[30] [30]

Qwen2.5-vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025

work page 2025

[31] [31]

Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning, 2023

work page 2023

[32] [32]

V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuan-Jing Huang, and Zhongyu Wei. V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models. InProceed- ings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 37...

work page 2025

[33] [33]

Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding, 2024

Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, Bo Dang, Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, and Yansheng Li. Skysensegpt: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding, 2024

work page 2024

[34] [34]

Earthdial: Turning multi-sensory earth observations to interactive dialogues

Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shah- baz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

work page 2025

[35] [35]

Fengxiang Wang, Hongzhen Wang, Zonghao Guo, Di Wang, Yulin Wang, Mingshuo Chen, Qiang Ma, Long Lan, Wenjing Yang, Jing Zhang, et al. Xlrs-bench: Could your multimodal llms understand extremely large ultra-high-resolution remote sensing imagery? InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14325–14336, 2025

work page 2025

[36] [36]

YiFan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, and Rong Jin. MME-realworld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[37] [37]

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding, 2024

Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. Deepseek-vl2: Mixture-of-experts visio...

work page 2024

[38] [38]

Breen: bridge data-efficient encoder-free multimodal learning with learnable queries

Tianle Li, Yongming Rao, Winston Hu, and Yu Cheng. Breen: bridge data-efficient encoder-free multimodal learning with learnable queries. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5384–5395, 2026

work page 2026

[39] [39]

Multimodal learning with next-token prediction for large multimodal models.Nature, pages 1–7, 2026

Xinlong Wang, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Xiaosong Zhang, Zhengx- iong Luo, Quan Sun, Zhen Li, Yuqi Wang, et al. Multimodal learning with next-token prediction for large multimodal models.Nature, pages 1–7, 2026

work page 2026

[40] [40]

Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art.Proceedings of the IEEE, 105(10):1865–1883, 2017

work page 2017

[41] [41]

Exploring models and data for remote sensing image caption generation.IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017

Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xuelong Li. Exploring models and data for remote sensing image caption generation.IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017

work page 2017

[42] [42]

Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval

Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang, and Xian Sun. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval. IEEE Transactions on Geoscience and Remote Sensing, 60:1–19, 2021

work page 2021

[43] [43]

Deep semantic understanding of high resolution remote sensing image

Bo Qu, Xuelong Li, Dacheng Tao, and Xiaoqiang Lu. Deep semantic understanding of high resolution remote sensing image. In2016 International conference on computer, information and telecommunication systems (Cits), pages 1–5. IEEE, 2016. 12

work page 2016

[44] [44]

Junyao Ge, Xu Zhang, Yang Zheng, Kaitai Guo, and Jimin Liang. Rsteller: Scaling up visual language modeling in remote sensing with rich linguistic semantics from openly available data and large language models.ISPRS Journal of Photogrammetry and Remote Sensing, 226:146–163, 2025

work page 2025

[45] [45]

Rsvqa: Visual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555– 8566, 2020

Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. Rsvqa: Visual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555– 8566, 2020

work page 2020

[46] [46]

Floodnet: A high resolution aerial imagery dataset for post flood scene understanding, 2020

Maryam Rahnemoonfar, Tashnim Chowdhury, Argho Sarkar, Debvrat Varshney, Masoud Yari, and Robin Murphy. Floodnet: A high resolution aerial imagery dataset for post flood scene understanding, 2020

work page 2020

[47] [47]

Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models, 2026

Jiaqi Liu, Lang Sun, Ronghao Fu, and Bo Yang. Towards faithful reasoning in remote sensing: A perceptually-grounded geospatial chain-of-thought for vision-language models, 2026

work page 2026

[48] [48]

Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification.IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

work page 2017

[49] [49]

Structural high-resolution satellite image indexing

Gui-Song Xia, Wen Yang, Julie Delon, Yann Gousseau, Hong Sun, and Henri Maître. Structural high-resolution satellite image indexing. InISPRS TC VII Symposium-100 Years ISPRS, volume 38, pages 298–303, 2010

work page 2010

[50] [50]

Dota: A large-scale dataset for object detection in aerial images

Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983, 2018

work page 2018

[51] [51]

Yuanlin Zhang, Yuan Yuan, Yachuang Feng, and Xiaoqiang Lu. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection.IEEE Transactions on Geoscience and Remote Sensing, 57(8):5535–5548, 2019

work page 2019

[52] [52]

Gong Cheng, Junwei Han, Peicheng Zhou, and Lei Guo. Multi-class geospatial object detection and geographic image classification based on collection of part detectors.ISPRS Journal of Photogrammetry and Remote Sensing, 98:119–132, 2014

work page 2014

[53] [53]

Visdrone-det2021: The vision meets drone object detection challenge results

Yaru Cao, Zhijian He, Lujia Wang, Wenguan Wang, Yixuan Yuan, Dingwen Zhang, Jinglin Zhang, Pengfei Zhu, Luc Van Gool, Junwei Han, et al. Visdrone-det2021: The vision meets drone object detection challenge results. InProceedings of the IEEE/CVF International confer- ence on computer vision, pages 2847–2854, 2021

work page 2021

[54] [54]

Vrsbench: A versatile vision-language bench- mark dataset for remote sensing image understanding.Advances in Neural Information Pro- cessing Systems, 37:3229–3242, 2024

Xiang Li, Jian Ding, and Mohamed Elhoseiny. Vrsbench: A versatile vision-language bench- mark dataset for remote sensing image understanding.Advances in Neural Information Pro- cessing Systems, 37:3229–3242, 2024

work page 2024

[55] [55]

mirage effect

Ronghao Fu, Haoran Liu, Weijie Zhang, Zhiwen Lin, Xiao Yang, Peng Zhang, and Bo Yang. Omniearth: A benchmark for evaluating vision-language models in geospatial tasks.arXiv preprint arXiv:2603.09471, 2026. 13 A Detailed Experimental Setup A.1 Train Datasets Our training pipeline consists of three distinct stages, leveraging a comprehensive mixture of remo...

work page arXiv 2026