SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
Pith reviewed 2026-05-20 12:51 UTC · model grok-4.3
The pith
SkyNative feeds raw image patches directly into a language model for remote sensing reasoning to reduce reliance on language priors and retain spatial details.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By adopting an encoder-free design that represents images as raw patch tokens inside the language-model token space and reconciling them through a modality-aware decoupling mechanism with modality-specific parameters, SkyNative preserves fine-grained spatial evidence that pretrained visual encoders tend to compress away, producing stronger image-grounded perception and greater robustness to prompt-induced language priors across remote sensing understanding and large-format spatial reasoning tasks.
What carries the argument
The modality-aware decoupling mechanism that inserts modality-specific parameters into a single autoregressive backbone to align raw visual patch tokens with textual tokens without a separate visual encoder.
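One way to picture this mechanism is a layer that keeps a separate weight matrix per modality while all tokens remain in one sequence. The sketch below is only an illustration under that assumption: the class name `ModalityRoutedLinear`, the two-dimensional tokens, and the weight values are hypothetical and are not taken from the paper, which may place its modality-specific parameters differently.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


class ModalityRoutedLinear:
    """Hypothetical layer holding one weight matrix per modality."""

    def __init__(self, text_weight: Matrix, image_weight: Matrix) -> None:
        self.weights: Dict[str, Matrix] = {
            "text": text_weight,
            "image": image_weight,
        }

    def forward(
        self, tokens: Sequence[Sequence[float]], modalities: Sequence[str]
    ) -> List[Vector]:
        # Each token is transformed by the weights of its own modality,
        # but the output is still a single sequence for a shared backbone.
        return [matvec(self.weights[m], t) for t, m in zip(tokens, modalities)]


layer = ModalityRoutedLinear(
    text_weight=[[1.0, 0.0], [0.0, 1.0]],   # illustrative: identity for text
    image_weight=[[2.0, 0.0], [0.0, 2.0]],  # illustrative: scaling for patches
)
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
modalities = ["image", "image", "text"]
outputs = layer.forward(sequence, modalities)
print(outputs)
```

The point of the sketch is the routing: patch tokens and text tokens share positions in one sequence but never share this layer's weights.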
If this is right
- Remote sensing models could process ultra-high-resolution imagery without early loss of local object boundaries or textures.
- Reasoning outputs would shift more when the actual image content changes and less when surrounding text is altered.
- Training pipelines could skip separate large-scale visual pretraining stages for remote sensing data.
- Spatial reasoning tasks that require precise localization would become more reliable under varying prompt conditions.
Where Pith is reading between the lines
- The same raw-patch approach might reduce language bias in other high-resolution imagery domains such as medical imaging or aerial photography.
- End-to-end training from pixels to answers could shorten the current two-stage vision-language pipeline.
- Scaling the patch token count to even larger images would test whether the decoupling mechanism continues to prevent modality interference.
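To make the scaling question in the last point concrete, a rough token-count calculation helps. The patch size of 16 pixels and the image sizes below are illustrative assumptions, not values reported in the paper.

```python
import math


def patch_token_count(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patches needed to cover the image."""
    return math.ceil(height / patch) * math.ceil(width / patch)


# Token count grows with the square of the image side length.
for side in (1024, 4096, 8192):
    print(side, patch_token_count(side, side, patch=16))
```

Doubling the side length quadruples the sequence length, which is why very large scenes are a natural stress test for any patch-token design.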
Load-bearing premise
Raw image patches can be integrated directly into the language model token space while still retaining the fine spatial details that pretrained visual encoders normally lose.
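As a minimal sketch of what "raw patch tokens" could mean, the code below cuts a single-channel image into non-overlapping patches and flattens each one into a token. The function name `patchify`, the 4x4 image, and the patch size of 2 are hypothetical; the paper's actual tokenization, projection, and positional encoding are not specified here.

```python
from typing import List

Image = List[List[float]]  # single-channel image as rows of pixel values


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patches, one flat token each."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch row by row; every pixel value is kept.
            tokens.append(
                [image[top + dy][left + dx]
                 for dy in range(patch) for dx in range(patch)]
            )
    return tokens


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = patchify(image, patch=2)
print(tokens)
```

At this stage nothing is lost: the tokens are a rearrangement of the pixels. Whether detail survives the learned layers that follow is the open question the premise rests on.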
What would settle it
On the visual reliance benchmark, SkyNative accuracy would drop sharply with progressive image degradation yet remain stable under misleading textual prompts, while encoder-based models would show the opposite pattern.
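The pattern described above can be expressed as two accuracy drops measured against a clean baseline. This is a sketch under assumptions: the function names, the metric definitions, and the four-item answer lists are hypothetical and do not reproduce the paper's benchmark protocol or results.

```python
from typing import Dict, Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def reliance_profile(
    answers: Sequence[str],
    clean: Sequence[str],
    degraded: Sequence[str],
    misled: Sequence[str],
) -> Dict[str, float]:
    """Accuracy drops under image degradation and under misleading prompts."""
    base = accuracy(clean, answers)
    return {
        "clean_accuracy": base,
        "drop_under_degradation": base - accuracy(degraded, answers),
        "drop_under_misleading_prompt": base - accuracy(misled, answers),
    }


# Made-up predictions for illustration only.
answers = ["a", "b", "c", "d"]
profile = reliance_profile(
    answers,
    clean=["a", "b", "c", "d"],
    degraded=["a", "x", "x", "x"],  # image evidence removed
    misled=["a", "b", "c", "x"],    # prompt pushes toward a wrong answer
)
print(profile)
```

A model that grounds its answers in the image would show a large first drop and a small second one; a model leaning on language priors would show the reverse.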
Original abstract
Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SkyNative, an encoder-free multimodal framework for remote sensing visual evidence reasoning. It feeds raw image patches directly as tokens into a unified autoregressive language-model backbone and reconciles them with textual tokens solely through a modality-aware decoupling mechanism that employs modality-specific parameters. The authors introduce a visual reliance benchmark that applies progressive visual degradation and misleading textual prompts to test whether model outputs are grounded in image evidence rather than language priors. Experiments on standard remote sensing understanding tasks and large-format spatial reasoning evaluations are reported to show stronger image-grounded perception and greater robustness to prompt-induced priors than encoder-based baselines.
Significance. If the central claims are substantiated, the work would be significant for remote sensing vision-language modeling: it offers a concrete alternative to the dominant pretrained-encoder pipeline and supplies a diagnostic benchmark for visual grounding. Demonstrating that native patch tokens can retain fine-grained spatial evidence without encoder compression would be a useful data point for high-resolution imagery applications. The benchmark itself could become a reusable tool for the community.
major comments (2)
- [§3.2] §3.2 (Modality-aware decoupling): The claim that raw patch tokens are reconciled 'without the compression losses of pretrained visual encoders' is load-bearing for the central thesis, yet the section provides no quantitative measure (e.g., mutual information, reconstruction PSNR, or token-level entropy) of information retained after the learned projection, positional encoding, and modality-specific parameter layers. Any such reconciliation necessarily introduces a learned transformation whose bottleneck is unquantified.
- [§5.1] §5.1 and Table 3 (Visual reliance benchmark results): The reported robustness gains under misleading prompts are presented without ablation isolating the contribution of the native patch representation from the modality-specific parameters or from training data differences. Without these controls it is unclear whether the observed improvements are attributable to the encoder-free design or to other modeling choices.
minor comments (2)
- [Figure 4] Figure 4 caption and axis labels use inconsistent terminology ('native tokens' vs. 'raw patches'); standardize notation across text and figures.
- [§3.1] The description of the autoregressive backbone in §3.1 does not specify whether the modality-specific parameters are frozen or jointly optimized; clarify the training protocol.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of our encoder-free approach. We respond to each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: [§3.2] §3.2 (Modality-aware decoupling): The claim that raw patch tokens are reconciled 'without the compression losses of pretrained visual encoders' is load-bearing for the central thesis, yet the section provides no quantitative measure (e.g., mutual information, reconstruction PSNR, or token-level entropy) of information retained after the learned projection, positional encoding, and modality-specific parameter layers. Any such reconciliation necessarily introduces a learned transformation whose bottleneck is unquantified.
  Authors: We agree that a quantitative characterization of retained information would strengthen the central claim. The modality-aware decoupling does apply learned projections and modality-specific parameters to align raw patches with the language-model token space. In the revised manuscript we will add an analysis in §3.2 that reports token-level entropy before and after the projection layers and compares reconstruction fidelity (via a lightweight decoder) against a standard pretrained visual encoder on the same remote-sensing patches.
  Revision: yes
- Referee: [§5.1] §5.1 and Table 3 (Visual reliance benchmark results): The reported robustness gains under misleading prompts are presented without ablation isolating the contribution of the native patch representation from the modality-specific parameters or from training data differences. Without these controls it is unclear whether the observed improvements are attributable to the encoder-free design or to other modeling choices.
  Authors: We concur that isolating the source of the robustness gains is necessary. The native patch representation and the modality-specific parameters are tightly coupled in the encoder-free design; however, we will add an ablation in the revised §5.1 that trains a controlled variant using the same backbone and data but with modality-specific parameters disabled (replaced by shared parameters). We will also clarify that all compared models were trained on the same remote-sensing corpora and report the effect of this control on the visual-reliance scores.
  Revision: yes
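The retained-information measures named in the first exchange can be computed with a few lines of code. The sketch below shows reconstruction PSNR and a Shannon entropy over discrete symbols. Treating "token-level entropy" as the entropy of quantized token values is an assumption made here for illustration, and the sample values are invented; the paper does not define the measure in the material above.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def psnr(
    original: Sequence[float],
    reconstructed: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length signals."""
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: no reconstruction error
    return 10.0 * math.log10(max_value ** 2 / mse)


def shannon_entropy_bits(symbols: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy, in bits, of a sequence of discrete symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Invented pixel values for one patch and a slightly lossy reconstruction.
patch = [10.0, 20.0, 30.0, 40.0]
lossy = [12.0, 18.0, 30.0, 40.0]
print(psnr(patch, lossy))
print(shannon_entropy_bits([0, 0, 1, 1]))
```

Comparing such numbers before and after the projection layers, and against an encoder baseline on the same patches, is the kind of evidence the referee is asking for.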
Circularity Check
No significant circularity in derivation chain
The provided abstract and context describe an encoder-free architecture with a modality-aware decoupling mechanism and a new visual reliance benchmark. Performance claims are tied to evaluations on standard remote sensing tasks and large-format spatial reasoning, without any equations, fitted parameters renamed as predictions, or self-citations that reduce the central results to inputs by construction. The derivation remains self-contained against external benchmarks, consistent with a normal non-circular finding.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "SkyNative adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. ... modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.