pith · machine review for the scientific record

arxiv: 2604.14044 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords remote sensing change detection · multimodal large language models · change understanding · temporal reasoning · Delta-QA benchmark · satellite imagery · visual question answering · change segmentation

The pith

Delta-LLaVA adds three change-aware modules to a multimodal large language model so it can compare satellite images across time and explain the changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to fix the inability of current multimodal large language models to handle differences between images taken at different times in remote sensing. It releases Delta-QA, a dataset of 180,000 question-answer pairs that combines pixel-level segmentation with visual questions for two or three time points and organizes the questions into four increasing levels of reasoning difficulty. The proposed Delta-LLaVA model inserts a Change-Enhanced Attention module to highlight differences, a Change-SEG module that feeds prior-based difference features into the language model, and Local Causal Attention to stop information from one time leaking into another. A sympathetic reader would care because successful models of this kind could turn raw satellite sequences into automatic, human-readable accounts of events such as flooding, construction, or deforestation.
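To make the first of those modules concrete: a minimal sketch of difference-amplifying attention, assuming standard patch-token features from a vision encoder. The class name, the gating design, and every dimension below are our illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ChangeEnhancedAttention(nn.Module):
    """Hypothetical difference amplifier: cross-attend t2 tokens to t1,
    treat what t2 cannot explain from t1 as candidate change, and boost it."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, f_t1: torch.Tensor, f_t2: torch.Tensor) -> torch.Tensor:
        # f_t1, f_t2: (batch, tokens, dim) visual features for the two dates
        reconstructed, _ = self.cross(query=f_t2, key=f_t1, value=f_t1)
        diff = f_t2 - reconstructed        # residual the earlier image cannot explain
        gain = self.gate(diff)             # per-token change saliency in (0, 1)
        return f_t2 + gain * diff          # amplify tokens where change is likely

feats_t1, feats_t2 = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
print(ChangeEnhancedAttention(768)(feats_t1, feats_t2).shape)  # torch.Size([2, 196, 768])
```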

Core claim

Delta-LLaVA is a multimodal large language model for multi-temporal remote sensing that uses Change-Enhanced Attention to isolate and amplify visual differences, Change-SEG with Change Prior Embedding to produce differentiable change features for the language model, and Local Causal Attention to block cross-temporal leakage, resulting in better performance than both generalist MLLMs and specialized segmentation models on complex change deduction and high-precision boundary localization.
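Of the three, Local Causal Attention is the easiest to picture as a mask. A minimal sketch, assuming tokens are grouped contiguously by timestamp and that later timestamps may still read earlier ones; the paper's exact masking rule is not reproduced on this page.

```python
import torch

def local_causal_mask(tokens_per_time: list[int]) -> torch.Tensor:
    """Additive attention mask over timestamp-grouped tokens: each block
    attends to itself and to earlier blocks, never to later ones."""
    total = sum(tokens_per_time)
    mask = torch.full((total, total), float("-inf"))
    start = 0
    for n in tokens_per_time:
        end = start + n
        mask[start:end, :end] = 0.0   # this block sees itself and the past
        start = end
    return mask

# Two timestamps, 3 tokens each: the top-right 3x3 block stays -inf,
# so t1 tokens can never read information from t2.
print(local_causal_mask([3, 3]))
```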

What carries the argument

Delta-LLaVA framework whose three core modules—Change-Enhanced Attention, Change-SEG with Change Prior Embedding, and Local Causal Attention—supply the missing mechanisms for multi-temporal contrastive reasoning and spatial grounding.
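Change-SEG with Change Prior Embedding is the least self-explanatory of the three. One plausible reading, with hypothetical names and dimensions: a small set of learned prior queries pools the raw feature difference into a few change tokens projected into the LLM's embedding space. The paper's module also drives pixel-level segmentation, which this sketch omits.

```python
import torch
import torch.nn as nn

class ChangePriorProjector(nn.Module):
    """Hypothetical bridge: learned 'change prior' queries attend over the
    feature difference and emit LLM-ready change tokens."""
    def __init__(self, vis_dim: int, llm_dim: int, n_tokens: int = 8):
        super().__init__()
        self.prior = nn.Parameter(torch.randn(n_tokens, vis_dim))  # change prior embedding
        self.attn = nn.MultiheadAttention(vis_dim, 8, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, f_t1: torch.Tensor, f_t2: torch.Tensor) -> torch.Tensor:
        diff = f_t2 - f_t1                                   # (batch, tokens, vis_dim)
        queries = self.prior.unsqueeze(0).expand(diff.size(0), -1, -1)
        pooled, _ = self.attn(queries, diff, diff)           # priors query the difference map
        return self.proj(pooled)                             # (batch, n_tokens, llm_dim)

projector = ChangePriorProjector(vis_dim=768, llm_dim=4096)
print(projector(torch.randn(2, 196, 768), torch.randn(2, 196, 768)).shape)  # (2, 8, 4096)
```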

If this is right

  • A single model can now perform both pixel-level change segmentation and visual question answering across bi-temporal and tri-temporal remote sensing scenes.
  • Complex change deduction and high-precision boundary localization become feasible within the same multimodal language framework.
  • The four-level cognitive structure of the Delta-QA benchmark supplies a progressive testbed for measuring temporal reasoning depth (a possible sample layout is sketched after this list).
  • Earth-observation pipelines can move from separate detection and interpretation stages toward a unified language-based interface.
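The sample layout referenced in the third bullet might look like the record below. The field names and the four level labels are placeholders of ours; the paper asserts four progressive cognitive dimensions but this page does not name them.

```python
from dataclasses import dataclass

# Assumed level labels, ordered easy-to-hard; not the paper's taxonomy.
LEVELS = ("detection", "localization", "description", "deduction")

@dataclass
class DeltaQASample:
    image_paths: list[str]               # 2 or 3 co-registered acquisitions
    timestamps: list[str]                # ISO dates, one per image
    question: str
    answer: str
    level: str                           # one of LEVELS
    change_mask_path: str | None = None  # pixel-level segmentation target, if any

sample = DeltaQASample(
    image_paths=["t1.tif", "t2.tif"],
    timestamps=["2020-05-01", "2023-05-01"],
    question="What new structures appear in the north-east quadrant?",
    answer="Two buildings and a paved access road were added.",
    level="deduction",
)
print(sample.level in LEVELS)  # True
```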

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same modular pattern for injecting temporal contrast could be tried on non-satellite temporal image tasks such as medical follow-up scans or video event description.
  • If the attention modules prove portable, they offer a lightweight way to retrofit existing MLLMs for any domain where ordered image pairs must be compared.
  • The progressive cognitive dimensions in Delta-QA suggest a general template for building benchmarks that separate low-level detection from higher-level causal explanation.

Load-bearing premise

The three added modules create genuine multi-temporal contrastive reasoning and spatial grounding instead of simply fitting the particular patterns in the new Delta-QA benchmark.

What would settle it

Testing Delta-LLaVA on an independent remote sensing change dataset whose image statistics and question distribution differ from Delta-QA and finding no measurable gain over standard MLLMs or segmentation models in change deduction accuracy would falsify the central claim.
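In practice that test reduces to a paired comparison on held-out questions. A sketch under assumptions: per-question 0/1 correctness for Delta-LLaVA and a baseline on the same external benchmark, with bootstrap resampling to see whether any gap is measurable; the function and its defaults are ours, not a protocol from the paper.

```python
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_boot: int = 10_000, seed: int = 0):
    """Resample the test set and report how often model A beats model B,
    plus the mean accuracy gap. correct_a/correct_b: 0/1 per question."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    idx = rng.integers(0, len(a), size=(n_boot, len(a)))
    deltas = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    return (deltas > 0).mean(), deltas.mean()

# 'No measurable gain' would look like a win rate near 0.5 and a gap near 0.
win_rate, gap = paired_bootstrap([1, 1, 0, 1, 0, 1], [1, 0, 0, 1, 0, 0])
print(win_rate, gap)
```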

Figures

Figures reproduced from arXiv: 2604.14044 by Haohua Wu, Hong Wang, Jiahao Li, Kaixin Zhang, Leilei Lin, Xiaohe Li, Yuqiang Fang, Zide Fan.

Figure 1. Our work primarily addresses the unified task of multi-temporal remote sensing change detection and understanding.
Figure 2. An overview of the data annotation framework for the Delta-QA dataset.
Figure 7. Land cover transitions and evolution trends in Delta.
Figure 5. QA text length distribution in Delta-SECOND.
Figure 8. Overall architecture of Delta-LLaVA. The VCP module explicitly amplifies differences in imagery features.
Figure 9. Performance of change segmentation on Delta-WUSU: Delta-LLaVA reaches 70.39% accuracy, against 18.84–41.71% for Qwen2.5-VL 7B, Qwen3-VL 8B, InternVL2.5 8B, InternVL3 8B, GPT-4o, GPT-4.1, and Gemini 2.5 Flash.
Figure 12. Qualitative comparison of Delta-LLaVA and other models for bi-temporal change understanding.
Original abstract

While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental "temporal blindness". Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Delta-QA, a benchmark of 180k VQA samples unifying pixel-level segmentation and visual question answering for bi- and tri-temporal remote sensing change interpretation across four cognitive dimensions. It proposes Delta-LLaVA, an MLLM with three modules—Change-Enhanced Attention, Change-SEG using Change Prior Embedding, and Local Causal Attention—to overcome temporal blindness in existing MLLMs. The central claim is that Delta-LLaVA decisively outperforms generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization.

Significance. If the performance gains are shown to arise from the modules' intrinsic multi-temporal contrastive reasoning and spatial grounding rather than specialization to Delta-QA, the work would provide a valuable unified framework for remote sensing change detection and understanding. The scale and structure of the new benchmark represent a concrete contribution to the field, though its broader utility hinges on external validation.

major comments (2)
  1. [Abstract] The claim of 'decisive outperformance' lacks supporting detail on exact metrics, baseline implementations, statistical significance, and controls for data leakage between the training and test splits of Delta-QA; without these in the experimental section, the central claim cannot be evaluated (a reference sketch of the key metric follows these comments).
  2. [Method and Experiments] The necessity of Change-Enhanced Attention, Change-SEG with Change Prior Embedding, and Local Causal Attention for genuine multi-temporal reasoning is not secured by module-isolating ablations or by results on external standard change-detection datasets (e.g., LEVIR-CD); outperformance of generalist MLLMs may reflect fitting to the new benchmark's characteristics rather than architectural necessity.
minor comments (2)
  1. [Introduction] The description of 'temporal blindness' would benefit from explicit citations to prior multi-temporal MLLM limitations.
  2. [Benchmark] Clarify whether Delta-QA will be publicly released with code and splits to enable reproducibility.
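For reference against major comment 1: the headline metric behind a boundary-localization claim is usually mean IoU over classes, and its standard form is short enough to show. This is our illustration of the metric, not code from the paper.

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union across classes for a segmentation map;
    classes absent from both prediction and ground truth are skipped."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 1], [1, 1]])
gt   = np.array([[0, 1], [0, 1]])
print(miou(pred, gt, n_classes=2))  # 0.5 * (1/2 + 2/3) ≈ 0.583
```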

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The claim of 'decisive outperformance' lacks supporting detail on exact metrics, baseline implementations, statistical significance, and controls for data leakage between the training and test splits of Delta-QA; without these in the experimental section, the central claim cannot be evaluated.

    Authors: The abstract is intentionally concise as a high-level overview. The experimental section already reports concrete metrics (e.g., mIoU, accuracy, F1) across all tasks, lists the exact baseline implementations (including versions of generalist MLLMs and segmentation models), and describes the Delta-QA construction process. However, we acknowledge that explicit statistical significance tests and a dedicated paragraph on leakage controls are not currently present. In the revision we will add these elements, including p-values for key comparisons and a clear description of the temporal and spatial split strategy used to avoid leakage. revision: partial

  2. Referee: [Method and Experiments] The necessity of Change-Enhanced Attention, Change-SEG with Change Prior Embedding, and Local Causal Attention for genuine multi-temporal reasoning is not secured by module-isolating ablations or by results on external standard change-detection datasets (e.g., LEVIR-CD); outperformance of generalist MLLMs may reflect fitting to the new benchmark's characteristics rather than architectural necessity.

    Authors: Module-isolating ablations are already reported in the experiments section and show consistent gains when each component is added individually. To further demonstrate that the improvements stem from the proposed multi-temporal mechanisms rather than benchmark specialization, we will add quantitative results on the external LEVIR-CD dataset in the revised manuscript, using the same evaluation protocol as the original change-detection literature. This will provide direct evidence of generalization beyond Delta-QA. revision: partial
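Schematically, the module-isolating ablations promised above amount to an on/off grid over the three modules, each configuration scored on an external dataset. `build_model` and `evaluate` below are stubs we invented; only the module names and the LEVIR-CD target come from the exchange above.

```python
import itertools

MODULES = ("change_enhanced_attention", "change_seg", "local_causal_attention")

def build_model(**flags):
    # Stub: a real run would assemble Delta-LLaVA with only the enabled modules.
    return flags

def evaluate(model, dataset: str) -> float:
    # Stub score standing in for accuracy on an external benchmark.
    return sum(model.values()) / len(model)

for flags in itertools.product([False, True], repeat=len(MODULES)):
    config = dict(zip(MODULES, flags))
    print(config, round(evaluate(build_model(**config), "LEVIR-CD"), 2))
```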

Circularity Check

0 steps flagged

No circularity: empirical model with external baselines and no derivations or self-referential equations

full rationale

The paper introduces Delta-QA benchmark and Delta-LLaVA architecture with three modules, then reports empirical outperformance versus external generalist MLLMs and segmentation models. No equations, derivations, or parameter-fitting steps are present that reduce any claimed result to quantities defined by the paper's own inputs. The central claim rests on benchmark comparisons rather than any self-definitional, fitted-prediction, or self-citation chain. This is the standard honest outcome for an applied empirical work that evaluates against independent external systems.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on the effectiveness of three newly introduced modules and the representativeness of the new benchmark; no free parameters are explicitly fitted in the abstract description.

axioms (2)
  • domain assumption · Existing MLLM architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning
    Stated as the fundamental limitation motivating the work
  • domain assumption · The proposed modules can be integrated into an MLLM without introducing new failure modes or requiring domain-specific retraining beyond standard fine-tuning
    Implicit in the claim that the three innovations overcome temporal blindness
invented entities (3)
  • Change-Enhanced Attention module · no independent evidence
    purpose: Systematically isolates and amplifies visual differences across time
    New component proposed to address temporal blindness
  • Change-SEG module · no independent evidence
    purpose: Uses Change Prior Embedding to extract differentiable difference features for the LLM
    New module bridging visual differences to language model input
  • Local Causal Attention · no independent evidence
    purpose: Prevents cross-temporal contextual leakage
    New attention mechanism to maintain temporal separation

pith-pipeline@v0.9.0 · 5528 in / 1558 out tokens · 58958 ms · 2026-05-10T12:58:46.003255+00:00 · methodology

