pith. machine review for the scientific record.

arxiv: 2605.07178 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

Masks Can Talk: Extracting Structured Text Information from Single-Modal Images for Remote Sensing Change Detection

Fupeng Wei, Hang-Cheng Dong, Jiatong Pan, Kai Zheng, Wei Zhang, Zhenkai Wu

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:20 UTC · model grok-4.3

classification 💻 cs.CV
keywords remote sensing change detection · semantic quadruples · mask-derived text · multimodal supervision · bi-directional contrastive loss · multi-class change detection · Gaza-Change-v2 · structured text extraction

The pith

Ground-truth change masks can be automatically transcribed into precise semantic text to improve remote sensing change detection accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard ground-truth masks for remote sensing change detection already encode detailed semantics about each change region, including its location, the land-cover types before and after, the type of transition, and the number of objects involved. These details are extracted as semantic quadruples and rendered as fixed-template text descriptions, supplying exact multimodal supervision at zero added annotation cost. A two-stage training process first adapts the visual encoder to remote sensing imagery and then aligns the visual features with the structured text embeddings using bi-directional contrastive loss. This yields stronger results than both traditional unimodal methods and multimodal approaches that rely on large language models, as measured on the new Gaza-Change-v2 multi-class change detection dataset. Readers would care because the approach turns an overlooked property of existing labels into a clean, dense training signal for applications such as urban monitoring and disaster assessment.

Core claim

Change detection masks implicitly contain fine-grained semantics that can be transcribed into semantic quadruples specifying where the change occurred, what land-cover types were involved before and after, how the transition happened, and how many objects were affected. These quadruples are converted into multiple fixed-template text descriptions to serve as precise multimodal supervision. A two-stage training strategy first fine-tunes the visual encoder on remote sensing imagery for domain adaptation, then introduces a multimodal decoder trained with bi-directional contrastive loss to deeply align visual features with the structured textual embeddings. This leads to state-of-the-art scores: a Sek of 17.80% and an F_scd of 66.14% on the Gaza-Change-v2 dataset, surpassing multimodal methods that use large language models.

What carries the argument

Semantic quadruples (where, what, how, how many) automatically derived from ground-truth change masks and rendered as fixed-template text descriptions for bi-directional contrastive alignment with visual features
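
As a concrete illustration of this mechanism, here is a minimal sketch of rendering one such quadruple into fixed-template sentences. The dataclass fields and the three templates are assumptions for illustration; the paper's actual templates are not reproduced here.

```python
# Illustrative sketch: rendering a semantic quadruple into fixed-template
# text descriptions. The field names and sentence templates are assumed
# for illustration; the paper's exact templates are not shown here.
from dataclasses import dataclass

@dataclass
class Quadruple:
    where: str      # e.g., a coarse location such as "upper-left region"
    what: tuple     # (before_class, after_class), e.g., ("vegetation", "built-up")
    how: str        # transition type, e.g., "vegetation-to-built-up"
    how_many: int   # number of connected change objects of this transition

TEMPLATES = [
    "In the {where}, {how_many} region(s) changed from {before} to {after}.",
    "A {how} transition occurred in the {where}.",
    "{how_many} {before} object(s) in the {where} became {after}.",
]

def render(q: Quadruple) -> list[str]:
    before, after = q.what
    return [t.format(where=q.where, how=q.how, how_many=q.how_many,
                     before=before, after=after) for t in TEMPLATES]

# Example: one change region transcribed into three fixed-template sentences.
print(render(Quadruple("upper-left region", ("vegetation", "built-up"),
                       "vegetation-to-built-up", 2)))
```

Because the templates are fixed, every description is deterministic given the mask, which is what underwrites the "noise-free" framing of the supervision.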

If this is right

  • Provides noise-free, dense text supervision directly from standard dataset labels without extra human annotation.
  • The two-stage process separates domain adaptation of the visual encoder from subsequent multimodal alignment.
  • Achieves Sek of 17.80% and F_scd of 66.14% on the Gaza-Change-v2 dataset, exceeding multimodal methods that use large language models.
  • Single-modal imagery benefits from structured semantic text extracted from masks, raising change detection accuracy for tasks like urban monitoring.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The mask-to-text transcription technique could extend to other segmentation domains such as medical imaging where ground-truth masks similarly encode semantic attributes.
  • By avoiding LLM-generated text, the method may reduce both computational overhead and the risk of hallucinated descriptions in remote sensing pipelines.
  • Structured quadruples might support improved cross-sensor generalization or few-shot adaptation in future change detection systems.

Load-bearing premise

The automatically transcribed semantic quadruples from the masks must be precise, dense, and free of noise so that the two-stage training produces alignment superior to prior multimodal supervision.

What would settle it

Evaluating S2M on a second multi-class remote sensing change detection dataset: the claim would fail if its Sek and F_scd scores fell below those of multimodal methods that use large language models, or if manual review found systematic errors in the generated quadruples.

Figures

Figures reproduced from arXiv: 2605.07178 by Fupeng Wei, Hang-Cheng Dong, Jiatong Pan, Kai Zheng, Wei Zhang, Zhenkai Wu.

Figure 1. Flowchart of the proposed S2M method. Textual features are generated from conventional …
Figure 2. Visualization of the constructed multimodal change detection dataset. The top two rows …
Figure 3. The proposed model training pipeline. It is worth mentioning that the introduced text …
Figure 4. Example results on the Gaza-Change-v2 dataset.
Figure 5. Qualitative comparison on the Gaza-Change-v2 dataset. Color coding: white for true …
read the original abstract

Remote sensing change detection is pivotal for urban monitoring, disaster assessment, and environmental resource management. Yet, unimodal deep learning methods frequently confuse genuine semantic changes with visually similar but irrelevant variations. Recent multimodal approaches incorporate text as auxiliary supervision, but their descriptions are either semantically coarse and unstructured or model-generated and thus noisy. Critically, all of them overlook a simple fact: fine-grained change semantics are already implicitly encoded in the ground-truth mask labels that come standard with every change detection dataset. These masks know where the change happened, what the land-cover types were before and after, how the transition occurred, and how many objects were involved. In this paper, we propose S2M, a framework that obtains structured textual features directly from change labels at zero additional annotation cost. Specifically, each change region is automatically transcribed into a semantic quadruple (where, what, how, how many) and converted into several fixed-template text descriptions, providing precise, dense, and noise-free multimodal supervision. We adopt a two-stage training strategy, first fine-tuning on remote sensing imagery for robust domain-specific representation, after which a multimodal decoder with a bi-directional contrastive loss is introduced to achieve deep alignment between visual features and structured textual embeddings. To validate our method, we construct Gaza-Change-v2, a new multi-class change detection (MCD) dataset about the Gaza Strip. On this MCD dataset, S2M achieves a Sek of 17.80% and an F_scd of 66.14%, notably surpassing even multimodal methods that leverage large language models. Our work demonstrates that masks can indeed talk. They tell us exactly what, where, how, and how many changes have occurred.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes S2M, a framework for remote sensing change detection that extracts structured textual descriptions directly from ground-truth change masks by automatically transcribing each change region into semantic quadruples (where, what, how, how many) and converting them into fixed-template text descriptions. It employs a two-stage training strategy—first fine-tuning a visual encoder on remote sensing imagery, then introducing a multimodal decoder with bi-directional contrastive loss for deep alignment between visual features and the structured textual embeddings. The method is evaluated on a newly constructed multi-class change detection dataset, Gaza-Change-v2, where it reports Sek of 17.80% and F_scd of 66.14%, outperforming even multimodal approaches that use large language models, all at zero additional annotation cost.

Significance. If the mask-derived quadruples prove precise and the alignment effective, this work could meaningfully advance change detection by demonstrating that existing mask labels already encode fine-grained semantics sufficient for high-quality multimodal supervision, reducing dependence on noisy LLM-generated texts. The zero-cost aspect and the new Gaza-Change-v2 benchmark are practical contributions that could encourage similar exploitation of label information in other vision tasks.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): The central claim that the transcribed semantic quadruples are 'precise, dense, and noise-free' and superior to LLM supervision rests on the automatic extraction of 'how' (transition type) and 'how many' (object counts). Standard multi-class change masks encode region labels and binary change indicators but do not inherently contain transition semantics or counts; the paper must explicitly detail the heuristics, connected-component rules, or class-mapping procedures used for these fields, as any systematic error would inject noise into the contrastive targets and undermine the reported gains.
  2. [§4] §4 (Experiments): The performance numbers (Sek 17.80%, F_scd 66.14%) and the claim of surpassing multimodal LLM baselines are presented without experimental details, specific baseline implementations, ablation studies on the two-stage training or bi-directional contrastive loss, error bars, or cross-validation. These omissions make it impossible to verify whether the gains are attributable to the mask-derived supervision or to other factors, which is load-bearing for the superiority claim.
minor comments (2)
  1. [Abstract] The notation F_scd should be defined explicitly (e.g., whether it is the standard semantic change detection F-score or a custom metric) and compared to prior literature; a hedged sketch of the usual definitions follows after this list.
  2. [§4] Ensure the Gaza-Change-v2 dataset construction details (class definitions, annotation protocol, train/val/test splits) are fully documented so that the benchmark can be reproduced and extended by others.
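
For reference, the definitions most commonly used in the semantic change detection literature (the SECOND-benchmark convention) are sketched below; whether the paper follows them is exactly what the first minor comment asks the authors to state.

```latex
% Common SCD metric definitions (SECOND-benchmark convention); a hedged
% sketch, since the paper may define Sek and F_scd differently.
% Q = (q_{ij}) is the confusion matrix over semantic-change categories,
% with index 1 the no-change class.
\hat{Q} = Q \ \text{with}\ q_{11} := 0, \qquad
\hat{\kappa} = \frac{p_o(\hat{Q}) - p_e(\hat{Q})}{1 - p_e(\hat{Q})}, \qquad
\mathrm{SeK} = e^{\,\mathrm{IoU}_2 - 1} \cdot \hat{\kappa}, \qquad
F_{\mathrm{scd}} = \frac{2\, P_{\mathrm{scd}}\, R_{\mathrm{scd}}}{P_{\mathrm{scd}} + R_{\mathrm{scd}}}
```

Here IoU_2 is the IoU of changed pixels, and P_scd, R_scd are precision and recall over pixels that are both detected as changed and assigned the correct semantic label.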

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight areas where additional clarity and experimental rigor will strengthen the manuscript. We address each major comment point-by-point below and will revise the paper to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that the transcribed semantic quadruples are 'precise, dense, and noise-free' and superior to LLM supervision rests on the automatic extraction of 'how' (transition type) and 'how many' (object counts). Standard multi-class change masks encode region labels and binary change indicators but do not inherently contain transition semantics or counts; the paper must explicitly detail the heuristics, connected-component rules, or class-mapping procedures used for these fields, as any systematic error would inject noise into the contrastive targets and undermine the reported gains.

    Authors: We agree that explicit procedural details are essential for reproducibility and to substantiate the noise-free claim. In the Gaza-Change-v2 dataset construction (detailed in §4), each multi-class change mask encodes before/after land-cover labels per pixel. 'Where' is the bounding region of each connected component; 'what' maps directly to the before and after class labels; 'how' is the deterministic transition type obtained by pairing the before/after classes (e.g., vegetation-to-built-up); 'how many' is obtained by counting distinct connected components per transition type using standard 8-connectivity rules on the binary mask for that class pair. These steps are fully automatic and deterministic from the ground-truth annotations, introducing no external noise. We will add a dedicated paragraph and pseudocode in §3 describing these heuristics, class mappings, and connected-component rules; a sketch of this extraction appears after these responses. revision: yes

  2. Referee: [§4] §4 (Experiments): The performance numbers (Sek 17.80%, F_scd 66.14%) and the claim of surpassing multimodal LLM baselines are presented without experimental details, specific baseline implementations, ablation studies on the two-stage training or bi-directional contrastive loss, error bars, or cross-validation. These omissions make it impossible to verify whether the gains are attributable to the mask-derived supervision or to other factors, which is load-bearing for the superiority claim.

    Authors: We acknowledge the need for greater experimental transparency. In the revised §4 we will: (i) provide full implementation details and hyperparameters for all baselines, including exact prompting and adaptation steps for the LLM-based multimodal methods; (ii) report ablation tables quantifying the contribution of the two-stage training and bi-directional contrastive loss; (iii) include error bars from five independent runs with different random seeds; and (iv) add k-fold cross-validation results on Gaza-Change-v2. These additions will allow readers to attribute performance gains specifically to the mask-derived supervision; a sketch of the seed-aggregation step appears after these responses. revision: yes
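
To make the extraction procedure from response 1 concrete, here is a minimal sketch under the stated assumptions (per-pixel before/after integer class maps, 8-connectivity). Names are illustrative, and scipy's ndimage.label/find_objects stand in for whatever implementation the authors use.

```python
# Sketch of the deterministic extraction described in the rebuttal to major
# comment 1, assuming the masks are per-pixel integer class maps for the
# "before" and "after" dates. Function and variable names are illustrative.
import numpy as np
from scipy import ndimage

EIGHT_CONN = np.ones((3, 3), dtype=int)  # 8-connectivity structuring element

def extract_quadruples(before: np.ndarray, after: np.ndarray,
                       class_names: dict[int, str]) -> list[dict]:
    quads = []
    changed = before != after
    # Every (before, after) class pair that actually occurs in the scene.
    pairs = {(int(b), int(a)) for b, a in zip(before[changed], after[changed])}
    for b, a in sorted(pairs):
        pair_mask = (before == b) & (after == a)
        # "how many": count 8-connected components of this transition type.
        labels, n = ndimage.label(pair_mask, structure=EIGHT_CONN)
        # "where": one bounding box (row0, col0, row1, col1) per component.
        boxes = [(s[0].start, s[1].start, s[0].stop, s[1].stop)
                 for s in ndimage.find_objects(labels)]
        quads.append({
            "where": boxes,
            "what": (class_names[b], class_names[a]),        # before/after classes
            "how": f"{class_names[b]}-to-{class_names[a]}",  # transition type
            "how_many": n,
        })
    return quads
```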
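And for response 2, a small sketch of the promised error-bar protocol in item (iii), with placeholder numbers rather than results from the paper.

```python
# Illustrative sketch of the error-bar protocol promised in item (iii):
# aggregate a metric over five independently seeded runs and report
# mean ± sample standard deviation. The values below are placeholders,
# not results from the paper.
import statistics

sek_per_seed = [17.6, 17.9, 17.8, 18.0, 17.7]  # hypothetical per-seed Sek (%)
mean = statistics.mean(sek_per_seed)
std = statistics.stdev(sek_per_seed)
print(f"Sek = {mean:.2f} ± {std:.2f} over {len(sek_per_seed)} seeds")
```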

Circularity Check

0 steps flagged

No circularity: mask-to-text transcription is direct and deterministic, results are empirical

full rationale

The paper describes a two-stage training pipeline in which change masks are transcribed via fixed templates into semantic quadruples and text descriptions to supply supervision; this transcription step is presented as a zero-cost, rule-based conversion from existing labels rather than a fitted or self-referential quantity. Reported metrics (Sek 17.80%, F_scd 66.14%) are framed as experimental outcomes on the newly constructed Gaza-Change-v2 dataset and compared against external baselines, with no equations, parameters, or uniqueness theorems shown to reduce the claimed gains to the inputs by construction. No self-citations, ansatzes, or renamings of known results appear as load-bearing elements in the provided description. The derivation chain therefore remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that mask-derived text is superior to LLM-generated text and that contrastive alignment works as described; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Bi-directional contrastive loss produces deep alignment between visual features and structured textual embeddings
    Invoked in the description of the multimodal decoder stage; a minimal sketch of the standard symmetric form follows below.
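
The axiom names a bi-directional contrastive loss without giving its form. Below is a minimal sketch under the assumption that it is the standard symmetric (CLIP-style) InfoNCE objective; the paper's exact formulation, temperature, and batching may differ, and the function name is illustrative.

```python
# Minimal CLIP-style bi-directional contrastive loss, assuming the paper's
# objective is the standard symmetric InfoNCE; its exact form may differ.
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb: torch.Tensor,
                                   txt_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize so the dot product is a cosine similarity.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Image-to-text and text-to-image cross-entropy, averaged.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Usage: paired visual features and structured-text embeddings of shape (B, D).
loss = bidirectional_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```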

pith-pipeline@v0.9.0 · 5634 in / 1251 out tokens · 33056 ms · 2026-05-11T01:20:55.931356+00:00 · methodology

discussion (0)

