arxiv: 2605.04409 · v1 · submitted 2026-05-06 · 💻 cs.CV

UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

Yupeng Gao, Tianyu Li, Guoqing Wang, Yang Yang

Pith reviewed 2026-05-08 17:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords change captioning · remote sensing · UAV imagery · urban construction · prototype learning · multi-task learning · change detection


The pith

PTNet uses a learnable prototype bank to model structured change semantics for generating natural language descriptions of urban construction changes from UAV image pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Remote sensing change captioning seeks natural language descriptions of how scenes evolve between two images instead of just binary change masks. Existing methods rely on implicit feature differences and cannot easily reconcile the distinct needs of accurate change detection with semantically coherent descriptions. PTNet addresses this by maintaining a bank of learnable prototypes that represent common change patterns, using them to align features across time steps, gating representations so detection and captioning do not interfere, and feeding detection outputs as spatial guidance into the caption decoder. The authors also release UCCD, a new benchmark of 9,000 high-resolution UAV pairs focused on urban construction with 45,000 annotated sentences. Experiments on UCCD and an existing dataset show PTNet produces more accurate and coherent results than prior approaches.

Core claim

PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity.

What carries the argument

A learnable prototype bank that captures structured change semantics, guides cross-temporal feature alignment, and supports task-specific disentanglement in a joint change detection and captioning model.
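
One way to picture how a prototype bank might guide cross-temporal interaction is the NumPy sketch below. It is an illustration under stated assumptions, not the authors' implementation: the paper summary says only that the prototypes modulate bidirectional cross-attention, so the mechanism chosen here (appending prototype vectors to the keys and values of each attention direction), the function names, and the simplified shapes (a bank of shape K×D rather than K×N×D) are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query, context, prototypes):
    """Attend from `query` tokens to `context` tokens plus prototype tokens.

    query, context: (N, D) features from the two time steps.
    prototypes:     (K, D) hypothetical prototype bank.
    No learned projections are used, for brevity.
    """
    keys = np.concatenate([context, prototypes], axis=0)   # (N + K, D)
    scores = query @ keys.T / np.sqrt(query.shape[-1])     # (N, N + K)
    weights = softmax(scores, axis=-1)
    return query + weights @ keys                          # residual update

def prototype_guided_interaction(f1, f2, prototypes):
    # Bidirectional: each time step attends to the other one and to the bank.
    g1 = cross_attend(f1, f2, prototypes)
    g2 = cross_attend(f2, f1, prototypes)
    return g1, g2

rng = np.random.default_rng(0)
f1 = rng.normal(size=(16, 32))    # time-1 tokens (synthetic)
f2 = rng.normal(size=(16, 32))    # time-2 tokens (synthetic)
bank = rng.normal(size=(8, 32))   # K = 8 prototypes (synthetic)
g1, g2 = prototype_guided_interaction(f1, f2, bank)
print(g1.shape, g2.shape)  # (16, 32) (16, 32)
```

In this simplified form the prototypes act as extra memory slots that both attention directions can read from, which is one plausible reading of "guides cross-temporal feature alignment".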

If this is right

Joint detection and captioning yields spatially grounded descriptions that align with actual changed regions.
Explicit prototypes allow the model to handle complex, multi-object urban changes more coherently than implicit differencing.
The UCCD benchmark provides a standardized testbed for future work on high-resolution construction monitoring.
Detection priors injected into captioning improve fine-grained spatial sensitivity without sacrificing semantic quality.
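
The last point, detection outputs acting as spatial guidance for the caption decoder, can be pictured with a small NumPy sketch. The weighting rule, the `strength` parameter, and the function name are assumptions made for illustration; the source does not specify how the prior is injected.

```python
import numpy as np

def inject_spatial_prior(tokens, change_prob, strength=1.0):
    """Scale visual tokens by a detection-derived change probability map.

    tokens:      (H*W, D) visual tokens in row-major spatial order.
    change_prob: (H, W) per-location change probability in [0, 1].
    strength:    hypothetical scalar controlling how strongly the prior acts.
    """
    weights = 1.0 + strength * change_prob.reshape(-1, 1)  # (H*W, 1)
    return tokens * weights

tokens = np.ones((4, 3))                      # 2x2 grid, D = 3 (synthetic)
prob = np.array([[0.0, 1.0], [0.5, 0.0]])     # synthetic change map
out = inject_spatial_prior(tokens, prob)
print(out[:, 0])
```

Tokens at locations with zero change probability pass through unchanged, while tokens at likely-changed locations are amplified before they reach the decoder.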

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The prototype approach could transfer to other change-description tasks such as vegetation or infrastructure monitoring if the bank is initialized from domain-specific data.
If prototypes prove stable across datasets, the method might support lighter supervision for new regions rather than full retraining.
Real-time UAV streams could feed the same prototype bank to produce ongoing natural-language summaries of construction activity.

Load-bearing premise

A learnable prototype bank can reliably capture and generalize structured change semantics across diverse urban construction scenarios without overfitting to the training distribution.

What would settle it

Evaluating PTNet on a new UAV dataset of urban construction changes from cities or construction types absent from UCCD training data, then checking whether caption coherence and accuracy gains disappear compared with baselines.

Figures

Figures reproduced from arXiv: 2605.04409 by Guoqing Wang, Tianyu Li, Yang Yang, Yupeng Gao.

**Figure 1.** (a) Single-task methods that produce either a change mask or a caption, (b) existing joint methods that suffer from feature conflicts and inaccurate descriptions, and (c) the proposed PTNet, which introduces prototype-guided semantic modeling and task-adaptive feature decoupling for accurate and spatially faithful change captioning.

**Figure 2.** Overall architecture of the proposed PTNet.

**Figure 3.** (a) Prototype bank construction: training-set difference features are clustered via K-means and spatially recovered via RBF interpolation to form the learnable prototype bank P ∈ R^{K×N×D}. (b) PG-CAI Block: P modulates bidirectional cross-attention between {F_1^i, F_2^i}, producing change-aware features {G_1^i, G_2^i}. (c) Change Captioning Decoder: change-aware features are projected into the LLM token spac…

**Figure 4.** Overview of the UCCD dataset construction and statistical analysis. (a) Data annotation pipeline for UCCD dataset construction. (b) Sentence length distribution across Train, Val, and Test splits. (c) Part-of-speech distribution of all captions, with nouns (29.4%) and verbs (25.1%) dominating, reflecting the action-oriented nature of change descriptions. (d) Inter-annotator semantic consistency across four…

**Figure 5.** Qualitative comparison on WHU-CDC and UCCD. Red text highlights erroneous or hallucinated descriptions.

Original abstract

Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity. Furthermore, we construct UCCD, a large-scale UAV-based benchmark comprising 9,000 high-resolution image pairs and 45,000 annotated sentences for urban construction monitoring. Extensive experiments on UCCD and WHU-CDC demonstrate that PTNet consistently outperforms existing methods. The dataset and source code are publicly available at https://github.com/G124556/ptnet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds a solid new UAV construction change dataset and a prototype-based captioning model that beats baselines, but the prototype bank's role in the gains is not clearly isolated.


The main contribution here is the UCCD benchmark: 9,000 high-resolution UAV image pairs focused on urban construction scenes, with 45,000 annotated sentences. That fills a gap in existing RSICC datasets, which tend to cover broader or lower-resolution scenes. PTNet pairs this with a prototype bank that guides cross-temporal features, multi-head gating to separate the detection and captioning tasks, and injection of detection priors into the caption decoder. Experiments report consistent gains over prior methods on both UCCD and WHU-CDC, and the authors release the data and code.

Referee Report

2 major / 3 minor

Summary. The paper introduces PTNet, a prototype-guided task-adaptive network for remote sensing image change captioning (RSICC) that uses a learnable prototype bank to explicitly model structured change semantics, multi-head gating to disentangle change detection and captioning representations, and injection of detection-derived spatial priors into the caption decoder. It also presents UCCD, a new UAV-based benchmark with 9,000 high-resolution bi-temporal image pairs and 45,000 annotated sentences focused on urban construction changes. Experiments claim consistent outperformance over prior methods on both UCCD and the existing WHU-CDC dataset, with public release of data and code.

Significance. If the central claims hold, the work supplies a much-needed high-resolution urban construction benchmark and an architecture that moves RSICC beyond implicit differencing toward explicit semantic modeling. The public dataset and code are clear strengths that support reproducibility and further research in UAV-based monitoring applications.

major comments (2)

[§3.2] §3.2 (Prototype Bank): The learnable prototype bank is presented as the key mechanism for capturing and guiding structured change semantics, yet the manuscript provides no details on prototype count selection, initialization, update rule, or regularization against collapse/overfitting. Because UCCD is newly introduced and the bank is fully learnable, this omission leaves open the possibility that reported gains arise from dataset-specific fitting rather than generalizable semantics.
[§4] §4 (Experiments and Ablations): The ablation studies do not isolate the prototype bank's contribution from the multi-head gating and spatial-prior components. Without a controlled variant that removes or freezes the prototype bank while keeping other modules fixed, it is impossible to attribute the claimed outperformance on UCCD and WHU-CDC specifically to the structured semantic modeling.

minor comments (3)

[§2] The description of the UCCD annotation protocol (number of annotators, quality control, sentence diversity across construction types) is insufficient for a new benchmark paper.
[Figure 3] Figure 3 (architecture diagram) would benefit from explicit labeling of the prototype-bank interaction arrows and the gating module to match the text in §3.
[§4.1] The abstract and §4.1 state that PTNet 'consistently outperforms' existing methods, but no statistical significance tests or variance across multiple runs are reported.
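
The run-to-run reporting requested in the last minor comment could look like the sketch below. The scores are made-up placeholder values for illustration only, not results from the paper, and the function and variable names are hypothetical.

```python
from statistics import mean, stdev

def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return mean(scores), stdev(scores)

# Placeholder metric values for three hypothetical seeds of one method.
method_runs = [0.50, 0.52, 0.51]
m, s = summarize_runs(method_runs)
print(f"{m:.3f} +/- {s:.3f}")
```

Reporting each method this way, over the same seeds, would let readers judge whether a gap between two methods exceeds seed-to-seed noise.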

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and experimental rigor, and we will revise the paper to address them fully.


Referee: [§3.2] §3.2 (Prototype Bank): The learnable prototype bank is presented as the key mechanism for capturing and guiding structured change semantics, yet the manuscript provides no details on prototype count selection, initialization, update rule, or regularization against collapse/overfitting. Because UCCD is newly introduced and the bank is fully learnable, this omission leaves open the possibility that reported gains arise from dataset-specific fitting rather than generalizable semantics.

Authors: We agree that the current description of the prototype bank lacks sufficient implementation details for full reproducibility and to rule out dataset-specific effects. In the revised manuscript, we will expand §3.2 (and add corresponding material to the supplement) with explicit descriptions of prototype count selection, initialization strategy, the update rule during training, and any regularization applied to prevent collapse or overfitting. These additions will clarify how the bank models generalizable structured change semantics rather than fitting idiosyncrasies of UCCD. revision: yes
Referee: [§4] §4 (Experiments and Ablations): The ablation studies do not isolate the prototype bank's contribution from the multi-head gating and spatial-prior components. Without a controlled variant that removes or freezes the prototype bank while keeping other modules fixed, it is impossible to attribute the claimed outperformance on UCCD and WHU-CDC specifically to the structured semantic modeling.

Authors: We acknowledge that the existing ablations do not isolate the prototype bank's specific contribution. In the revised §4, we will introduce a controlled ablation that removes or freezes the prototype bank while holding the multi-head gating and spatial-prior components fixed. Performance differences on both UCCD and WHU-CDC will be reported to directly attribute gains to the structured semantic modeling. revision: yes

Circularity Check

0 steps flagged

No circularity in PTNet derivation or UCCD benchmark claims

full rationale

The paper presents PTNet as an architectural proposal (learnable prototype bank guiding cross-temporal interaction, multi-head gating for disentanglement, and detection-derived spatial priors) whose behavior is defined by standard neural network components rather than by construction equaling any fitted output or prior result. Claims rest on empirical outperformance on the newly introduced UCCD dataset (9k pairs) and the external WHU-CDC benchmark, with no equations, self-citations, or uniqueness theorems shown that reduce the reported gains to tautological inputs. The derivation chain is therefore self-contained and externally falsifiable via the released code and data.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the effectiveness of newly introduced components (prototype bank and gating) whose behavior is learned from data rather than derived from first principles; standard deep-learning assumptions about representation learning are invoked without additional justification.

free parameters (2)

prototype bank size
The number of learnable prototypes is a hyperparameter that must be chosen in advance; the prototypes themselves are then trained on the data to represent change semantics.
multi-head gating weights
Gating parameters are learned during end-to-end training to disentangle task-specific representations.
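
A minimal NumPy sketch of what task-specific gating over shared features could look like follows. It assumes one sigmoid gate per task applied elementwise to the shared representation; the class name, the gate parameterization, and the two task labels are illustrative assumptions, since the source does not give the gating equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TaskGates:
    """One gate per task; each gate rescales the shared features elementwise."""

    def __init__(self, dim, tasks=("detection", "captioning"), seed=0):
        rng = np.random.default_rng(seed)
        # Hypothetical gate parameters: a weight matrix and a bias per task.
        # In a real model these would be learned end to end.
        self.params = {
            t: (rng.normal(scale=0.02, size=(dim, dim)), np.zeros(dim))
            for t in tasks
        }

    def __call__(self, shared):
        out = {}
        for task, (w, b) in self.params.items():
            gate = sigmoid(shared @ w + b)   # gate values lie in (0, 1)
            out[task] = gate * shared        # task-specific view of the features
        return out

rng = np.random.default_rng(1)
shared = rng.normal(size=(16, 32))           # synthetic shared features
branches = TaskGates(dim=32)(shared)
print({task: feats.shape for task, feats in branches.items()})
```

Because each task reads the shared features through its own gate, the two branches can emphasize different channels without needing separate encoders.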

axioms (2)

domain assumption: Paired bi-temporal images contain sufficient visual information to support both change localization and natural-language description.
This is the foundational premise of the RSICC task and is invoked throughout the motivation and method description.
domain assumption: Neural networks trained with standard supervision can learn disentangled and semantically meaningful representations when guided by prototypes.
Standard assumption underlying the prototype bank and multi-head gating design.

invented entities (1)

learnable prototype bank (no independent evidence)
purpose: To explicitly represent and guide structured change semantics across time steps.
New component introduced to move beyond implicit feature differencing.
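
The Figure 3 caption describes the bank as initialized by K-means clustering of training-set difference features. A plain NumPy version of that clustering step might look like the sketch below. The difference features are taken here to be a simple subtraction of synthetic time-1 and time-2 features, which is an assumption; the RBF interpolation step is omitted; and all names are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-means. Returns (centroids, labels).

    x: (M, D) data points. Labels come from the last assignment step.
    """
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid: (M, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):                 # keep the old centroid if empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(0)
feats_t1 = rng.normal(size=(500, 32))        # synthetic time-1 features
feats_t2 = rng.normal(size=(500, 32))        # synthetic time-2 features
diff = feats_t2 - feats_t1                   # assumed form of "difference features"
prototypes, labels = kmeans(diff, k=8)       # K = 8 is the value quoted for UCCD
print(prototypes.shape)  # (8, 32)
```

The resulting centroids would serve only as a starting point; per the source, the bank remains learnable during training.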


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/DimensionForcing (2^D=8 from D=3): superficially the number 8 appears, but here it is a hand-tuned hyperparameter, not a forced period. Not applicable; no derivational link. Tag: unclear
Paper passage: For WHU-CDC and UCCD, the number of prototype clusters K is set to 5 and 8, respectively, determined by the semantic diversity of change types in each dataset.

Foundation/RealityFromDistinction (reality_from_one_distinction): contrast with the RS parameter-free chain. Tag: unclear
Paper passage: We employ the AdamW optimizer with a global initial learning rate of 1e-4... LoRA with rank r=16 and r=64... Training proceeds for 200 epochs.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

63 extracted references · 7 canonical work pages

[1]

Change cap- tioning: A new paradigm for multitemporal remote sensing image analysis

Genc Hoxha, Saliha Chouaf, Farid Melgani, and Youcef Smara. Change cap- tioning: A new paradigm for multitemporal remote sensing image analysis. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2022

2022
[2]

Change3d: Revisiting change detection and captioning from a video modeling perspective

Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, and Zhenfeng Shao. Change3d: Revisiting change detection and captioning from a video modeling perspective. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24011–24022, 2025

2025
[3]

Cd4c: Change detection for remote sensing image change captioning.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

Xiliang Li, Bin Sun, Zhenhua Wu, Shutao Li, and Hu Guo. Cd4c: Change detection for remote sensing image change captioning.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

2025
[4]

Pixel-level change detection pseudo-label learning for remote sensing change captioning

Chenyang Liu, Keyan Chen, Zipeng Qi, Zili Liu, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi. Pixel-level change detection pseudo-label learning for remote sensing change captioning. InIGARSS 2024-2024 IEEE Interna- tional Geoscience and Remote Sensing Symposium, pages 8405–8408. IEEE, 2024

2024
[5]

Change caption- ing for satellite images time series.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

Wei Peng, Ping Jian, Zhuqing Mao, and Yingying Zhao. Change caption- ing for satellite images time series.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

2024
[6]

Gomez, Łukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, pages 5998–6008, 2017

2017
[7]

An im- age is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An im- age is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

2021
[8]

Rsic-gmamba: A state space model with genetic operations for remote sensing image cap- tioning.IEEE Transactions on Geoscience and Remote Sensing, 2025

Lingwu Meng, Jing Wang, Yan Huang, and Liang Xiao. Rsic-gmamba: A state space model with genetic operations for remote sensing image cap- tioning.IEEE Transactions on Geoscience and Remote Sensing, 2025

2025
[9]

Mask approximation net: A novel diffusion model approach 16 Y

Dongwei Sun, Jing Yao, Wu Xue, Changsheng Zhou, Pedram Ghamisi, and Xiangyong Cao. Mask approximation net: A novel diffusion model approach 16 Y. Gao et al. for remote sensing change captioning.IEEE transactions on geoscience and remote sensing, 2025

2025
[10]

RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9):1477, 2024

Bin Zhang, Shuting Zhao, Yuqi Liang, Jiaming Ye, Shuai Lu, and Jiawei Ma. RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9):1477, 2024

2024
[11]

Describing land cover changes via multi-temporal remote sensing image cap- tioning using llm, vit, and lora.Remote Sensing, 18(1):166, 2026

Javier Lamar León, Vitor Nogueira, Pedro Salgueiro, and Paulo Quaresma. Describing land cover changes via multi-temporal remote sensing image cap- tioning using llm, vit, and lora.Remote Sensing, 18(1):166, 2026

2026
[12]

Multi-task learning for dense prediction tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021

Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021

2021
[14]

Detection assisted change captioning for remote sensing image

Xiliang Li, Bin Sun, and Shutao Li. Detection assisted change captioning for remote sensing image. InIGARSS 2024-2024 IEEE International Geo- science and Remote Sensing Symposium, pages 10454–10458. IEEE, 2024

2024
[15]

Change-agent: Toward interactive comprehensive remote sens- ing change interpretation and analysis.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

Chenyang Liu, Keyan Chen, Haotian Zhang, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi. Change-agent: Toward interactive comprehensive remote sens- ing change interpretation and analysis.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

2024
[16]

Scnet: Lightweight spatial-channel attention network for remote sensing change captioning.IEEE Transactions on Geoscience and Remote Sensing, 2026

Dongwei Sun, Yuduo Wang, Jing Yao, Weikang Yu, Xiangyong Cao, and Pedram Ghamisi. Scnet: Lightweight spatial-channel attention network for remote sensing change captioning.IEEE Transactions on Geoscience and Remote Sensing, 2026

2026
[17]

Remote sensing spatiotemporal vision–language models: A comprehensive survey.IEEE Geoscience and Remote Sensing Magazine, 2025

Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, and Zhenwei Shi. Remote sensing spatiotemporal vision–language models: A comprehensive survey.IEEE Geoscience and Remote Sensing Magazine, 2025

2025
[18]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning (ICML), pages 8748–8763, 2021

2021
[19]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page Pith review arXiv 2018
[20]

SNUNet-CD: A densely connected siamese network for change detection of VHR images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022

Sheng Fang, Kaiyu Li, Jinyuan Shao, and Zhe Li. SNUNet-CD: A densely connected siamese network for change detection of VHR images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022

2022
[21]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. Change Captioning for Urban Construction Monitoring 17

2016
[22]

Fully con- volutional siamese networks for change detection

Rodrigo Caye Daudt, Bertrand Le Saux, and Alexandre Boulch. Fully con- volutional siamese networks for change detection. InProceedings of the IEEE International Conference on Image Processing (ICIP), pages 4063– 4067, 2018

2018
[23]

A spatial-temporal attention-based method and a new dataset for remote sensing image change detection.Remote Sensing, 12(10):1662, 2020

Hao Chen and Zhenwei Shi. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection.Remote Sensing, 12(10):1662, 2020

2020
[24]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), pages 10012–10022, 2021

2021
[25]

Remote sensing change detection with transformers trained from scratch.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

Mustansar Noman, Mustansar Fiaz, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Remote sensing change detection with transformers trained from scratch.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

2024
[26]

Remote sensing image change detection with transformers.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021

Hao Chen, Zipeng Qi, and Zhenwei Shi. Remote sensing image change detection with transformers.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021

2021
[27]

Wele Gedara Chaminda Bandara and Vishal M. Patel. A transformer- based siamese network for change detection. InProceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 207–210, 2022

2022
[28]

Intertemporalinteractionandsymmetricdifferencelearningforremotesens- ingimagechangecaptioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

Yunpeng Li, Xiangrong Zhang, Xina Cheng, Puhua Chen, and Licheng Jiao. Intertemporalinteractionandsymmetricdifferencelearningforremotesens- ingimagechangecaptioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

2024
[30]

Changes to captions: An attentive network for remote sensing change captioning.IEEE Transactions on Image Processing, 32:6047–6060, 2023

Shizhen Chang and Pedram Ghamisi. Changes to captions: An attentive network for remote sensing change captioning.IEEE Transactions on Image Processing, 32:6047–6060, 2023

2023
[31]

A decoupling paradigm with prompt learning for remote sensing image change captioning.IEEE Transactions on Geoscience and Remote Sensing, 61:1–18, 2023

Chenyang Liu, Rui Zhao, Jianqi Chen, Zipeng Qi, Zhengxia Zou, and Zhen- wei Shi. A decoupling paradigm with prompt learning for remote sensing image change captioning.IEEE Transactions on Geoscience and Remote Sensing, 61:1–18, 2023

2023
[32]

RSCaMa: Remote sensing image change captioning with state space model.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi. RSCaMa: Remote sensing image change captioning with state space model.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

2024
[33]

Remote sensing image change captioning using multi-attentive network with diffusion model.Remote Sensing, 16(21):4083, 2024

Yunpeng Yang, Tingting Liu, Yonggang Pu, Lianming Liu, Qing Zhao, and Qian Wan. Remote sensing image change captioning using multi-attentive network with diffusion model.Remote Sensing, 16(21):4083, 2024

2024
[34]

Semantic-CC: Boosting remote sensing image change cap- 18 Y

Haoran Liu, Yibo Zhao, Yuan Jin, Keyan Li, Jiaqi Chen, Zhengxia Zou, and Zhenwei Shi. Semantic-CC: Boosting remote sensing image change cap- 18 Y. Gao et al. tioning via foundational knowledge and semantic guidance.arXiv preprint arXiv:2407.14032, 2024

work page arXiv 2024
[35]

Enhancing perception of key changes in remote sensing image change captioning.IEEE Transactions on Image Processing, 2025

Cong Yang, Zuchao Li, Hongzan Jiao, Zhi Gao, and Lefei Zhang. Enhancing perception of key changes in remote sensing image change captioning.IEEE Transactions on Image Processing, 2025

2025
[36]

Visual in- struction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual in- struction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

2023
[37]

BLIP-2: Bootstrap- ping language-image pre-training with frozen image encoders and large lan- guage models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrap- ping language-image pre-training with frozen image encoders and large lan- guage models. InProceedings of the International Conference on Machine Learning (ICML), pages 19730–19742, 2023

2023
[38]

Advancing plain vision transformer towards remote sensing foundation model.IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023

Di Wang, Qiming Zhang, Yanxing Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. Advancing plain vision transformer towards remote sensing foundation model.IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023

2023
[39]

RSVQA: Vi- sual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020

Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. RSVQA: Vi- sual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020

2020
[40]

GeoChat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Khan, Salman Khan, and Fahad Shahbaz Khan. GeoChat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27831–27840, 2024

2024
[41]

anything-to-image

Yunpeng Wang, Wenbo Li, Jian Gong, Michael Kopp, and Devis Tuia. EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering.arXiv preprint arXiv:2312.12222, 2023

work page arXiv 2023
[42]

ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning

Pei Deng, Wenqian Zhou, and Hanlin Wu. ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning. arXiv preprint arXiv:2409.08582, 2025

work page arXiv 2025
[43]

arXiv preprint arXiv:2409.16261

Mustansar Noman, Noor Ahsan, Muzammal Naseer, Hisham Cholakkal, RaoMuhammadAnwer,SalmanKhan,andFahadShahbazKhan. CDChat: A large multimodal model for remote sensing change description.arXiv preprint arXiv:2409.16261, 2024

work page arXiv 2024
[44]

BTCChat: Advancing remote sensing bi -temporal change captioning with multimodal large language model,

Yujie Li et al. BTCChat: Advancing remote sensing bi-temporal change captioning with multimodal large language model.arXiv preprint arXiv:2509.05895, 2025

work page arXiv 2025
[45]

arXiv preprint arXiv:2410.10047 (2024)

Yuchao Wang, Wele Gedara Chaminda Yu, Michael Kopp, and Devis Tuia. ChangeMinds: Multi-task framework for detecting and describing changes in remote sensing.arXiv preprint arXiv:2410.10047, 2024

work page arXiv 2024
[46]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Represen- tations (ICLR), 2022

2022
[47]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariha- ran, and Serge Belongie. Feature pyramid networks for object detection. Change Captioning for Urban Construction Monitoring 19 InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017

2017
[48]

Rbfim: Perceptual quality assessment for compressed point clouds using radial basis function interpolation

Zhang Chen, Shuai Wan, Siyu Ren, Fuzheng Yang, Mengting Yu, and Junhui Hou. Rbfim: Perceptual quality assessment for compressed point clouds using radial basis function interpolation. IEEE Transactions on Multimedia, 27:8579–8591, 2025

2025
[49]

Shikun Liu, Edward Johns, and Andrew J. Davison. End-to-end multi-task learning with attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1871–1880, 2019

2019
[50]

Lightglue: Local feature matching at light speed

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17627–17638, 2023

2023
[51]

Kunping Yang, Jianchong Wei, Chengbin Chen, Zhensheng Wang, Junhui Lan, Xuanping Li, Duwei Hua, Dingli Xue, and Yi Wu. Restricted supervised cascade information network for remote sensing change captioning with serial sentences. International Journal of Applied Earth Observation and Geoinformation, 142:104686, 2025

2025
[52]

A multitask network and two large-scale datasets for change detection and captioning in remote sensing images

Jingye Shi, Mengge Zhang, Yuewu Hou, Ruicong Zhi, and Jiqiang Liu. A multitask network and two large-scale datasets for change detection and captioning in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–17, 2024

2024
[53]

Ali Can Karaca, Enes Ozelbas, Saadettin Berber, Orkhan Karimli, Turabi Yildirim, and Mehmet Fatih Amasyali. Robust change captioning in remote sensing: SECOND-CC dataset and MModalCC framework. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 18:21494–21513, 2025

2025
[54]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics

2002
[55]

METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

2005
[56]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics
[58]

CIDEr: Consensus-based image description evaluation

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015

2015
[59]

IoU loss for 2D/3D object detection

Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. IoU loss for 2D/3D object detection. In 2019 International Conference on 3D Vision (3DV), pages 85–94. IEEE, 2019

2019
[60]

Nafiseh Ghasemian Sorboni, Jinfei Wang, and Mohammad Reza Najafi. Fusion of google street view, lidar, and orthophoto classifications using ranking classes based on f1 score for building land-use type detection. Remote Sensing, 16(11):2011, 2024

2024
[61]

Saras-net: Scale and relation aware siamese network for change detection

Chao-Peng Chen, Jun-Wei Hsieh, Ping-Yang Chen, Yi-Kuan Hsieh, and Bor-Shiun Wang. Saras-net: Scale and relation aware siamese network for change detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 14187–14195, 2023

2023
[62]

Describing and localizing multiple changes with transformers

Yue Qiu, Shintaro Yamamoto, Kazutoshi Nakashima, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka, and Yutaka Satoh. Describing and localizing multiple changes with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1951–1960, 2021

2021
[63]

Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset

Chenyang Liu, Rui Zhao, Hao Chen, Zhengxia Zou, and Zhenwei Shi. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset. IEEE Transactions on Geoscience and Remote Sensing, 60:1–20, 2022

2022
[64]

Progressive scale-aware network for remote sensing image change captioning

Chenyang Liu, Jiajun Yang, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi. Progressive scale-aware network for remote sensing image change captioning. In IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium, pages 6668–6671. IEEE, 2023

2023
[65]

Diffusion-RSCC: Diffusion probabilistic model for change captioning in remote sensing images

Xiaofei Yu, Yitong Li, Jie Ma, Chang Li, and Hanlin Wu. Diffusion-RSCC: Diffusion probabilistic model for change captioning in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2025

2025