pith. machine review for the scientific record.

arxiv: 2605.04409 · v1 · submitted 2026-05-06 · 💻 cs.CV

UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

Pith reviewed 2026-05-08 17:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords change captioning · remote sensing · UAV imagery · urban construction · prototype learning · multi-task learning · change detection

The pith

PTNet uses a learnable prototype bank to model structured change semantics for generating natural language descriptions of urban construction changes from UAV image pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Remote sensing change captioning seeks natural language descriptions of how scenes evolve between two images instead of just binary change masks. Existing methods rely on implicit feature differences and cannot easily reconcile the distinct needs of accurate change detection with semantically coherent descriptions. PTNet addresses this by maintaining a bank of learnable prototypes that represent common change patterns, using them to align features across time steps, gating representations so detection and captioning do not interfere, and feeding detection outputs as spatial guidance into the caption decoder. The authors also release UCCD, a new benchmark of 9,000 high-resolution UAV pairs focused on urban construction with 45,000 annotated sentences. Experiments on UCCD and an existing dataset show PTNet produces more accurate and coherent results than prior approaches.

Core claim

PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity.
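To make the gating part of this claim concrete, here is a minimal sketch of task-specific gating over a shared feature vector. It is not the authors' implementation: the function names, the per-channel sigmoid gate, and the toy weights are assumptions chosen only to show how two tasks can read differently reweighted copies of the same features.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Logistic function, used here as the gate nonlinearity."""
    return 1.0 / (1.0 + math.exp(-x))


def task_gates(
    shared: List[float],
    gate_weights: Dict[str, List[float]],
    gate_bias: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Return one gated copy of `shared` per task.

    Hypothetical gate form (an assumption, not taken from the paper):
        gate_c = sigmoid(w_c * x_c + b_c)
        out_c  = gate_c * x_c
    """
    out: Dict[str, List[float]] = {}
    for task, weights in gate_weights.items():
        biases = gate_bias[task]
        out[task] = [
            sigmoid(w * x + b) * x for x, w, b in zip(shared, weights, biases)
        ]
    return out


# Toy example: the two tasks emphasise opposite channels.
shared = [1.0, -2.0, 0.5]
weights = {"detection": [4.0, 0.0, -4.0], "captioning": [-4.0, 0.0, 4.0]}
bias = {"detection": [0.0, 0.0, 0.0], "captioning": [0.0, 0.0, 0.0]}
gated = task_gates(shared, weights, bias)
```

In this toy setup the detection branch keeps most of channel 0 while the captioning branch suppresses it, and the reverse holds for channel 2. Whether PTNet's gates behave this way is not something the summary above establishes.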

What carries the argument

A learnable prototype bank that captures structured change semantics, guides cross-temporal feature alignment, and supports task-specific disentanglement in a joint change detection and captioning model.
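One way to picture a prototype bank guiding cross-temporal alignment is as an extra bias on cross-attention logits. The sketch below assumes that form: each token gets a soft assignment over prototypes, and query-key pairs with similar assignments receive a higher attention score. The function names, the additive bias, and `bias_scale` are illustrative assumptions; the paper's actual PG-CAI block may work differently.

```python
import math
from typing import List

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_affinity(tokens: List[Vec], prototypes: List[Vec]) -> List[List[float]]:
    """Soft assignment of every token to every prototype."""
    return [softmax([dot(t, p) for p in prototypes]) for t in tokens]


def prototype_guided_attention(
    f_q: List[Vec], f_kv: List[Vec], prototypes: List[Vec], bias_scale: float = 1.0
) -> List[Vec]:
    """Cross-attention from f_q to f_kv with a prototype-agreement bias.

    logit(q, k) = q.k / sqrt(d) + bias_scale * <affinity(q), affinity(k)>
    The bias term is an assumption made for this sketch.
    """
    d = len(f_q[0])
    a_q = prototype_affinity(f_q, prototypes)
    a_k = prototype_affinity(f_kv, prototypes)
    out: List[Vec] = []
    for q, aq in zip(f_q, a_q):
        logits = [
            dot(q, k) / math.sqrt(d) + bias_scale * dot(aq, ak)
            for k, ak in zip(f_kv, a_k)
        ]
        w = softmax(logits)
        out.append([sum(wj * k[c] for wj, k in zip(w, f_kv)) for c in range(d)])
    return out


# Toy bi-temporal features (two tokens each) and two prototypes.
f1 = [[1.0, 0.0], [0.0, 1.0]]
f2 = [[0.9, 0.1], [0.1, 0.9]]
prototypes = [[5.0, 0.0], [0.0, 5.0]]

# "Bidirectional" here simply means running the same routine both ways.
g1 = prototype_guided_attention(f1, f2, prototypes)
g2 = prototype_guided_attention(f2, f1, prototypes)
```

Setting `bias_scale=0.0` recovers plain scaled dot-product cross-attention, which gives a simple reference point for seeing what the prototype term adds.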

If this is right

  • Joint detection and captioning yields spatially grounded descriptions that align with actual changed regions.
  • Explicit prototypes allow the model to handle complex, multi-object urban changes more coherently than implicit differencing.
  • The UCCD benchmark provides a standardized testbed for future work on high-resolution construction monitoring.
  • Detection priors injected into captioning improve fine-grained spatial sensitivity without sacrificing semantic quality.
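The last point, detection priors steering the caption decoder, can be illustrated with a very small sketch. It assumes the prior is a per-token change probability and that injection is a multiplicative reweighting of visual tokens before they reach the decoder. Both choices, and the names `inject_spatial_prior` and `alpha`, are hypothetical and stand in for whatever mechanism PTNet actually uses.

```python
from typing import List

Vec = List[float]


def inject_spatial_prior(
    tokens: List[Vec], change_prob: List[float], alpha: float = 1.0
) -> List[Vec]:
    """Scale each visual token by (1 + alpha * p), where p is the detection
    head's change probability for that token's location.

    Tokens in unchanged regions (p = 0) pass through untouched; tokens in
    changed regions are amplified. This form is an assumption for illustration.
    """
    if len(tokens) != len(change_prob):
        raise ValueError("need exactly one change probability per token")
    return [
        [(1.0 + alpha * p) * v for v in tok] for tok, p in zip(tokens, change_prob)
    ]


# Two identical tokens; only the second lies in a region flagged as changed.
tokens = [[1.0, 2.0], [1.0, 2.0]]
guided = inject_spatial_prior(tokens, change_prob=[0.0, 1.0], alpha=0.5)
```

A multiplicative form keeps the unchanged-region tokens intact, which matches the stated goal of adding spatial sensitivity without discarding semantic content.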

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The prototype approach could transfer to other change-description tasks such as vegetation or infrastructure monitoring if the bank is initialized from domain-specific data.
  • If prototypes prove stable across datasets, the method might support lighter supervision for new regions rather than full retraining.
  • Real-time UAV streams could feed the same prototype bank to produce ongoing natural-language summaries of construction activity.

Load-bearing premise

A learnable prototype bank can reliably capture and generalize structured change semantics across diverse urban construction scenarios without overfitting to the training distribution.
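One inexpensive sanity check on this premise is whether the learned prototypes stay distinct or drift toward one another. The sketch below computes the mean pairwise cosine similarity of a set of prototype vectors. It treats each prototype as a flat vector, which is a simplification of the paper's bank, and the diagnostic itself is a suggestion rather than something the paper reports.

```python
import math
from itertools import combinations
from typing import List

Vec = List[float]


def cosine(a: Vec, b: Vec) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def mean_pairwise_cosine(prototypes: List[Vec]) -> float:
    """Average cosine similarity over all unordered prototype pairs.

    Values near 1.0 suggest the bank has collapsed onto a single direction;
    values near 0.0 suggest the prototypes are spread out.
    """
    pairs = list(combinations(prototypes, 2))
    if not pairs:
        raise ValueError("need at least two prototypes")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


spread_out = mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0]])
collapsed = mean_pairwise_cosine([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
```

Tracking this number over training, or comparing it between banks trained on different datasets, would give a rough indication of redundancy, though it says nothing by itself about generalization.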

What would settle it

Evaluating PTNet on a new UAV dataset of urban construction changes from cities or construction types absent from UCCD training data, then checking whether caption coherence and accuracy gains disappear compared with baselines.
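The check described above reduces to comparing two gains over a baseline. A minimal sketch, with a hypothetical helper name and placeholder scores that are not results from the paper:

```python
def gain_retention(
    in_model: float, in_base: float, ood_model: float, ood_base: float
) -> float:
    """Fraction of the in-distribution gain that survives out of distribution.

    1.0 means the gain over the baseline fully transfers, 0.0 means it
    vanishes, and negative values mean the baseline wins on the new data.
    """
    in_gain = in_model - in_base
    if in_gain <= 0:
        raise ValueError("no in-distribution gain to retain")
    return (ood_model - ood_base) / in_gain


# Placeholder numbers for illustration only (any caption metric, in points).
retention = gain_retention(in_model=50.0, in_base=40.0, ood_model=42.0, ood_base=40.0)
```

With these placeholder values only a fifth of the gain carries over, which is the kind of outcome that would count against the load-bearing premise.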

Figures

Figures reproduced from arXiv: 2605.04409 by Guoqing Wang, Tianyu Li, Yang Yang, Yupeng Gao.

Figure 1: (a) single-task methods that produce either a change mask or a caption, (b) existing joint methods that suffer from feature conflicts and inaccurate descriptions, and (c) the proposed PTNet, which introduces prototype-guided semantic modeling and task-adaptive feature decoupling for accurate and spatially faithful change captioning.
Figure 2: Overall architecture of the proposed PTNet.
Figure 3: (a) Prototype bank construction: training-set difference features are clustered via K-means and spatially recovered via RBF interpolation to form the learnable prototype bank $P \in \mathbb{R}^{K \times N \times D}$. (b) PG-CAI Block: $P$ modulates bidirectional cross-attention between $\{F_1^i, F_2^i\}$, producing change-aware features $\{G_1^i, G_2^i\}$. (c) Change Captioning Decoder: change-aware features are projected into the LLM token spac…
Figure 4: Overview of the UCCD dataset construction and statistical analysis. (a) Data annotation pipeline for UCCD dataset construction. (b) Sentence length distribution across Train, Val, and Test splits. (c) Part-of-speech distribution of all captions, with nouns (29.4%) and verbs (25.1%) dominating, reflecting the action-oriented nature of change descriptions. (d) Inter-annotator semantic consistency across four…
Figure 5: Qualitative comparison on WHU-CDC and UCCD. Red text highlights erroneous or hallucinated descriptions.
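The Figure 3 caption says the prototype bank is built by clustering training-set difference features with K-means. A minimal sketch of that first step follows, under several assumptions: features are flat vectors, the "difference" is a plain subtraction, centers are seeded from the first `k` points, and the RBF interpolation step is omitted entirely. The names `kmeans` and `init_prototypes` are illustrative.

```python
from typing import List

Vec = List[float]


def sq_dist(a: Vec, b: Vec) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vec], k: int, n_iters: int = 20) -> List[Vec]:
    """Plain Lloyd's algorithm with deterministic seeding (first k points)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [list(p) for p in points[:k]]
    for _ in range(n_iters):
        # Assignment step: attach every point to its nearest center.
        clusters: List[List[Vec]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[i] for m in members) / len(members) for i in range(dim)]
                )
            else:
                new_centers.append(center)  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def init_prototypes(feats_t1: List[Vec], feats_t2: List[Vec], k: int) -> List[Vec]:
    """Cluster per-sample difference features (t2 - t1) into k prototypes."""
    diffs = [[b - a for a, b in zip(f1, f2)] for f1, f2 in zip(feats_t1, feats_t2)]
    return kmeans(diffs, k)


# Toy features: two samples barely change, two change strongly.
t1 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
t2 = [[0.0, 0.0], [10.0, 10.0], [0.0, 2.0], [10.0, 12.0]]
protos = init_prototypes(t1, t2, k=2)
```

In the paper the resulting centers are described as the starting point of a learnable bank, so they would be updated further during training; this sketch stops at initialization.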
Original abstract

Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity. Furthermore, we construct UCCD, a large-scale UAV-based benchmark comprising 9,000 high-resolution image pairs and 45,000 annotated sentences for urban construction monitoring. Extensive experiments on UCCD and WHU-CDC demonstrate that PTNet consistently outperforms existing methods. The dataset and source code are publicly available at https://github.com/G124556/ptnet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces PTNet, a prototype-guided task-adaptive network for remote sensing image change captioning (RSICC) that uses a learnable prototype bank to explicitly model structured change semantics, multi-head gating to disentangle change detection and captioning representations, and injection of detection-derived spatial priors into the caption decoder. It also presents UCCD, a new UAV-based benchmark with 9,000 high-resolution bi-temporal image pairs and 45,000 annotated sentences focused on urban construction changes. Experiments claim consistent outperformance over prior methods on both UCCD and the existing WHU-CDC dataset, with public release of data and code.

Significance. If the central claims hold, the work supplies a much-needed high-resolution urban construction benchmark and an architecture that moves RSICC beyond implicit differencing toward explicit semantic modeling. The public dataset and code are clear strengths that support reproducibility and further research in UAV-based monitoring applications.

major comments (2)
  1. [§3.2] §3.2 (Prototype Bank): The learnable prototype bank is presented as the key mechanism for capturing and guiding structured change semantics, yet the manuscript provides no details on prototype count selection, initialization, update rule, or regularization against collapse/overfitting. Because UCCD is newly introduced and the bank is fully learnable, this omission leaves open the possibility that reported gains arise from dataset-specific fitting rather than generalizable semantics.
  2. [§4] §4 (Experiments and Ablations): The ablation studies do not isolate the prototype bank's contribution from the multi-head gating and spatial-prior components. Without a controlled variant that removes or freezes the prototype bank while keeping other modules fixed, it is impossible to attribute the claimed outperformance on UCCD and WHU-CDC specifically to the structured semantic modeling.
minor comments (3)
  1. [§2] The description of the UCCD annotation protocol (number of annotators, quality control, sentence diversity across construction types) is insufficient for a new benchmark paper.
  2. [Figure 3] Figure 3 (architecture diagram) would benefit from explicit labeling of the prototype-bank interaction arrows and the gating module to match the text in §3.
  3. [§4.1] The abstract and §4.1 state that PTNet 'consistently outperforms' existing methods, but no statistical significance tests or variance across multiple runs are reported.
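The third minor comment asks for significance testing. One common option is a paired bootstrap over test samples, sketched below. It assumes that a per-pair score is available for both systems on the same test items, which is not always true for corpus-level caption metrics, and the function name, resample count, and toy scores are illustrative rather than taken from the paper.

```python
import random
from statistics import mean
from typing import Sequence


def paired_bootstrap(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Fraction of bootstrap resamples in which system A is NOT better than B.

    Both score lists must be aligned on the same test items. A small return
    value indicates that A's advantage is stable under resampling.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and equally long")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if mean(sample) <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-item scores, not results from the paper.
p_value = paired_bootstrap([0.6, 0.7, 0.8], [0.5, 0.6, 0.7])
```

Reporting this alongside the mean and standard deviation over several training seeds would address both halves of the referee's request.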

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving clarity and experimental rigor, and we will revise the paper to address them fully.

point-by-point responses
  1. Referee: [§3.2] §3.2 (Prototype Bank): The learnable prototype bank is presented as the key mechanism for capturing and guiding structured change semantics, yet the manuscript provides no details on prototype count selection, initialization, update rule, or regularization against collapse/overfitting. Because UCCD is newly introduced and the bank is fully learnable, this omission leaves open the possibility that reported gains arise from dataset-specific fitting rather than generalizable semantics.

    Authors: We agree that the current description of the prototype bank lacks sufficient implementation details for full reproducibility and to rule out dataset-specific effects. In the revised manuscript, we will expand §3.2 (and add corresponding material to the supplement) with explicit descriptions of prototype count selection, initialization strategy, the update rule during training, and any regularization applied to prevent collapse or overfitting. These additions will clarify how the bank models generalizable structured change semantics rather than fitting idiosyncrasies of UCCD. revision: yes

  2. Referee: [§4] §4 (Experiments and Ablations): The ablation studies do not isolate the prototype bank's contribution from the multi-head gating and spatial-prior components. Without a controlled variant that removes or freezes the prototype bank while keeping other modules fixed, it is impossible to attribute the claimed outperformance on UCCD and WHU-CDC specifically to the structured semantic modeling.

    Authors: We acknowledge that the existing ablations do not isolate the prototype bank's specific contribution. In the revised §4, we will introduce a controlled ablation that removes or freezes the prototype bank while holding the multi-head gating and spatial-prior components fixed. Performance differences on both UCCD and WHU-CDC will be reported to directly attribute gains to the structured semantic modeling. revision: yes
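The controlled ablation discussed in the second response can be laid out as a small configuration grid. The sketch below enumerates on/off settings for the three named components and picks out the pair of runs that differ only in one of them. The component keys and helper names are hypothetical, and a "frozen" prototype bank would need a third state that this boolean sketch does not model.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical switch names for the three components named in the review.
COMPONENTS = ("prototype_bank", "multi_head_gating", "spatial_prior")
Config = Dict[str, bool]


def ablation_grid() -> List[Config]:
    """All on/off combinations of the three components (2^3 = 8 runs)."""
    return [
        dict(zip(COMPONENTS, flags))
        for flags in product((True, False), repeat=len(COMPONENTS))
    ]


def isolating_pair(component: str) -> Tuple[Config, Config]:
    """The full model and the variant that disables only `component`.

    Comparing these two runs attributes a score difference to that single
    component while everything else is held fixed.
    """
    if component not in COMPONENTS:
        raise ValueError(f"unknown component: {component}")
    full = {name: True for name in COMPONENTS}
    ablated = dict(full, **{component: False})
    return full, ablated


grid = ablation_grid()
full, no_bank = isolating_pair("prototype_bank")
```

Running the isolating pair on both UCCD and WHU-CDC is the minimum needed to answer the referee; the full grid additionally exposes interactions between components.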

Circularity Check

0 steps flagged

No circularity in PTNet derivation or UCCD benchmark claims

full rationale

The paper presents PTNet as an architectural proposal (a learnable prototype bank guiding cross-temporal interaction, multi-head gating for disentanglement, and detection-derived spatial priors) built from standard neural network components; none of its outputs is defined in terms of a fitted result or a prior claim. The claims rest on empirical outperformance on the newly introduced UCCD dataset (9,000 pairs) and the external WHU-CDC benchmark, and the paper shows no equations, self-citations, or uniqueness theorems that would reduce the reported gains to tautological inputs. The derivation chain is therefore self-contained and externally falsifiable via the released code and data.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the effectiveness of newly introduced components (prototype bank and gating) whose behavior is learned from data rather than derived from first principles; standard deep-learning assumptions about representation learning are invoked without additional justification.

free parameters (2)
  • prototype bank size
    The number of learnable prototypes is a hyperparameter that must be chosen and trained on the data to represent change semantics.
  • multi-head gating weights
    Gating parameters are learned during end-to-end training to disentangle task-specific representations.
axioms (2)
  • domain assumption: Paired bi-temporal images contain sufficient visual information to support both change localization and natural-language description.
    This is the foundational premise of the RSICC task and is invoked throughout the motivation and method description.
  • domain assumption: Neural networks trained with standard supervision can learn disentangled and semantically meaningful representations when guided by prototypes.
    Standard assumption underlying the prototype bank and multi-head gating design.
invented entities (1)
  • learnable prototype bank (no independent evidence)
    purpose: To explicitly represent and guide structured change semantics across time steps.
    New component introduced to move beyond implicit feature differencing.

pith-pipeline@v0.9.0 · 5509 in / 1748 out tokens · 85519 ms · 2026-05-08T17:52:13.596746+00:00 · methodology


Reference graph

Works this paper leans on

63 extracted references · 7 canonical work pages

  1. [1]

    Change cap- tioning: A new paradigm for multitemporal remote sensing image analysis

    Genc Hoxha, Saliha Chouaf, Farid Melgani, and Youcef Smara. Change cap- tioning: A new paradigm for multitemporal remote sensing image analysis. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2022

  2. [2]

    Change3d: Revisiting change detection and captioning from a video modeling perspective

    Duowang Zhu, Xiaohu Huang, Haiyan Huang, Hao Zhou, and Zhenfeng Shao. Change3d: Revisiting change detection and captioning from a video modeling perspective. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 24011–24022, 2025

  3. [3]

    Cd4c: Change detection for remote sensing image change captioning.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

    Xiliang Li, Bin Sun, Zhenhua Wu, Shutao Li, and Hu Guo. Cd4c: Change detection for remote sensing image change captioning.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2025

  4. [4]

    Pixel-level change detection pseudo-label learning for remote sensing change captioning

    Chenyang Liu, Keyan Chen, Zipeng Qi, Zili Liu, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi. Pixel-level change detection pseudo-label learning for remote sensing change captioning. InIGARSS 2024-2024 IEEE Interna- tional Geoscience and Remote Sensing Symposium, pages 8405–8408. IEEE, 2024

  5. [5]

    Change caption- ing for satellite images time series.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

    Wei Peng, Ping Jian, Zhuqing Mao, and Yingying Zhao. Change caption- ing for satellite images time series.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

  6. [6]

    Gomez, Łukasz Kaiser, and Illia Polosukhin

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems (NeurIPS), volume 30, pages 5998–6008, 2017

  7. [7]

    An im- age is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An im- age is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

  8. [8]

    Rsic-gmamba: A state space model with genetic operations for remote sensing image cap- tioning.IEEE Transactions on Geoscience and Remote Sensing, 2025

    Lingwu Meng, Jing Wang, Yan Huang, and Liang Xiao. Rsic-gmamba: A state space model with genetic operations for remote sensing image cap- tioning.IEEE Transactions on Geoscience and Remote Sensing, 2025

  9. [9]

    Mask approximation net: A novel diffusion model approach 16 Y

    Dongwei Sun, Jing Yao, Wu Xue, Changsheng Zhou, Pedram Ghamisi, and Xiangyong Cao. Mask approximation net: A novel diffusion model approach 16 Y. Gao et al. for remote sensing change captioning.IEEE transactions on geoscience and remote sensing, 2025

  10. [10]

    RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9):1477, 2024

    Bin Zhang, Shuting Zhao, Yuqi Liang, Jiaming Ye, Shuai Lu, and Jiawei Ma. RS-LLaVA: A large vision-language model for joint captioning and question answering in remote sensing imagery.Remote Sensing, 16(9):1477, 2024

  11. [11]

    Describing land cover changes via multi-temporal remote sensing image cap- tioning using llm, vit, and lora.Remote Sensing, 18(1):166, 2026

    Javier Lamar León, Vitor Nogueira, Pedro Salgueiro, and Paulo Quaresma. Describing land cover changes via multi-temporal remote sensing image cap- tioning using llm, vit, and lora.Remote Sensing, 18(1):166, 2026

  12. [12]

    Multi-task learning for dense prediction tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021

    Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3614–3633, 2021

  13. [14]

    Detection assisted change captioning for remote sensing image

    Xiliang Li, Bin Sun, and Shutao Li. Detection assisted change captioning for remote sensing image. InIGARSS 2024-2024 IEEE International Geo- science and Remote Sensing Symposium, pages 10454–10458. IEEE, 2024

  14. [15]

    Change-agent: Toward interactive comprehensive remote sens- ing change interpretation and analysis.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

    Chenyang Liu, Keyan Chen, Haotian Zhang, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi. Change-agent: Toward interactive comprehensive remote sens- ing change interpretation and analysis.IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024

  15. [16]

    Scnet: Lightweight spatial-channel attention network for remote sensing change captioning.IEEE Transactions on Geoscience and Remote Sensing, 2026

    Dongwei Sun, Yuduo Wang, Jing Yao, Weikang Yu, Xiangyong Cao, and Pedram Ghamisi. Scnet: Lightweight spatial-channel attention network for remote sensing change captioning.IEEE Transactions on Geoscience and Remote Sensing, 2026

  16. [17]

    Remote sensing spatiotemporal vision–language models: A comprehensive survey.IEEE Geoscience and Remote Sensing Magazine, 2025

    Chenyang Liu, Jiafan Zhang, Keyan Chen, Man Wang, Zhengxia Zou, and Zhenwei Shi. Remote sensing spatiotemporal vision–language models: A comprehensive survey.IEEE Geoscience and Remote Sensing Magazine, 2025

  17. [18]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the International Conference on Machine Learning (ICML), pages 8748–8763, 2021

  18. [19]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  19. [20]

    SNUNet-CD: A densely connected siamese network for change detection of VHR images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022

    Sheng Fang, Kaiyu Li, Jinyuan Shao, and Zhe Li. SNUNet-CD: A densely connected siamese network for change detection of VHR images.IEEE Geoscience and Remote Sensing Letters, 19:1–5, 2022

  20. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. Change Captioning for Urban Construction Monitoring 17

  21. [22]

    Fully con- volutional siamese networks for change detection

    Rodrigo Caye Daudt, Bertrand Le Saux, and Alexandre Boulch. Fully con- volutional siamese networks for change detection. InProceedings of the IEEE International Conference on Image Processing (ICIP), pages 4063– 4067, 2018

  22. [23]

    A spatial-temporal attention-based method and a new dataset for remote sensing image change detection.Remote Sensing, 12(10):1662, 2020

    Hao Chen and Zhenwei Shi. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection.Remote Sensing, 12(10):1662, 2020

  23. [24]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF International Con- ference on Computer Vision (ICCV), pages 10012–10022, 2021

  24. [25]

    Remote sensing change detection with transformers trained from scratch.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

    Mustansar Noman, Mustansar Fiaz, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Remote sensing change detection with transformers trained from scratch.IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024

  25. [26]

    Remote sensing image change detection with transformers.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021

    Hao Chen, Zipeng Qi, and Zhenwei Shi. Remote sensing image change detection with transformers.IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021

  26. [27]

    Wele Gedara Chaminda Bandara and Vishal M. Patel. A transformer- based siamese network for change detection. InProceedings of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 207–210, 2022

  27. [28]

    Intertemporalinteractionandsymmetricdifferencelearningforremotesens- ingimagechangecaptioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

    Yunpeng Li, Xiangrong Zhang, Xina Cheng, Puhua Chen, and Licheng Jiao. Intertemporalinteractionandsymmetricdifferencelearningforremotesens- ingimagechangecaptioning.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

  28. [30]

    Changes to captions: An attentive network for remote sensing change captioning.IEEE Transactions on Image Processing, 32:6047–6060, 2023

    Shizhen Chang and Pedram Ghamisi. Changes to captions: An attentive network for remote sensing change captioning.IEEE Transactions on Image Processing, 32:6047–6060, 2023

  29. [31]

    A decoupling paradigm with prompt learning for remote sensing image change captioning.IEEE Transactions on Geoscience and Remote Sensing, 61:1–18, 2023

    Chenyang Liu, Rui Zhao, Jianqi Chen, Zipeng Qi, Zhengxia Zou, and Zhen- wei Shi. A decoupling paradigm with prompt learning for remote sensing image change captioning.IEEE Transactions on Geoscience and Remote Sensing, 61:1–18, 2023

  30. [32]

    RSCaMa: Remote sensing image change captioning with state space model.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

    Chenyang Liu, Keyan Chen, Bowen Chen, Haotian Zhang, Zhengxia Zou, and Zhenwei Shi. RSCaMa: Remote sensing image change captioning with state space model.IEEE Geoscience and Remote Sensing Letters, 21:1–5, 2024

  31. [33]

    Remote sensing image change captioning using multi-attentive network with diffusion model.Remote Sensing, 16(21):4083, 2024

    Yunpeng Yang, Tingting Liu, Yonggang Pu, Lianming Liu, Qing Zhao, and Qian Wan. Remote sensing image change captioning using multi-attentive network with diffusion model.Remote Sensing, 16(21):4083, 2024

  32. [34]

    Semantic-CC: Boosting remote sensing image change cap- 18 Y

    Haoran Liu, Yibo Zhao, Yuan Jin, Keyan Li, Jiaqi Chen, Zhengxia Zou, and Zhenwei Shi. Semantic-CC: Boosting remote sensing image change cap- 18 Y. Gao et al. tioning via foundational knowledge and semantic guidance.arXiv preprint arXiv:2407.14032, 2024

  33. [35]

    Enhancing perception of key changes in remote sensing image change captioning.IEEE Transactions on Image Processing, 2025

    Cong Yang, Zuchao Li, Hongzan Jiao, Zhi Gao, and Lefei Zhang. Enhancing perception of key changes in remote sensing image change captioning.IEEE Transactions on Image Processing, 2025

  34. [36]

    Visual in- struction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual in- struction tuning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  35. [37]

    BLIP-2: Bootstrap- ping language-image pre-training with frozen image encoders and large lan- guage models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrap- ping language-image pre-training with frozen image encoders and large lan- guage models. InProceedings of the International Conference on Machine Learning (ICML), pages 19730–19742, 2023

  36. [38]

    Advancing plain vision transformer towards remote sensing foundation model.IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023

    Di Wang, Qiming Zhang, Yanxing Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. Advancing plain vision transformer towards remote sensing foundation model.IEEE Transactions on Geoscience and Remote Sensing, 61:1–15, 2023

  37. [39]

    RSVQA: Vi- sual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020

    Sylvain Lobry, Diego Marcos, Jesse Murray, and Devis Tuia. RSVQA: Vi- sual question answering for remote sensing data.IEEE Transactions on Geoscience and Remote Sensing, 58(12):8555–8566, 2020

  38. [40]

    GeoChat: Grounded large vision-language model for remote sensing

    Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Khan, Salman Khan, and Fahad Shahbaz Khan. GeoChat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 27831–27840, 2024

  39. [41]

    anything-to-image

    Yunpeng Wang, Wenbo Li, Jian Gong, Michael Kopp, and Devis Tuia. EarthVQA: Towards queryable earth via relational reasoning-based remote sensing visual question answering.arXiv preprint arXiv:2312.12222, 2023

  40. [42]

    ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning

    Pei Deng, Wenqian Zhou, and Hanlin Wu. ChangeChat: An interactive model for remote sensing change analysis via multimodal instruction tuning. arXiv preprint arXiv:2409.08582, 2025

  41. [43]

    arXiv preprint arXiv:2409.16261

    Mustansar Noman, Noor Ahsan, Muzammal Naseer, Hisham Cholakkal, RaoMuhammadAnwer,SalmanKhan,andFahadShahbazKhan. CDChat: A large multimodal model for remote sensing change description.arXiv preprint arXiv:2409.16261, 2024

  42. [44]

    BTCChat: Advancing remote sensing bi -temporal change captioning with multimodal large language model,

    Yujie Li et al. BTCChat: Advancing remote sensing bi-temporal change captioning with multimodal large language model.arXiv preprint arXiv:2509.05895, 2025

  43. [45]

    arXiv preprint arXiv:2410.10047 (2024)

    Yuchao Wang, Wele Gedara Chaminda Yu, Michael Kopp, and Devis Tuia. ChangeMinds: Multi-task framework for detecting and describing changes in remote sensing.arXiv preprint arXiv:2410.10047, 2024

  44. [46]

    Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Represen- tations (ICLR), 2022

  45. [47]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariha- ran, and Serge Belongie. Feature pyramid networks for object detection. Change Captioning for Urban Construction Monitoring 19 InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2125, 2017

  46. [48]

    Rbfim: Perceptual quality assessment for compressed point clouds usingradialbasisfunction interpolation.IEEE Transactions on Multimedia, 27:8579–8591, 2025

    Zhang Chen, Shuai Wan, Siyu Ren, Fuzheng Yang, Mengting Yu, and Jun- hui Hou. Rbfim: Perceptual quality assessment for compressed point clouds usingradialbasisfunction interpolation.IEEE Transactions on Multimedia, 27:8579–8591, 2025

  47. [49]

    Shikun Liu, Edward Johns, and Andrew J. Davison. End-to-end multi- task learning with attention. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1871–1880, 2019

  48. [50]

    Lightglue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023

  49. [51]

    Kunping Yang, Jianchong Wei, Chengbin Chen, Zhensheng Wang, Junhui Lan, Xuanping Li, Duwei Hua, Dingli Xue, and Yi Wu. Restricted super- vised cascade information network for remote sensing change captioning with serial sentences.International Journal of Applied Earth Observation and Geoinformation, 142:104686, 2025

  50. [52]

    Jingye Shi, Mengge Zhang, Yuewu Hou, Ruicong Zhi, and Jiqiang Liu. A multitask network and two large-scale datasets for change detection and captioning in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–17, 2024

  51. [53]

    Ali Can Karaca, Enes Ozelbas, Saadettin Berber, Orkhan Karimli, Turabi Yildirim, and Mehmet Fatih Amasyali. Robust change captioning in remote sensing: SECOND-CC dataset and MModalCC framework. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 18:21494–21513, 2025

  52. [54]

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computational Linguistics

  53. [55]

    Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics

  54. [56]

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

  56. [58]

    Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015

  57. [59]

    Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang. IoU loss for 2D/3D object detection. In 2019 International Conference on 3D Vision (3DV), pages 85–94. IEEE, 2019

  58. [60]

    Nafiseh Ghasemian Sorboni, Jinfei Wang, and Mohammad Reza Najafi. Fusion of google street view, lidar, and orthophoto classifications using ranking classes based on f1 score for building land-use type detection. Remote Sensing, 16(11):2011, 2024

  59. [61]

    Chao-Peng Chen, Jun-Wei Hsieh, Ping-Yang Chen, Yi-Kuan Hsieh, and Bor-Shiun Wang. Saras-net: Scale and relation aware siamese network for change detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 14187–14195, 2023

  60. [62]

    Yue Qiu, Shintaro Yamamoto, Kazutoshi Nakashima, Ryota Suzuki, Kenji Iwata, Hirokatsu Kataoka, and Yutaka Satoh. Describing and localizing multiple changes with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1951–1960, 2021

  61. [63]

    Chenyang Liu, Rui Zhao, Hao Chen, Zhengxia Zou, and Zhenwei Shi. Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset. IEEE Transactions on Geoscience and Remote Sensing, 60:1–20, 2022

  62. [64]

    Chenyang Liu, Jiajun Yang, Zipeng Qi, Zhengxia Zou, and Zhenwei Shi. Progressive scale-aware network for remote sensing image change captioning. In IGARSS 2023 - 2023 IEEE International Geoscience and Remote Sensing Symposium, pages 6668–6671. IEEE, 2023

  63. [65]

    Xiaofei Yu, Yitong Li, Jie Ma, Chang Li, and Hanlin Wu. Diffusion-RSCC: Diffusion probabilistic model for change captioning in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 2025