STAND: Semantic Anchoring Constraint with Dual-Granularity Disambiguation for Remote Sensing Image Change Captioning
Pith reviewed 2026-05-08 08:38 UTC · model grok-4.3
The pith
STAND resolves ambiguities in viewpoint, scale, and prior knowledge for remote sensing image change captioning through temporal regularization, dual-granularity disambiguation, and semantic anchoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that captions describing changes between paired remote sensing images become more precise when ambiguity is resolved in stages: first, an interpretable temporal constraint regularizes the encoder's representations; next, a dual-granularity disambiguation module resolves spatial uncertainty, pairing macro-level global context aggregation against viewpoint confusion with micro-level frequency-refocused attention for small-object scale; finally, semantic concept anchoring injects language priors during decoding.
What carries the argument
The STAND architecture, which integrates an interpretable temporal constraint for feature regularization, a dual-granularity disambiguation module, and a semantic concept anchoring module that leverages language priors.
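To make the three-stage flow concrete, here is a purely illustrative NumPy sketch of the pipeline's shape. Every function name, mechanism, and tensor size below is an assumption for exposition, not the paper's formulation: the "temporal constraint" is mimicked by residual purification, macro aggregation by global average pooling, micro refocusing by high-pass spectral weighting, and semantic anchoring by projection onto toy concept embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def temporal_regularize(feat_a, feat_b):
    """Toy stand-in for the interpretable temporal constraint: remove the
    shared (unchanged) component of the two temporal features, keeping
    only their residuals as change evidence."""
    shared = 0.5 * (feat_a + feat_b)
    return feat_a - shared, feat_b - shared  # purified change residuals

def macro_context(feat):
    """Macro-level disambiguation (assumed form): global average pooling
    over the spatial grid, broadcast back so every location sees
    scene-level context."""
    return feat + feat.mean(axis=(0, 1), keepdims=True)

def micro_frequency_refocus(feat, keep=0.5):
    """Micro-level disambiguation (assumed form): damp low frequencies in
    the 2-D spectrum so small, high-frequency structures are emphasized."""
    spec = np.fft.fftshift(np.fft.fft2(feat, axes=(0, 1)), axes=(0, 1))
    h, w, _ = feat.shape
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    mask = (dist >= keep * min(h, w) / 2).astype(float)[..., None]
    spec *= 0.5 + 0.5 * mask  # halve low-frequency energy, keep high
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(0, 1)), axes=(0, 1))
    return out.real

def anchor_semantics(feat, concept_bank):
    """Semantic anchoring (assumed form): project pooled visual evidence
    onto a bank of language concept embeddings, return soft scores."""
    pooled = feat.mean(axis=(0, 1))
    logits = concept_bank @ pooled
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Two 8x8 "images" with 4 channels; image B adds one small bright object.
a = rng.normal(size=(8, 8, 4))
b = a.copy()
b[2, 3] += 3.0
ra, rb = temporal_regularize(a, b)
diff = micro_frequency_refocus(macro_context(rb - ra))
concepts = rng.normal(size=(5, 4))  # 5 toy "language prior" concepts
probs = anchor_semantics(diff, concepts)
```

The point of the sketch is only the ordering the claim depends on: disambiguation operates on purified residuals, and anchoring operates on disambiguated features, so each stage's output is the next stage's input.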
If this is right
- Purified temporal features from the constraint serve as a stable base for the disambiguation steps that follow.
- Macro-level context aggregation reduces viewpoint-induced confusion while micro-level attention improves detection of small-scale changes.
- Language-based semantic anchoring during decoding lowers errors caused by missing prior knowledge.
- The overall pipeline yields higher caption accuracy than prior RSICC methods on benchmark datasets.
Where Pith is reading between the lines
- The modular separation of temporal regularization from spatial disambiguation could be reused in other paired-image analysis tasks such as medical change detection.
- Frequency-refocused attention may inspire similar micro-scale enhancements in low-resolution vision models outside remote sensing.
- The explicit use of language priors for anchoring suggests the method could be extended to multi-temporal sequences rather than image pairs alone.
Load-bearing premise
That the three modules can be combined without introducing new inconsistencies, and that they directly address the primary sources of error in existing remote sensing image change captioning models.
What would settle it
An ablation experiment on standard RSICC benchmarks in which removing the dual-granularity disambiguation module produces no measurable drop in caption quality metrics on image pairs that contain large viewpoint shifts or small-object scale changes.
Original abstract
Remote sensing image change captioning (RSICC) aims to describe the difference between two remote sensing images. While recent methods have explored video modeling, they largely overlook the inherent ambiguities in viewpoint, scale, and prior knowledge, lacking effective constraints on the encoder. In this paper, we present STAND, a Semantic Anchoring Constraint with Dual-Granularity Disambiguation for RSICC, to progressively resolve these ambiguities. Specifically, to establish a reliable feature foundation, we first introduce an interpretable constraint to regularize temporal representations. Operating on these purified features, a dual-granularity disambiguation module resolves spatial uncertainties by coupling macro-level global context aggregation for viewpoint confusion with micro-level frequency-refocused attention for small-object scale enhancement. Ultimately, to translate these visually disambiguated features into precise text, a semantic concept anchoring module leverages language categorical priors to tackle knowledge ambiguity during decoding. Extensive experiments verify the superiority of STAND and its effectiveness in addressing ambiguities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes STAND, a framework for remote sensing image change captioning (RSICC) that progressively resolves ambiguities in viewpoint, scale, and prior knowledge. It introduces an interpretable temporal constraint to regularize temporal representations in the encoder, a dual-granularity disambiguation module that couples macro-level global context aggregation with micro-level frequency-refocused attention, and a semantic concept anchoring module that leverages language categorical priors during decoding. The central claim is that these components, when combined, purify features and anchor semantics to yield superior captioning performance, as verified by experiments and ablations on standard RSICC benchmarks.
Significance. If the results hold, this work offers a meaningful contribution by providing an interpretable, modular pipeline for addressing common error sources in RSICC. Explicit formulations (temporal regularization loss, macro-micro fusion, concept-embedding anchoring) and ablation tables demonstrating additive gains without regressions are strengths that support the design's internal consistency and direct targeting of the stated ambiguities.
Minor comments (3)
- [Abstract] While the abstract summarizes the modules and claims superiority, it would help readers to briefly note the specific datasets (e.g., LEVIR-CC or similar) and primary metrics used in the experiments.
- [§3.2, Dual-Granularity Disambiguation] The fusion step between macro global aggregation and micro frequency-refocused attention is described at a high level; adding an explicit equation for the combined feature representation would improve clarity and reproducibility.
- [Ablation tables] The reported gains are additive, but including standard deviations across multiple runs or seeds would strengthen confidence that the improvements are not due to random variation.
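The third comment can be made concrete with a minimal sketch of seed-variance reporting. All scores below are hypothetical placeholders, not the paper's numbers; the only point is that an ablation gain is convincing when the difference in means clearly exceeds the run-to-run spread:

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation of a caption metric
    (e.g., CIDEr) across independently seeded training runs."""
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, sd

# Hypothetical CIDEr scores from five seeds: full model vs. an ablation.
full = [136.2, 135.8, 136.9, 136.1, 135.5]
ablated = [134.0, 134.6, 133.7, 134.9, 134.2]
m_full, s_full = summarize_runs(full)
m_abl, s_abl = summarize_runs(ablated)
gain = m_full - m_abl  # credible only if gain >> s_full, s_abl
```

Reporting tables as "mean ± sd over N seeds" is the standard remedy the referee is asking for.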
Simulated Author's Rebuttal
We thank the referee for the positive and detailed summary of our STAND framework for remote sensing image change captioning, and for recognizing its contributions in addressing viewpoint, scale, and knowledge ambiguities through interpretable constraints and modular components. The recommendation for minor revision is noted, as is the acknowledgment of the paper's internal consistency via explicit formulations and ablation studies. Since the report enumerates no major objections, we have no substantive points to contest; we will incorporate the three minor clarifications (dataset and metric details in the abstract, an explicit fusion equation in §3.2, and seed variance in the ablation tables) in the revised version.
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces STAND as an engineering architecture with three explicitly formulated modules (temporal regularization loss, dual-granularity spatial disambiguation via macro aggregation + micro attention, and semantic concept anchoring) motivated by listed ambiguities. Each component is defined with independent equations and fusion mechanisms, then validated via additive ablation gains on standard RSICC benchmarks. No step reduces a claimed prediction to a fitted parameter by construction, no self-definitional loop exists in the module definitions, and no load-bearing self-citation or uniqueness theorem is invoked to force the design. The chain is self-contained with external empirical support.
Axiom & Free-Parameter Ledger
Invented entities (2)
- Dual-Granularity Disambiguation module: no independent evidence
- Semantic concept anchoring module: no independent evidence
Reference graph
Works this paper leans on
[1] Bai, Q., Wang, X.: Cross-temporal remote sensing image change captioning: A manifold mapping and bayesian diffusion approach for land use monitoring. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2025)
[2] Banerjee, S., Lavie, A.: METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. pp. 65–72 (2005)
[3] Cao, X., Dong, W., Qu, J., Li, Y.: RMNet: Dual-dimensional difference recalibration-guided CNN-VMamba synergistic network for remote sensing image change captioning. IEEE Transactions on Geoscience and Remote Sensing (2026)
[4] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)
[5] Chang, S., Ghamisi, P.: Changes to captions: An attentive network for remote sensing change captioning. IEEE Transactions on Image Processing 32, 6047–6060 (2023)
[6] Chen, C., Wang, Y., Yap, K.H.: Multi-scale attentive fusion network for remote sensing image change captioning. In: 2024 IEEE International Symposium on Circuits and Systems (ISCAS). pp. 1–5. IEEE (2024)
[7] Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., Barnard, K.: Attentional feature fusion. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 3560–3569 (2021)
[8] Feichtenhofer, C.: X3D: Expanding architectures for efficient video recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 203–213 (2020)
[9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[10] Hosseinzadeh, M., Wang, Y.: Image change captioning by learning from an auxiliary task. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2725–2734 (2021)
[11] Karaca, A.C., Ozelbas, E., Berber, S., Karimli, O., Yildirim, T., Amasyali, M.F.: Robust change captioning in remote sensing: SECOND-CC dataset and MModalCC framework. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2025)
[12] Kim, H., Kim, J., Lee, H., Park, H., Kim, G.: Agnostic change captioning with cycle consistency. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2095–2104 (2021)
[13] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
[14] Li, R., Li, L., Zhang, J., Zhao, Q., Wang, H., Yan, C.: Region-aware difference distilling with attribute-guided contrastive regularization for change captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4887–4895 (2025)
[15] Li, X., Sun, B., Wu, Z., Li, S., Guo, H.: CD4C: Change detection for remote sensing image change captioning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2025)
[16] Li, Y., Zhang, X., Cheng, X., Chen, P., Jiao, L.: Inter-temporal interaction and symmetric difference learning for remote sensing image change captioning. IEEE Transactions on Geoscience and Remote Sensing (2024)
[17] Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out. pp. 74–81 (2004)
[18] Liu, C., Yang, J., Qi, Z., Zou, Z., Shi, Z.: Progressive scale-aware network for remote sensing image change captioning. In: IGARSS 2023 IEEE International Geoscience and Remote Sensing Symposium. pp. 6668–6671. IEEE (2023)
[19] Liu, C., Zhang, J., Chen, K., Wang, M., Zou, Z., Shi, Z.: Remote sensing spatiotemporal vision-language models: A comprehensive survey. IEEE Geoscience and Remote Sensing Magazine (2025)
[20] Liu, C., Zhao, R., Chen, H., Zou, Z., Shi, Z.: Remote sensing image change captioning with dual-branch transformers: A new method and a large scale dataset. IEEE Transactions on Geoscience and Remote Sensing 60, 1–20 (2022)
[21] Liu, C., Zhao, R., Chen, J., Qi, Z., Zou, Z., Shi, Z.: A decoupling paradigm with prompt learning for remote sensing image change captioning. IEEE Transactions on Geoscience and Remote Sensing 61, 1–18 (2023)
[22] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023)
[23] Liu, R., Li, K., Song, J., Sun, D., Cao, X.: MV-CC: Mask enhanced video model for remote sensing change caption. arXiv preprint arXiv:2410.23946 (2024)
[24] Liu, Z., Ning, J., Cao, Y., Wei, Y., Zhang, Z., Lin, S., Hu, H.: Video Swin Transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3202–3211 (2022)
[25] Lv, F., Wang, R., Jing, L.: Revisiting change captioning from self-supervised global-part alignment. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 5892–5900 (2025)
[26] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018)
[27] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
[28] Park, D.H., Darrell, T., Rohrbach, A.: Robust change captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4624–4633 (2019)
[29] Perez, E., Strub, F., De Vries, H., Dumoulin, V., Courville, A.: FiLM: Visual reasoning with a general conditioning layer. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
[30] Qin, P., Liu, J., Li, L., Tian, C., Cao, X.: DGAT: Dynamic gaussian attenuate transformer for remote sensing image change captioning. IEEE Transactions on Geoscience and Remote Sensing 63, 1–16 (2025)
[31] Shi, J., Zhang, M., Hou, Y., Zhi, R., Liu, J.: A multi-task network and two large scale datasets for change detection and captioning in remote sensing images. IEEE Transactions on Geoscience and Remote Sensing (2024)
[32] Shi, X., Yang, X., Gu, J., Joty, S., Cai, J.: Finding it at another side: A viewpoint-adapted matching encoder for change captioning. In: European Conference on Computer Vision. pp. 574–590. Springer (2020)
[33] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. Advances in Neural Information Processing Systems 27 (2014)
[34] Sun, D., Wang, Y., Yao, J., Yu, W., Cao, X., Ghamisi, P.: SCNet: Lightweight spatial-channel attention network for remote sensing change captioning. IEEE Transactions on Geoscience and Remote Sensing (2026)
[35] Sun, D., Yao, J., Xue, W., Zhou, C., Ghamisi, P., Cao, X.: Mask approximation net: A novel diffusion model approach for remote sensing change captioning. IEEE Transactions on Geoscience and Remote Sensing (2025)
[36] Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., et al.: Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology (2025)
[37] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4489–4497 (2015)
[38] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6450–6459 (2018)
[39] Tu, Y., Li, L., Su, L., Du, J., Lu, K., Huang, Q.: Adaptive representation disentanglement network for change captioning. IEEE Transactions on Image Processing 32, 2620–2635 (2023)
[40] Tu, Y., Li, L., Su, L., Zha, Z.J., Huang, Q.: SMART: Syntax-calibrated multi-aspect relation transformer for change captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(7), 4926–4943 (2024)
[41] Tu, Y., Li, L., Su, L., Zha, Z.J., Yan, C., Huang, Q.: Self-supervised cross-view representation reconstruction for change captioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2805–2815 (2023)
[42] Tu, Y., Li, L., Su, L., Zha, Z.J., Yan, C., Huang, Q.: Context-aware difference distilling for multi-change captioning. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7941–7956 (2024)
[43] Tu, Y., Yao, T., Li, L., Lou, J., Gao, S., Yu, Z., Yan, C.: Semantic relation-aware difference representation learning for change captioning. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 63–73 (2021)
[44] Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: Consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4566–4575 (2015)
[45] Wang, H., Zhang, B., Gong, Y., Fang, S., Qi, Z., Xu, Y., Liu, X., Zhang, W.: AIFind: Artifact-aware interpreting fine-grained alignment for incremental face forgery detection (2026), https://arxiv.org/abs/2604.16207
[46] Wang, Y., Yu, W., Kopp, M., Ghamisi, P.: ChangeMinds: Multi-task framework for detecting and describing changes in remote sensing. arXiv preprint arXiv:2410.10047 (2024)
[47] Wu, R., Ye, H., Liu, X., Li, Z., Sun, C., Wu, J.: A cross-spatial differential localization network for remote sensing change captioning. Remote Sensing 17(13), 2285 (2025)
[48] Xian, T., Zhou, Z., Zhou, W., Zeng, D., Li, B.: Cross-view and multi-step interaction for change captioning. IEEE Transactions on Multimedia (2026)
[49] Xiao, Y., Xu, T., Xin, Y., Li, J.: FBRT-YOLO: Faster and better for real-time aerial image detection. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 8673–8681 (2025)
[50] Yang, C., Li, Z., Jiao, H., Gao, Z., Zhang, L.: Enhancing perception of key changes in remote sensing image change captioning. IEEE Transactions on Image Processing (2025)
[51] Yang, Y., Liu, T., Pu, Y., Liu, L., Zhao, Q., Wan, Q.: Remote sensing image change captioning using multi-attentive network with diffusion model. Remote Sensing 16(21), 4083 (2024)
[52] Yue, S., Tu, Y., Li, L., Yang, Y., Gao, S., Yu, Z.: I3N: Intra- and inter-representation interaction network for change captioning. IEEE Transactions on Multimedia 25, 8828–8841 (2023)
[53] Zhong, G., Hu, J., Chen, J., Yuan, J., Pan, W.: DECIDER: Difference-aware contrastive diffusion model with adversarial perturbations for image change captioning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 10662–10670 (2025)
[54] Zhou, Q., Gao, J., Yuan, Y., Wang, Q.: Single-stream extractor network with contrastive pre-training for remote-sensing change captioning. IEEE Transactions on Geoscience and Remote Sensing 62, 1–14 (2024)
[55] Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592 (2023)
[56] Zhu, D., Huang, X., Huang, H., Zhou, H., Shao, Z.: Change3D: Revisiting change detection and captioning from a video modeling perspective. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24011–24022 (2025)
[57] Zou, S., Wei, Y., Xie, Y., Luan, X.: Frequency-spatial-temporal domain fusion network for remote sensing image change captioning. Remote Sensing 17(8), 1463 (2025)