MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning

Raksha Sharma; Swadhin Das

arXiv: 2502.09282 · v4 · submitted 2025-02-13 · cs.CV · cs.HC · cs.LG


Pith reviewed 2026-05-23 03:19 UTC · model grok-4.3

classification: cs.CV · cs.HC · cs.LG

keywords: remote sensing image captioning · multi-stream encoder-decoder · feature fusion · stacked GRU · image captioning · satellite imagery · encoder-decoder architecture


The pith

Fusing two complementary image encoders plus a stacked GRU decoder with element-wise aggregation improves remote sensing image captioning over single-stream baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MsEdF, a multi-stream encoder-decoder model for turning remote sensing images into descriptive text. It argues that single-stream designs fail to capture enough feature diversity or semantic context in complex satellite scenes. The encoder side integrates outputs from two different image encoders to combine multiscale and structural cues. The decoder side uses stacked GRUs plus element-wise aggregation to refine sequence modeling. Tests on three benchmark datasets show the resulting captions outperform several existing methods.

Core claim

The Multi-stream Encoder-decoder Framework (MsEdF) improves remote sensing image captioning (RSIC) performance by fusing information from two complementary image encoders to increase feature diversity and by refining semantic modeling on the decoder side with a stacked GRU architecture that applies an element-wise aggregation scheme.

What carries the argument

MsEdF, whose encoder fuses the outputs of two complementary image encoders and whose decoder applies a stacked GRU with element-wise aggregation to the input sequence.
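The data flow named here can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the abstract does not say how the two encoder outputs are fused or which element-wise operation aggregates the stacked GRU states, so concatenation and an element-wise sum are assumed. The names `fuse`, `GRUCell`, and `decode_states`, the toy dimensions, and the random weights are all hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Textbook GRU cell; biases are omitted to keep the sketch short."""

    def __init__(self, in_dim, hid_dim, rng):
        self.Wz, self.Uz = rand_matrix(hid_dim, in_dim, rng), rand_matrix(hid_dim, hid_dim, rng)
        self.Wr, self.Ur = rand_matrix(hid_dim, in_dim, rng), rand_matrix(hid_dim, hid_dim, rng)
        self.Wn, self.Un = rand_matrix(hid_dim, in_dim, rng), rand_matrix(hid_dim, hid_dim, rng)

    def step(self, x, h):
        # Update gate z and reset gate r.
        z = [sigmoid(a + b) for a, b in zip(matvec(self.Wz, x), matvec(self.Uz, h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.Wr, x), matvec(self.Ur, h))]
        # Candidate state computed from the input and the reset-gated previous state.
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.Wn, x), matvec(self.Un, rh))]
        # New hidden state: interpolate between candidate and previous state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def fuse(feat_a, feat_b):
    """ASSUMPTION: fuse the two encoder outputs by concatenation."""
    return feat_a + feat_b


def decode_states(word_embeddings, fused, cells, hid_dim):
    """Run a stacked GRU over the caption prefix and aggregate layer states."""
    hidden = [[0.0] * hid_dim for _ in cells]
    aggregated = []
    for emb in word_embeddings:
        # ASSUMPTION: every step sees the word embedding plus the fused image features.
        x = emb + fused
        for i, cell in enumerate(cells):
            hidden[i] = cell.step(x, hidden[i])
            x = hidden[i]  # the next layer consumes this layer's state
        # ASSUMPTION: element-wise aggregation is a sum over the layer states.
        aggregated.append([sum(vals) for vals in zip(*hidden)])
    return aggregated


rng = random.Random(0)
feat_a = [0.2, -0.1, 0.4, 0.0]   # toy output of the first encoder
feat_b = [0.5, 0.3, -0.2]        # toy output of the second encoder
embeddings = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
hid_dim = 5

fused = fuse(feat_a, feat_b)
in_dim = len(embeddings[0]) + len(fused)
cells = [GRUCell(in_dim, hid_dim, rng), GRUCell(hid_dim, hid_dim, rng)]
states = decode_states(embeddings, fused, cells, hid_dim)
print(len(states), len(states[0]))
```

In a full captioning model each aggregated state would pass through a linear layer and a softmax over the vocabulary. That step is left out because the page gives no detail about it beyond the Figure 1 caption.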

If this is right

Remote sensing images with high intraclass similarity or contextual ambiguity receive more accurate descriptions.
Single-stream encoder-decoder models are less effective for capturing diverse spatial features in satellite imagery.
Context-aware text generation benefits from the stacked GRU and element-wise aggregation on the decoder side.
Performance gains appear across multiple standard RSIC benchmark datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar fusion of complementary encoders could be tested in other visual domains that contain ambiguous or fine-grained scenes.
The approach suggests a general route for increasing feature diversity without adding entirely new network families.
Further gains might come from exploring more than two encoder streams or alternative aggregation methods inside the decoder.

Load-bearing premise

That the specific choice of two complementary encoders and the stacked GRU aggregation scheme produces the observed gains rather than other factors such as training details or dataset properties.

What would settle it

Re-running the three-dataset experiments after replacing the dual-encoder fusion with a single encoder or the stacked GRU with a standard single-layer decoder and finding no consistent improvement.

Figures

Figures reproduced from arXiv: 2502.09282 by Raksha Sharma, Swadhin Das.

**Figure 1.** Architecture of the Proposed Model [PITH_FULL_IMAGE:figures/full_fig_p004_1.png]

**Figure 2.** Illustrative Example of Encoder Fusion in the Proposed Model [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Examples of RS Image Captioning by Different Methods [PITH_FULL_IMAGE:figures/full_fig_p014_3.png]

Abstract

Remote sensing images contain complex spatial patterns and semantic structures, which make them difficult for a captioning model to describe accurately. Encoder-decoder architectures have become a widely used approach for remote sensing image captioning (RSIC), translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model's ability to describe the image accurately. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF), which improves RSIC performance by optimizing both the spatial representation and the language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MsEdF is a standard two-encoder plus stacked-GRU tweak for remote sensing captioning whose abstract states outperformance without metrics, ablations, or controls.


The main thing here is a multi-stream encoder-decoder that fuses outputs from two complementary image encoders and feeds them into a stacked GRU decoder with element-wise aggregation. The abstract positions this as a fix for single-stream models that miss diverse spatial features or context in remote sensing scenes with high intraclass similarity. That motivation is clear enough for the domain. The architecture choices themselves are incremental rather than radical, but they target a real pain point in RSIC where single encoders often fall short on multiscale cues. If the full paper includes matched training runs and shows the fusion step actually moves the needle, the details could be useful to people already working on satellite imagery captioning.

The soft spot is the results. The abstract asserts that MsEdF beats several baselines on three benchmark datasets, yet supplies no numbers, no error bars, no ablation variants (single encoder, non-stacked decoder), and no mention of hyperparameter matching or statistical tests. Without those, any reported gains could come from uncontrolled factors rather than the claimed components. The stress-test concern lands directly on the abstract as written.

The paper is aimed at the narrow RSIC community. A specialist already following encoder-decoder variants in that subfield might extract the fusion and aggregation scheme for their own experiments. It is coherent on its own terms and shows honest engagement with the stated limitations of prior single-stream work, so it clears the bar for a serious referee to examine the full experiments and tables.

Referee Report

2 major / 1 minor

Summary. The paper proposes a Multi-stream Encoder-decoder Framework (MsEdF) for remote sensing image captioning (RSIC). It fuses information from two complementary image encoders to promote feature diversity and refines the decoder with a stacked GRU architecture using element-wise aggregation to improve context-aware descriptions. The central claim is that experiments on three benchmark RSIC datasets show MsEdF outperforms several baseline models.

Significance. If the outperformance holds under controlled conditions with ablations and statistical validation, the framework could advance RSIC by addressing limitations of single-stream models in handling complex spatial patterns and semantic ambiguity.

major comments (2)

[Abstract] The claim that 'Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models' supplies no quantitative metrics, error bars, ablation studies, dataset details, or baseline specifications, which is load-bearing for the central experimental claim.
[Experiments section] No ablation results are provided to isolate the contribution of the two-encoder fusion (versus single-encoder) or the stacked GRU with element-wise aggregation (versus standard decoder), so performance differences cannot be attributed to the proposed components rather than uncontrolled factors such as hyper-parameters or training schedules.

minor comments (1)

[Abstract] Standard RSIC evaluation metrics (e.g., BLEU, METEOR, CIDEr) and dataset names are not mentioned, reducing clarity even at the summary level.
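For readers unfamiliar with the metrics named in this comment, the sketch below shows the clipped (modified) n-gram precision that BLEU is built on. It is only the precision component: full BLEU also combines several n-gram orders with a geometric mean and applies a brevity penalty. The captions are invented for illustration and do not come from the paper or its datasets, and `clipped_precision` is a hypothetical helper name.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n):
    """Modified n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in the single best-matching reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Invented example captions (not taken from any RSIC dataset).
candidate = "many planes are parked near the terminal".split()
references = [
    "several planes are parked near a terminal".split(),
    "planes are parked at the airport".split(),
]

p1 = clipped_precision(candidate, references, 1)  # 6 of 7 unigrams match
p2 = clipped_precision(candidate, references, 2)  # 3 of 6 bigrams match
print(round(p1, 3), round(p2, 3))
```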

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below and commit to revisions that directly strengthen the experimental claims.


Referee: [Abstract] The claim that 'Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models' supplies no quantitative metrics, error bars, ablation studies, dataset details, or baseline specifications, which is load-bearing for the central experimental claim.

Authors: We agree that the abstract should be more informative. In the revision we will replace the generic claim with concrete metrics (e.g., BLEU-4, METEOR and CIDEr deltas on the three datasets), name the baselines, and briefly note the datasets and evaluation protocol.
Revision: yes

Referee: [Experiments section] No ablation results are provided to isolate the contribution of the two-encoder fusion (versus single-encoder) or the stacked GRU with element-wise aggregation (versus standard decoder), so performance differences cannot be attributed to the proposed components rather than uncontrolled factors such as hyper-parameters or training schedules.

Authors: We accept this criticism. The revised manuscript will include new ablation tables that compare (i) the dual-encoder fusion against each single encoder and (ii) the stacked GRU with element-wise aggregation against a standard single-layer GRU, while keeping all other hyperparameters and training schedules fixed. These results will be reported with the same evaluation metrics used in the main experiments.
Revision: yes
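The ablation plan in this response amounts to a small set of run configurations that differ from the full model in exactly one component. The sketch below is one hedged way to enumerate them so that the shared settings cannot drift between runs. `RunConfig`, the encoder names `encoder_a` and `encoder_b`, and the learning rate, epoch count, and seed are hypothetical placeholders, not values from the paper.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    encoders: tuple              # which image encoders feed the fusion step
    decoder_layers: int          # number of stacked GRU layers
    aggregate: bool              # whether layer states are aggregated element-wise
    learning_rate: float = 1e-4  # placeholder value, held fixed across runs
    epochs: int = 50             # placeholder value, held fixed across runs
    seed: int = 0                # placeholder value, held fixed across runs


# Full model: both encoders plus stacked GRU with aggregation.
FULL = RunConfig(encoders=("encoder_a", "encoder_b"), decoder_layers=2, aggregate=True)


def ablation_variants(full):
    """Return the full model plus one variant per removed component."""
    variants = {"full": full}
    # (i) dual-encoder fusion versus each single encoder
    for name in full.encoders:
        variants[f"single_{name}"] = replace(full, encoders=(name,))
    # (ii) stacked GRU with aggregation versus a single-layer GRU
    variants["single_layer_gru"] = replace(full, decoder_layers=1, aggregate=False)
    return variants


variants = ablation_variants(FULL)
for name, cfg in variants.items():
    print(name, cfg)
```

Because `replace` copies every field it is not told to change, each variant inherits the same learning rate, epoch count, and seed as the full model, which is the control the referee asks for.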

Circularity Check

0 steps flagged

No circularity; empirical outperformance claim rests on experimental comparison, not derivation


The paper proposes an architecture (two complementary encoders fused, stacked GRU decoder with element-wise aggregation) and reports that it outperforms baselines on three RSIC datasets. No equations, fitted parameters renamed as predictions, uniqueness theorems, or self-citation chains appear in the provided text. The central claim is an empirical result from running the model, not a mathematical reduction that equals its inputs by construction. This is a standard experimental ML paper whose validity hinges on controls and ablations (addressed elsewhere) rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "The encoder fuses information from two complementary image encoders... stacked GRU architecture with an element-wise aggregation scheme."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction
eess.IV · 2026-05 · unverdicted · novelty 4.0

A text-guided framework for remote sensing image transmission uses low-res images and compact text to reduce data volume to 2%, with text-conditioned reconstruction achieving PSNRs of 16.36-27.41 dB on tested datasets.

JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning
cs.CV · 2026-04 · unverdicted · novelty 4.0

JSSFF improves remote sensing image captioning by fusing structural edge details with semantic features in an encoder-decoder model and using fairness-based beam search, outperforming baselines on quantitative and qua...

