MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning

2025 · cs.CV · arXiv 2502.09282

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Remote sensing images contain complex spatial patterns and semantic structures, which makes them difficult for captioning models to describe accurately. Encoder-decoder architectures have become the most widely used approach to remote sensing image captioning (RSIC), translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model's ability to describe the image accurately. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF), which improves RSIC performance by optimizing both the spatial representation and the language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To better capture context-aware descriptions, we refine the semantic modeling of the input sequence on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
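
The two ideas named in the abstract, fusing features from two image encoders and aggregating the hidden states of a stacked GRU element-wise, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the abstract does not specify the fusion operator, the aggregation operator, the encoders, or any dimensions. The sketch assumes concatenation followed by a linear projection for fusion and an element-wise sum for aggregation, and every name in it (`fuse`, `decode`, `GRUCell`, `HIDDEN`) and every size is hypothetical. Random vectors stand in for the encoder outputs and word embeddings.

```python
import math
import random

random.seed(0)

def rand_matrix(rows, cols, scale=0.1):
    """Small random weight matrix (stand-in for trained parameters)."""
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class GRUCell:
    """Standard GRU cell without biases; one (W, U) pair per gate."""
    def __init__(self, input_size, hidden_size):
        # Index 0: update gate z, 1: reset gate r, 2: candidate state n.
        self.W = [rand_matrix(hidden_size, input_size) for _ in range(3)]
        self.U = [rand_matrix(hidden_size, hidden_size) for _ in range(3)]

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W[0], x), matvec(self.U[0], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W[1], x), matvec(self.U[1], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W[2], x), matvec(self.U[2], rh))]
        # New state is a gate-weighted mix of the candidate and the old state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def fuse(feat_a, feat_b, w_proj):
    """ASSUMPTION: fusion = concatenate both encoder features, then project."""
    return [math.tanh(v) for v in matvec(w_proj, feat_a + feat_b)]

def decode(embeddings, h0, gru1, gru2):
    """Two stacked GRU layers; ASSUMPTION: aggregation = element-wise sum."""
    h1, h2 = list(h0), list(h0)
    outputs = []
    for x in embeddings:
        h1 = gru1.step(x, h1)   # first layer reads the word embedding
        h2 = gru2.step(h1, h2)  # second layer reads the first layer's state
        outputs.append([a + b for a, b in zip(h1, h2)])
    return outputs

HIDDEN = 16                                        # hypothetical hidden size
feat_a = [random.gauss(0, 1) for _ in range(8)]    # stand-in for encoder A output
feat_b = [random.gauss(0, 1) for _ in range(6)]    # stand-in for encoder B output
w_proj = rand_matrix(HIDDEN, len(feat_a) + len(feat_b))
h0 = fuse(feat_a, feat_b, w_proj)                  # fused image state seeds the decoder
embeddings = [[random.gauss(0, 1) for _ in range(10)] for _ in range(5)]
out = decode(embeddings, h0, GRUCell(10, HIDDEN), GRUCell(HIDDEN, HIDDEN))
print(len(out), len(out[0]))  # prints: 5 16
```

In a full captioning model, each aggregated vector in `out` would feed a vocabulary projection and softmax to predict the next word; that step is omitted here.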

representative citing papers

Text-RSIR: A Text-Guided Framework for Efficient Remote Sensing Image Transmission and Reconstruction

eess.IV · 2026-05-15 · unverdicted · novelty 4.0

A text-guided framework for remote sensing image transmission uses low-res images and compact text to reduce data volume to 2%, with text-conditioned reconstruction achieving PSNRs of 16.36-27.41 dB on tested datasets.

JSSFF: A Joint Structural-Semantic Fusion Framework for Remote Sensing Image Captioning

cs.CV · 2026-04-27 · unverdicted · novelty 4.0

JSSFF improves remote sensing image captioning by fusing structural edge details with semantic features in an encoder-decoder model and using fairness-based beam search, outperforming baselines on quantitative and qualitative measures.

