pith. sign in

arxiv: 2505.12217 · v3 · pith:DPR5SBT5new · submitted 2025-05-18 · 💻 cs.CV

HyperCap: Hyperspectral Land Cover Captioning Dataset for Vision Language Models

Pith reviewed 2026-05-22 15:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords Hyperspectral imagingLand cover captioningVision-language modelsRemote sensing datasetPixel-wise annotationsClassification performance
0
0 comments X

The pith

HyperCap pairs hyperspectral images with pixel-wise text captions to improve land cover understanding in remote sensing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperCap as the first large-scale dataset that adds pixel-wise textual descriptions to hyperspectral images. This combination lets vision-language models move beyond raw spectral data to capture semantic meaning in land cover. The dataset draws from four existing benchmarks and uses a mix of automated and manual annotation. Evaluations with modern encoders and fusion methods show better classification results. A sympathetic reader would see this as a step toward richer, language-guided analysis of earth observation data.

Core claim

HyperCap is constructed from four benchmark datasets and annotated through a hybrid automated and manual process to produce accurate pixel-wise textual descriptions; when paired with state-of-the-art encoders and diverse fusion techniques, it yields significant improvements in hyperspectral land cover classification performance.

What carries the argument

The HyperCap dataset itself, which supplies pixel-wise textual annotations alongside hyperspectral spectral data to support semantic tasks in vision-language models.

If this is right

  • Vision-language models gain the ability to perform classification and feature extraction directly on hyperspectral data using textual guidance.
  • Remote sensing pipelines can incorporate language-based queries for land cover analysis instead of purely spectral methods.
  • Future work can build larger models or benchmarks on top of this combined spectral-textual resource.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same annotation approach could be tested on multispectral or SAR imagery to check whether textual captions transfer across sensor types.
  • If the captions prove stable, the dataset might support zero-shot or few-shot land cover mapping in new geographic regions.
  • Integration with temporal hyperspectral sequences could enable change detection described in natural language.

Load-bearing premise

The hybrid automated and manual annotation process produces accurate and consistent pixel-wise textual descriptions that meaningfully capture land cover semantics.

What would settle it

Independent expert re-annotation of a random sample of pixels that finds frequent mismatches between the provided captions and actual land cover types would show the annotations do not reliably capture semantics.

Figures

Figures reproduced from arXiv: 2505.12217 by Antonio Plaza, Aryan Das, Koushik Biswas, Pravendra Singh, Salvador Garcia, Swalpa Kumar Roy, Tanishq Rachamalla, Vinay Kumar Verma.

Figure 1
Figure 1. Figure 1: Qualitative Analysis of Class Distribution for the Botswana, Houston13, Indian Pines and KSC datasets. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative Analysis of Part-of-Speech Distribution in Captions for the Botswana, Houston13, Indian Pines [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Quantitative Visualization of captions per class across Botswana, Houston13, Indian Pines and KSC datasets. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The plot for the t-SNE visualizations over the Botswana, Houston13, Indian Pines and KSC datasets. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of four sample datasets used in the study. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Captions Before and After Manual Refinement. [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Comparison of classification maps for the 3D-RCNet-BERT model on the Botswana dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of classification maps for the 3D-RCNet-T5 model on the Botswana dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of classification maps for the 3D-ConvSST-BERT model on the Botswana dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p012_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of classification maps for the 3D-ConvSST-T5 model on the Botswana dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of classification maps for the DBCTNet-BERT model on the Botswana dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Comparison of classification maps for the DBCTNet-T5 model on the Botswana dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Comparison of classification maps for the FAHM-BERT model on the Botswana dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p013_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Comparison of classification maps for the FAHM-T5 model on the Botswana dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p013_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Comparison of classification maps for the 3D-RCNet-Bert model on the Houston13 dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p013_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Comparison of classification maps for the 3D-RCNet-T5 model on the Houston13 dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p013_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Comparison of classification maps for the 3D-ConvSST-Bert model on the Houston13 dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p014_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Comparison of classification maps for the 3D-ConvSST-T5 model on the Houston13 dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p014_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Comparison of classification maps for the DBCTNet-Bert model on the Houston13 dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p014_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Comparison of classification maps for the DBCTNet-T5 model on the Houston13 dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p014_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Comparison of classification maps for the FAHM-Bert model on the Houston13 dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p014_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Comparison of classification maps for the FAHM-T5 model on the Houston13 dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p015_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Comparison of classification maps for the 3D-RCNet-Bert model on the Indian Pines dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p015_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: Comparison of classification maps for the 3D-RCNet-T5 model on the Indian Pines dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p015_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: Comparison of classification maps for the 3D-ConvSST-Bert model on the Indian Pines dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p015_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: Comparison of classification maps for the 3D-ConvSST-T5 model on the Indian Pines dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p016_26.png] view at source ↗
Figure 27
Figure 27. Figure 27: Comparison of classification maps for the DBCTNet-Bert model on the Indian Pines dataset, showing [PITH_FULL_IMAGE:figures/full_fig_p016_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Comparison of classification maps for the DBCTNet-T5 model on the Indian Pines dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p016_28.png] view at source ↗
Figure 29
Figure 29. Figure 29: Comparison of classification maps for the FAHM-Bert model on the Indian Pines dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p016_29.png] view at source ↗
Figure 30
Figure 30. Figure 30: Comparison of classification maps for the FAHM-T5 model on the Indian Pines dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p017_30.png] view at source ↗
Figure 31
Figure 31. Figure 31: Comparison of classification maps for the 3D-RCNet-Bert model on the KSC dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p017_31.png] view at source ↗
Figure 32
Figure 32. Figure 32: Comparison of classification maps for the 3D-RCNet-T5 model on the KSC dataset, showing different fusion [PITH_FULL_IMAGE:figures/full_fig_p017_32.png] view at source ↗
Figure 33
Figure 33. Figure 33: Comparison of classification maps for the 3D-ConvSST-Bert model on the KSC dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p017_33.png] view at source ↗
Figure 34
Figure 34. Figure 34: Comparison of classification maps for the 3D-ConvSST-T5 model on the KSC dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p018_34.png] view at source ↗
Figure 35
Figure 35. Figure 35: Comparison of classification maps for the DBCTNet-Bert model on the KSC dataset, showing different [PITH_FULL_IMAGE:figures/full_fig_p018_35.png] view at source ↗
Figure 36
Figure 36. Figure 36: Comparison of classification maps for the DBCTNet-T5 model on the KSC dataset, showing different fusion [PITH_FULL_IMAGE:figures/full_fig_p018_36.png] view at source ↗
Figure 37
Figure 37. Figure 37: Comparison of classification maps for the FAHM-Bert model on the KSC dataset, showing different fusion [PITH_FULL_IMAGE:figures/full_fig_p018_37.png] view at source ↗
Figure 38
Figure 38. Figure 38: Comparison of classification maps for the FAHM-T5 model on the KSC dataset, showing different fusion [PITH_FULL_IMAGE:figures/full_fig_p019_38.png] view at source ↗
Figure 39
Figure 39. Figure 39: Comparison of classification maps for the KSC dataset, showing different maps: Cross Attention (CA), [PITH_FULL_IMAGE:figures/full_fig_p019_39.png] view at source ↗
Figure 40
Figure 40. Figure 40: Comparison of classification maps for the Indian Pines dataset, showing different maps: 3D-RCNet, [PITH_FULL_IMAGE:figures/full_fig_p019_40.png] view at source ↗
Figure 41
Figure 41. Figure 41: Comparison of classification maps for the Botswana dataset, showing different maps: 3D-RCNet, 3D [PITH_FULL_IMAGE:figures/full_fig_p019_41.png] view at source ↗
Figure 42
Figure 42. Figure 42: Comparison of classification maps for the Houston13 dataset, showing different maps: 3D-RCNet, 3D [PITH_FULL_IMAGE:figures/full_fig_p019_42.png] view at source ↗
read the original abstract

We introduce HyperCap, the first large-scale hyperspectral captioning dataset designed to enhance model performance and effectiveness in remote sensing applications. Unlike traditional hyperspectral imaging (HSI) benchmarks, HyperCap integrates spectral data with pixel-wise textual annotations, enabling deeper semantic understanding. This dataset enhances model performance in tasks like classification and feature extraction, providing a valuable resource for advanced remote sensing applications. HyperCap is constructed from four benchmark datasets and annotated through a hybrid approach combining automated and manual methods to ensure accuracy and consistency. Empirical evaluations using state-of-the-art encoders and diverse fusion techniques demonstrate significant improvements in classification performance. These results underscore the potential of vision-language learning in HSI and position HyperCap as a foundational dataset for future research in the field. The code and dataset are available at https://github.com/arya-domain/HyperCap.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces HyperCap, the first large-scale hyperspectral captioning dataset for vision-language models in remote sensing. Constructed from four existing HSI benchmark datasets, it provides pixel-wise textual annotations generated via a hybrid automated-plus-manual process. The central claim is that empirical evaluations with state-of-the-art encoders and diverse fusion techniques yield significant improvements in land-cover classification performance, positioning the dataset as foundational for multimodal HSI research. The code and dataset are released publicly.

Significance. If the pixel-wise captions prove semantically faithful and the classification gains are reproducible with proper controls, HyperCap could meaningfully advance vision-language approaches in hyperspectral remote sensing by supplying paired spectral-textual data at scale. The public release of code and data is a clear strength that supports reproducibility and community follow-up work.

major comments (2)
  1. [Dataset Construction] Dataset Construction section: the hybrid automated and manual annotation process is described but supplies no quantitative validation metrics such as inter-annotator agreement, caption-error rates, or expert-verification statistics. This is load-bearing for the central claim because any observed classification improvements could be driven by label noise or semantic drift rather than the intended multimodal signal.
  2. [Empirical Evaluations] Empirical Evaluations / Abstract: the claim of 'significant improvements in classification performance' using state-of-the-art encoders and diverse fusion techniques is asserted without reporting specific quantitative metrics, baseline comparisons, error bars, data-split details, or statistical tests. This leaves the headline empirical result unsupported from the provided text.
minor comments (2)
  1. [Abstract] Abstract: the four source HSI datasets are referenced generically; explicitly naming them (e.g., Indian Pines, Salinas) would improve clarity and allow readers to assess coverage immediately.
  2. [Notation and figures] Notation and figures: ensure consistent capitalization of 'HSI' and 'VLM' after first use, and verify that any result tables include standard deviation or confidence intervals for the reported metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to improve clarity and support for our claims.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset Construction section: the hybrid automated and manual annotation process is described but supplies no quantitative validation metrics such as inter-annotator agreement, caption-error rates, or expert-verification statistics. This is load-bearing for the central claim because any observed classification improvements could be driven by label noise or semantic drift rather than the intended multimodal signal.

    Authors: We agree that quantitative validation metrics for the annotation process would strengthen the manuscript. The current text describes the hybrid automated-manual workflow but does not report inter-annotator agreement, error rates, or expert verification statistics. In the revised version we will add a new paragraph in the Dataset Construction section that includes these metrics, computed on a representative sample of captions, to demonstrate semantic fidelity and rule out substantial label noise. revision: yes

  2. Referee: [Empirical Evaluations] Empirical Evaluations / Abstract: the claim of 'significant improvements in classification performance' using state-of-the-art encoders and diverse fusion techniques is asserted without reporting specific quantitative metrics, baseline comparisons, error bars, data-split details, or statistical tests. This leaves the headline empirical result unsupported from the provided text.

    Authors: The Empirical Evaluations section of the full manuscript contains tables with accuracy numbers, baseline comparisons, and data-split descriptions. However, we acknowledge that the abstract and high-level claims do not explicitly include error bars or statistical tests. We will revise the abstract to report key quantitative improvements with standard deviations and will add a brief statement on statistical testing in the main text to make the empirical results fully transparent. revision: partial

Circularity Check

0 steps flagged

No circularity detected in derivation or claims

full rationale

This is a dataset-construction paper that assembles HyperCap from four existing HSI benchmarks via a hybrid automated-plus-manual annotation pipeline and then runs standard empirical evaluations with off-the-shelf encoders and fusion methods. No equations, fitted parameters, predictions, or uniqueness theorems appear; the central claim is simply that the new captions improve downstream classification when used with existing vision-language techniques. All reported results are therefore independent of any self-referential reduction and rest on external benchmarks and models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper with no mathematical model, free parameters, axioms, or invented entities. The contribution rests on the curation process and the claim that the resulting annotations are accurate.

pith-pipeline@v0.9.0 · 5694 in / 1039 out tokens · 38016 ms · 2026-05-22T15:11:45.103713+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 2 internal anchors

  1. [1]

    Deep learning techniques for hyperspectral image analysis in agriculture: A review,

    M. F. Guerri, C. Distante, P. Spagnolo, F. Bougourzi, and A. Taleb-Ahmed, “Deep learning techniques for hyperspectral image analysis in agriculture: A review,”ISPRS Open Journal of Photogrammetry and Remote Sensing, vol. 12, p. 100062, 2024

  2. [2]

    Deep learning for hyperspectral image classification: An overview,

    S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson, “Deep learning for hyperspectral image classification: An overview,”IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 9, pp. 6690–6709, 2019

  3. [3]

    Integration of remote sensing and crowdsourced data for fine-grained urban flood detection,

    Z. Liu, J. Li, L. Wang, and A. Plaza, “Integration of remote sensing and crowdsourced data for fine-grained urban flood detection,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 13523–13532, 2024

  4. [4]

    A survey: Deep learning for hyperspectral image classification with few labeled samples,

    S. Jia, S. Jiang, Z. Lin, N. Li, M. Xu, and S. Yu, “A survey: Deep learning for hyperspectral image classification with few labeled samples,”Neurocomputing, vol. 448, pp. 179–204, 2021

  5. [5]

    Hyperspectral image super resolution with real unaligned rgb guidance,

    Z. Lai, Y. Fu, and J. Zhang, “Hyperspectral image super resolution with real unaligned rgb guidance,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 2, pp. 2999–3011, 2025

  6. [6]

    A new cnn training approach with application to hyperspectral image classification,

    S. Kutluk, K. Kayabol, and A. Akan, “A new cnn training approach with application to hyperspectral image classification,” Digital Signal Processing, vol. 113, p. 103016, 2021

  7. [7]

    Attention-based adaptive spectral–spatial kernel resnet for hyperspectral image classification,

    S. K. Roy, S. Manna, T. Song, and L. Bruzzone, “Attention-based adaptive spectral–spatial kernel resnet for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 9, pp. 7831–7843, 2020

  8. [8]

    Hyperspectral image transformer classification networks,

    X. Yang, W. Cao, Y. Lu, and Y. Zhou, “Hyperspectral image transformer classification networks,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022

  9. [9]

    Multimodal fusion transformer for remote sensing image classification,

    S. K. Roy, A. Deria, D. Hong, B. Rasti, A. Plaza, and J. Chanussot, “Multimodal fusion transformer for remote sensing image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–20, 2023

  10. [10]

    Spectral–spatial morphological attention transformer for hyperspectral image classification,

    S. K. Roy, A. Deria, C. Shah, J. M. Haut, Q. Du, and A. Plaza, “Spectral–spatial morphological attention transformer for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023

  11. [11]

    Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines,

    L. He, J. Li, C. Liu, and S. Li, “Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines,”IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 3, pp. 1579–1597, 2018

  12. [12]

    Ai-driven hsi: Multimodality, fusion, challenges, and the deep learning revolution,

    D. S. Bhatti, Y. Choi, R. S. Wahidur, M. Bakhtawar, S. Kim, S. Lee, Y. Lee, and H.-N. Lee, “Ai-driven hsi: Multimodality, fusion, challenges, and the deep learning revolution,”arXiv preprint arXiv:2502.06894, 2025

  13. [13]

    Hyperspectral image classification with contrastive self-supervised learning under limited labeled samples,

    L. Zhao, W. Luo, Q. Liao, S. Chen, and J. Wu, “Hyperspectral image classification with contrastive self-supervised learning under limited labeled samples,”IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022

  14. [14]

    A semisupervised siamese network for hyperspectral image classification,

    S. Jia, S. Jiang, Z. Lin, M. Xu, W. Sun, Q. Huang, J. Zhu, and X. Jia, “A semisupervised siamese network for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2022

  15. [15]

    Towards explainable ai for hyperspectral image classification in edge computing environments,

    G. De Lucia, M. Lapegna, and D. Romano, “Towards explainable ai for hyperspectral image classification in edge computing environments,”Computers and Electrical Engineering, vol. 103, p. 108381, 2022

  16. [16]

    Progressive semantic-visual mutual adaption for generalized zero-shot learning,

    M. Liu, F. Li, C. Zhang, Y. Wei, H. Bai, and Y. Zhao, “Progressive semantic-visual mutual adaption for generalized zero-shot learning,”inProceedingsoftheIEEE/CVFConferenceonComputerVisionandPatternRecognition(CVPR) ,pp.15337–15346, June 2023

  17. [17]

    Geochat: Grounded large vision-language model for remote sensing,

    K. Kuckreja, M. S. Danish, M. Naseer, A. Das, S. Khan, and F. S. Khan, “Geochat: Grounded large vision-language model for remote sensing,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 27831–27840, June 2024

  18. [18]

    Recurrent attention and semantic gate for remote sensing image captioning,

    Y. Li, X. Zhang, J. Gu, C. Li, X. Wang, X. Tang, and L. Jiao, “Recurrent attention and semantic gate for remote sensing image captioning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2022

  19. [19]

    Attention multisource fusion-based deep few-shot learning for hyperspectral image classification,

    X. Liang, Y. Zhang, and J. Zhang, “Attention multisource fusion-based deep few-shot learning for hyperspectral image classification,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 8773–8788, 2021

  20. [20]

    High-resolutionremotesensingimagecaptioningbasedonstructuredattention,

    R.Zhao,Z.Shi,andZ.Zou,“High-resolutionremotesensingimagecaptioningbasedonstructuredattention,” IEEETransactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022

  21. [21]

    Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art,

    P. Ghamisi, B. Rasti, N. Yokoya, Q. Wang, B. Hofle, L. Bruzzone, F. Bovolo, M. Chi, K. Anders, R. Gloaguen, P. M. Atkinson, and J. A. Benediktsson, “Multisource and multitemporal data fusion in remote sensing: A comprehensive review of the state of the art,”IEEE Geoscience and Remote Sensing Magazine, vol. 7, no. 1, pp. 6–39, 2019

  22. [22]

    220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3,

    M. Baumgardner, L. Biehl, and D. Landgrebe, “220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3,”https://purr.purdue.edu/publications/1947/1, 2015

  23. [23]

    Contrastivemultiviewsubspaceclusteringofhyperspectral images based on graph convolutional networks,

    R.Guan,Z.Li,W.Tu,J.Wang,Y.Liu,X.Li,C.Tang,andR.Feng,“Contrastivemultiviewsubspaceclusteringofhyperspectral images based on graph convolutional networks,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

  24. [24]

    Enhanced multiscale feature fusion network for hsi classification,

    J. Yang, C. Wu, B. Du, and L. Zhang, “Enhanced multiscale feature fusion network for hsi classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 12, pp. 10328–10347, 2021. 20 Running Title for Header

  25. [25]

    Adaptive hybrid attention network for hyperspectral image classification,

    S. Pande and B. Banerjee, “Adaptive hybrid attention network for hyperspectral image classification,”Pattern Recognition Letters, vol. 144, pp. 6–12, 2021

  26. [26]

    Anovelspatial–spectralpyramidnetworkforhyperspectralimageclassification,

    J.Zhou,S.Zeng,G.Gao,Y.Chen,andY.Tang,“Anovelspatial–spectralpyramidnetworkforhyperspectralimageclassification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

  27. [27]

    Hyperspectral image classification using 3d 2d cnn,

    A. Diakite, G. Jiangsheng, and F. Xiaping, “Hyperspectral image classification using 3d 2d cnn,”IET Image Processing, vol. 15, no. 5, pp. 1083–1092, 2021

  28. [28]

    Semantic segmentation and spatial relationship modeling in hyperspectral imagery using deep learning and graph-based representations,

    R. Yenni and P. Arun, “Semantic segmentation and spatial relationship modeling in hyperspectral imagery using deep learning and graph-based representations,” in2024 14th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), pp. 1–4, IEEE, 2024

  29. [29]

    Multispectral data mining: A focus on remote sensing satellite images,

    S. L. Lim, J. Sreevalsan-Nair, and B. Daya Sagar, “Multispectral data mining: A focus on remote sensing satellite images,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 14, no. 2, p. e1522, 2024

  30. [30]

    Theharmonizedlandsat and sentinel-2 surface reflectance data set,

    M.Claverie, J.Ju, J.G.Masek, J.L.Dungan, E.F.Vermote, J.-C.Roger, S.V.Skakun, andC.Justice, “Theharmonizedlandsat and sentinel-2 surface reflectance data set,”Remote sensing of environment, vol. 219, pp. 145–161, 2018

  31. [31]

    Deeplearningdetectsinvasiveplantspeciesacrosscomplexlandscapes using worldview-2 and planetscope satellite imagery,

    T.A.Lake, R.D.BriscoeRunquist, andD.A.Moeller, “Deeplearningdetectsinvasiveplantspeciesacrosscomplexlandscapes using worldview-2 and planetscope satellite imagery,”Remote Sensing in Ecology and Conservation, vol. 8, no. 6, pp. 875–889, 2022

  32. [32]

    Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,

    X. Sun, P. Wang, Z. Yan, F. Xu, R. Wang, W. Diao, J. Chen, J. Li, Y. Feng, T. Xu,et al., “Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 184, pp. 116–130, 2022

  33. [33]

    Remote sensing image scene classification based on mutual learning with complementary multi-features,

    A. Chen and M. Xu, “Remote sensing image scene classification based on mutual learning with complementary multi-features,” IEEE Access, vol. 13, pp. 33436–33454, 2025

  34. [34]

    Scene attention mechanism for remote sensing image caption generation,

    S. Wu, X. Zhang, X. Wang, C. Li, and L. Jiao, “Scene attention mechanism for remote sensing image caption generation,” in 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, 2020

  35. [35]

    Direction-oriented visual–semantic embedding model for remote sensing image–text retrieval,

    Q. Ma, J. Pan, and C. Bai, “Direction-oriented visual–semantic embedding model for remote sensing image–text retrieval,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024

  36. [36]

    Nwpu-captions dataset and mlca-net for remote sensing image captioning,

    Q. Cheng, H. Huang, Y. Xu, Y. Zhou, H. Li, and Z. Wang, “Nwpu-captions dataset and mlca-net for remote sensing image captioning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–19, 2022

  37. [37]

    Object detection in aerial images using dota dataset: A survey,

    Z. Chen, H. Wang, X. Wu, J. Wang, X. Lin, C. Wang, K. Gao, M. Chapman, and D. Li, “Object detection in aerial images using dota dataset: A survey,”International Journal of Applied Earth Observation and Geoinformation, vol. 134, p. 104208, 2024

  38. [38]

    Modality fusion vision transformer for hyperspectral and lidar data collaborative classification,

    B. Yang, X. Wang, Y. Xing, C. Cheng, W. Jiang, and Q. Feng, “Modality fusion vision transformer for hyperspectral and lidar data collaborative classification,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 17, pp. 17052–17065, 2024

  39. [39]

    Aerorit: A new scene for hyperspectral image analysis,

    A. Rangnekar, N. Mokashi, E. J. Ientilucci, C. Kanan, and M. J. Hoffman, “Aerorit: A new scene for hyperspectral image analysis,”IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 11, pp. 8116–8124, 2020

  40. [40]

    Language-awaredomaingeneralizationnetworkforcross-scenehyperspectral image classification,

    Y.Zhang,M.Zhang,W.Li,S.Wang,andR.Tao,“Language-awaredomaingeneralizationnetworkforcross-scenehyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023

  41. [41]

    Spectral–spatial transformer network for hyperspectral image classification: A factorized architecture search framework,

    Z. Zhong, Y. Li, L. Ma, J. Li, and W.-S. Zheng, “Spectral–spatial transformer network for hyperspectral image classification: A factorized architecture search framework,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022

  42. [42]

    Deep learning for classification of hyperspectral data: A comparative review,

    N. Audebert, B. Le Saux, and S. Lefèvre, “Deep learning for classification of hyperspectral data: A comparative review,”IEEE Geoscience and Remote Sensing Magazine, vol. 7, pp. 159–173, June 2019

  43. [43]

    Topological structure and semantic information transfer network for cross-scene hyperspectral image classification,

    Y. Zhang, W. Li, M. Zhang, Y. Qu, R. Tao, and H. Qi, “Topological structure and semantic information transfer network for cross-scene hyperspectral image classification,”IEEE Transactions on Neural Networks and Learning Systems, pp. 1–14, 2021

  44. [44]

    Hyperspectral patterns with deep learning for classification for indian pines,

    V. Vishwanath, K. Sreekanth, J. Prakash, A. Rajendran, and G. Gopakumar, “Hyperspectral patterns with deep learning for classification for indian pines,” in2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), pp. 1–7, 2024

  45. [45]

    Gpt-4 technical report,

    OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundag...

  46. [46]

    Mistral large,

    M. AI, “Mistral large,”Mistral AI News, 2024. Accessed: 2025-03-08

  47. [47]

    Le chat,

    M. AI, “Le chat,” 2024. Large Language Model

  48. [48]

    Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,

    J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” inICML, 2022

  49. [49]

    GIT: A Generative Image-to-text Transformer for Vision and Language

    J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, and L. Wang, “Git: A generative image-to-text transformer for vision and language,”arXiv preprint arXiv:2205.14100, 2022

  50. [50]

    mplug: Effective and effi- cient vision-language learning by cross-modal skip- connections

    C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao,et al., “mplug: Effective and efficient vision-language learning by cross-modal skip-connections,”arXiv preprint arXiv:2205.12005, 2022

  51. [51]

    Vinvl: Revisiting visual representations in vision- language models,

    P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, “Vinvl: Revisiting visual representations in vision- language models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5579–5588, 2021

  52. [52]

    Visualbert: A simple and performant baseline for vision and language,

    L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visualbert: A simple and performant baseline for vision and language,” 2019

  53. [53]

    Bert has more to offer: Bert layers combination yields better sentence embeddings,

    M. Hosseini, M. Munia, and L. Khan, “Bert has more to offer: Bert layers combination yields better sentence embeddings,” in Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 15419–15431, 2023

  54. [54]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, “Microsoft coco captions: Data collection and evaluation server,”arXiv preprint arXiv:1504.00325, 2015

  55. [55]

    Dbctnet: Double branch convolution-transformer network for hyperspectral image classification,

    R. Xu, X.-M. Dong, W. Li, J. Peng, W. Sun, and Y. Xu, “Dbctnet: Double branch convolution-transformer network for hyperspectral image classification,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024

  56. [56]

    Fahm: Frequency-awarehierarchicalmambaforhyperspectralimage classification,

    P.Zhuang,X.Zhang,H.Wang,T.Zhang,L.Liu,andJ.Li,“Fahm: Frequency-awarehierarchicalmambaforhyperspectralimage classification,”IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 18, pp. 6299–6313, 2025

  57. [57]

    3d-convolution guided spectral-spatial transformer for hyperspectral image classification,

    S. Varahagiri, A. Sinha, S. R. Dubey, and S. K. Singh, “3d-convolution guided spectral-spatial transformer for hyperspectral image classification,” in2024 IEEE Conference on Artificial Intelligence (CAI), pp. 8–14, IEEE, 2024

  58. [58]

    3d-rcnet: Learning from transformer to build a 3d relational convnet for hyperspectral image classification,

    H. Jing, L. Wan, X. Xue, H. Zhang, and Y. Li, “3d-rcnet: Learning from transformer to build a 3d relational convnet for hyperspectral image classification,”arXiv preprint arXiv:2408.13728, 2024

  59. [59]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020. 22