LESSViT: Robust Hyperspectral Representation Learning under Spectral Configuration Shift

Han Zhao; Haozhe Si; Minh Do; Yuqing Wang; Yuxuan Wan

arxiv: 2605.18541 · v1 · pith:LXDKSNRYnew · submitted 2026-05-18 · 💻 cs.CV

Haozhe Si, Yuxuan Wan, Yuqing Wang, Minh Do, Han Zhao

Pith reviewed 2026-05-20 11:12 UTC · model grok-4.3

classification 💻 cs.CV

keywords: hyperspectral imaging, vision transformers, spectral generalization, low-rank attention, cross-sensor robustness, masked autoencoders, remote sensing


The pith

Low-rank factorization in attention lets hyperspectral models generalize across different sensors without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to solve the problem that vision models trained on hyperspectral images from one sensor often fail when the wavelength coverage or number of channels changes with a new sensor. It does this by replacing full spatial-spectral attention with a low-rank factorization that separates the spatial and spectral parts while still capturing their joint effects. Flexible patch embeddings that do not assume a fixed channel count and wavelength-aware positional encodings further remove the need to retrain or redesign the model for each sensor. A pretraining scheme using a masked autoencoder with separate spatial and spectral masking strategies supports efficient learning on this flexible input. If these steps succeed, hyperspectral representation learning becomes scalable across the many different sensors used in practice rather than remaining tied to single-sensor datasets.
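To make the channel-agnostic idea concrete, here is a minimal pure-Python sketch of a tied patch embedding: one projection is shared by every channel, so the channel count C never appears in the weight shape. The function name `tied_patch_embed`, the toy weights, and the list-of-lists image layout are illustrative assumptions, not the paper's implementation.

```python
def tied_patch_embed(image, weight, patch):
    """Embed each channel's patches with the SAME projection (hypothetical helper).

    image:  list of C channels, each an H x W list of lists
    weight: D x (patch * patch) projection shared across all channels
    Returns tokens[c][n] = D-dimensional embedding of patch n in channel c.
    """
    tokens = []
    for channel in image:
        height, width = len(channel), len(channel[0])
        channel_tokens = []
        for top in range(0, height - patch + 1, patch):
            for left in range(0, width - patch + 1, patch):
                # Flatten one patch x patch window of a single channel.
                flat = [channel[top + i][left + j]
                        for i in range(patch) for j in range(patch)]
                # Project with the shared weight; no per-channel parameters.
                channel_tokens.append(
                    [sum(w * v for w, v in zip(row, flat)) for row in weight]
                )
        tokens.append(channel_tokens)
    return tokens


# Toy weights: embedding dimension D = 2, patch size 2.
shared_weight = [[1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5]]

# The same weights handle a 3-channel and a 5-channel input.
image_3 = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
image_5 = [[[1.0] * 4 for _ in range(4)] for _ in range(5)]
tokens_3 = tied_patch_embed(image_3, shared_weight, patch=2)
tokens_5 = tied_patch_embed(image_5, shared_weight, patch=2)
```

Because the weight has shape D x P², switching sensors changes only the number of token rows produced, not the parameters.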

Core claim

LESSViT shows that joint spatial-spectral interactions can be modeled explicitly and efficiently through a structured low-rank factorization. The factorization reduces the complexity of full spatial-spectral attention from quadratic in both the number of spatial tokens and the number of channels to linear in the product of spatial tokens, channels, and a small rank parameter. Combined with channel-agnostic embeddings and wavelength-aware encodings, it produces models that remain competitive on the original spectral configuration while improving robustness when the spectral configuration shifts.
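A quick arithmetic sketch shows the scale of the gap between the two complexity terms quoted in the abstract, $O(N^2 C^2)$ and $O(rNC)$. The values N = 196, C = 200, and r = 8 are made-up illustrative numbers, and the counts ignore constants and the embedding dimension, so they indicate orders of magnitude only.

```python
def full_attention_term(n_tokens, n_channels):
    """Leading term for attention over all N*C spatial-spectral tokens: (N*C)^2."""
    return (n_tokens * n_channels) ** 2


def low_rank_term(n_tokens, n_channels, rank):
    """Leading term stated for the low-rank factorization: r*N*C."""
    return rank * n_tokens * n_channels


# Hypothetical configuration: 14 x 14 spatial grid, 200 bands, rank 8.
N, C, R = 196, 200, 8
full = full_attention_term(N, C)
low = low_rank_term(N, C, R)
ratio = full // low
```

With these assumed values the ratio of the two terms equals N·C / r, which is why the gap widens as either the spatial grid or the channel count grows.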

What carries the argument

LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components.
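The page later quotes the paper's composition as A := ∑_{i=1}^r A_C^i ⊗ A_S^i. The sketch below illustrates only the generic Kronecker identity behind such a form: a sum of Kronecker products can be applied to a C × N token grid as ∑ A_C^i X (A_S^i)ᵀ without ever building the (CN) × (CN) matrix. The helper names, the scalar-feature tokens, and the toy matrices are assumptions; this is not the paper's LESS block and does not reproduce its stated O(rNC) cost.

```python
def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def kron(a, b):
    """Kronecker product; row index is (i, k), column index is (j, l)."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def apply_factored(spectral_mats, spatial_mats, x):
    """Compute sum_i A_C^i X (A_S^i)^T for X of shape C x N."""
    n_channels, n_tokens = len(x), len(x[0])
    out = [[0.0] * n_tokens for _ in range(n_channels)]
    for a_c, a_s in zip(spectral_mats, spatial_mats):
        term = matmul(matmul(a_c, x), transpose(a_s))
        for c in range(n_channels):
            for n in range(n_tokens):
                out[c][n] += term[c][n]
    return out


def apply_dense(spectral_mats, spatial_mats, x):
    """Reference path: build the full (CN x CN) matrix and apply it."""
    n_channels, n_tokens = len(x), len(x[0])
    size = n_channels * n_tokens
    joint = [[0.0] * size for _ in range(size)]
    for a_c, a_s in zip(spectral_mats, spatial_mats):
        k = kron(a_c, a_s)
        for i in range(size):
            for j in range(size):
                joint[i][j] += k[i][j]
    flat = [[value] for row in x for value in row]  # row-major vec(X)
    y = matmul(joint, flat)
    return [[y[c * n_tokens + n][0] for n in range(n_tokens)]
            for c in range(n_channels)]


# Toy example: C = 2 channels, N = 3 spatial tokens, rank r = 2.
A_C = [[[0.6, 0.4], [0.3, 0.7]], [[1.0, 0.0], [0.0, 1.0]]]
A_S = [[[0.5, 0.3, 0.2], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]],
       [[0.2, 0.5, 0.3], [0.3, 0.3, 0.4], [0.6, 0.1, 0.3]]]
X = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fast = apply_factored(A_C, A_S, X)
dense = apply_dense(A_C, A_S, X)
```

Both paths give the same result; the factored path never stores the joint matrix, which is the property that makes a separable form attractive when C grows.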

If this is right

Models no longer require fixed assumptions about the exact number of input channels.
Pretraining can proceed with hierarchical channel sampling and decoupled spatial-spectral masking without sensor-specific adjustments.
Computational cost remains practical for high-dimensional hyperspectral volumes because the stated attention complexity, O(rNC), is linear in the rank, the number of spatial tokens, and the number of channels.
Explicit separation of spatial and spectral modeling becomes feasible at scale rather than relying on implicit mixing inside standard transformers.
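This page does not spell out how HyperMAE's decoupled spatial-spectral masking works, so the sketch below is only one plausible reading: spatial positions and spectral channels are sampled independently, and a token is hidden if either its position or its channel was drawn. The function name `decoupled_mask`, the mask ratios, and the "either axis" rule are all assumptions for illustration.

```python
import random


def decoupled_mask(num_channels, num_patches, spatial_ratio, spectral_ratio, seed=0):
    """Return mask[c][n] = True when token (channel c, patch n) is hidden.

    Assumed rule: sample masked patches and masked channels independently,
    then hide a token if its patch OR its channel was sampled.
    """
    rng = random.Random(seed)
    masked_patches = set(
        rng.sample(range(num_patches), int(num_patches * spatial_ratio))
    )
    masked_channels = set(
        rng.sample(range(num_channels), int(num_channels * spectral_ratio))
    )
    return [[(c in masked_channels) or (n in masked_patches)
             for n in range(num_patches)]
            for c in range(num_channels)]


# Hypothetical setting: 10 channels, 16 patches, half of each axis masked.
mask = decoupled_mask(num_channels=10, num_patches=16,
                      spatial_ratio=0.5, spectral_ratio=0.5)
hidden = sum(sum(row) for row in mask)
visible = 10 * 16 - hidden
```

Under this assumed rule the visible tokens form a (kept channels) × (kept patches) sub-grid, so the encoder input stays rectangular whatever the sensor's channel count.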

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same factorization pattern could be tested on other data types whose dimensionality varies across acquisitions, such as multispectral time series or variable-resolution imagery.
Widespread adoption would lower the barrier to deploying a single pretrained backbone across fleets of satellites that use different spectral bands.
The choice of rank in the factorization could be studied as a tunable trade-off between expressiveness and speed on new sensor configurations.

Load-bearing premise

The low-rank factorization still captures the spatial-spectral relationships that matter even when the number and spacing of spectral channels change arbitrarily.

What would settle it

If a LESSViT model trained on one sensor's data is evaluated on a second sensor whose band set is completely disjoint and its accuracy falls below that of a standard ViT that was fine-tuned on the second sensor's data, the claim of robust cross-spectral generalization would be challenged.

Figures

Figures reproduced from arXiv: 2605.18541 by Han Zhao, Haozhe Si, Minh Do, Yuqing Wang, Yuxuan Wan.

**Figure 1.** Overview of LESSViT for cross-spectral generalization. (1) Cross-spectral generalization setting: train on a fixed spectral configuration and evaluate across sensors with varying wavelength coverage and channel configurations. (2) HyperMAE pretraining with decoupled spatial–spectral masking and hierarchical channel sampling for scalable and robust learning. (3) LESS Attention with SSRoPE for efficient…

**Figure 2.** Overview of LESSViT. (a) The tied patch embedding converts a hyperspectral image (C × H × W) into a grid of spatial–spectral tokens with spatial, spectral, and global [CLS] tokens. (b) The LESS block factorizes spatial and spectral attention via structured decomposition, enabling efficient modeling of joint spatial–spectral interactions.

**Figure 3.** Wavelength distributions of the channel configurations. C120VNIR+ and C120SWIR+ have identical channel counts but complementary spectral distributions (spectral shift). C82Full\VNIR+ is disjoint from C120VNIR+ (unseen wavelengths), and C202Full includes all channels (channel expansion). Disjoint configuration (C82Full\VNIR+): 82 channels consisting of the complement of C120VNIR+, i.e., 20 VNIR and 62 SWIR …

**Figure 4.** Qualitative results under cross-spectral generalization. PRGB denotes pseudo-RGB visualization of hyperspectral inputs, and GT denotes ground-truth segmentation masks. For clearer comparison with GT, background pixels are masked out in the predicted segmentation maps. We show results under different spectral configurations, including in-distribution, spectral shift, unseen wavelengths, and channel expansio…

**Figure 5.** Inference latency vs. channel count. Latency is normalized to the C = 10 setting. ChannelViT becomes out-of-memory (OOM) at C = 200 on our hardware (144 GB GPU). … training cost of ChannelViT at scale, we focus on normalized inference wall-clock latency as a practical measure of efficiency. We run inference on 2000 samples while progressively increasing the number of input channels. For each configuration, w…

**Figure 6.** Additional qualitative results for segmentation tasks. We visualize the segmentation maps generated by SpectralViT, LESSViT, HyperSigma and DOFA on different tasks and channel configurations.

Original abstract

Modeling hyperspectral imagery (HSI) across different sensors presents a fundamental challenge due to variations in wavelength coverage, band sampling, and channel dimensionality. As a result, models trained under a fixed spectral configuration often fail to generalize to other sensors. Existing Vision Transformer (ViT) approaches either rely on implicit spectral modeling with fixed channel assumptions or adopt explicit spatial-spectral attention with prohibitive computational cost, leading to a fundamental trade-off between efficiency and expressiveness. In this work, we introduce Low-rank Efficient Spatial-Spectral ViT (LESSViT), a sensor-flexible architecture for cross-spectral generalization. LESSViT is built on LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions through separable spatial and spectral components, reducing the complexity of full spatial-spectral attention from $O(N^2 C^2)$ to $O(rNC)$, where $N$ is the number of spatial tokens, $C$ is the number of spectral channels, and $r$ is the rank of the low-rank approximation. We further incorporate channel-agnostic patch embedding and wavelength-aware positional encoding to support flexible spectral inputs. To enable efficient and robust pretraining, we introduce a hyperspectral masked autoencoder (HyperMAE) with decoupled spatial-spectral masking and hierarchical channel sampling. We evaluate LESSViT under a cross-spectral generalization setting that simulates cross-sensor variability. Experiments on the SpectralEarth benchmark demonstrate that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution, and explicit and efficient spatial-spectral modeling is essential for scalable and generalizable hyperspectral representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LESSViT proposes a low-rank separable attention design to make hyperspectral ViTs more robust to sensor band changes, but the strength of the gains is hard to judge without the numbers.

The main takeaway is that this paper gives a concrete architecture for hyperspectral transformers that can handle different wavelength samplings across sensors. LESSViT uses a low-rank factorization in attention to model joint spatial-spectral effects cheaply, plus channel-agnostic embeddings and wavelength-aware positional encodings to avoid fixed-channel assumptions. The HyperMAE pretraining with decoupled masking is the other practical piece they add for learning from variable inputs.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces LESSViT, a Vision Transformer architecture for hyperspectral imagery designed to handle spectral configuration shifts across sensors. It proposes LESS Attention, a structured low-rank factorization that models joint spatial-spectral interactions via separable components, reducing complexity from O(N² C²) to O(r N C). The design incorporates channel-agnostic patch embedding and wavelength-aware positional encoding for flexible inputs, along with a hyperspectral masked autoencoder (HyperMAE) using decoupled spatial-spectral masking and hierarchical channel sampling for pretraining. Evaluation occurs in a cross-spectral generalization setting on the SpectralEarth benchmark, with the claim that LESSViT improves robustness under spectral shifts while remaining competitive in-distribution.

Significance. If the experimental claims are substantiated, the work would offer a practical advance in hyperspectral representation learning by addressing the efficiency-expressiveness trade-off for cross-sensor generalization. The low-rank factorization and HyperMAE pretraining strategy provide concrete mechanisms for scalable modeling without sensor-specific retraining, which could influence architectures in remote sensing applications. The emphasis on explicit spatial-spectral modeling is a clear contribution relative to implicit ViT baselines.

major comments (2)

[Abstract] Abstract: The central claim of improved robustness under spectral shifts is asserted without any quantitative results, error bars, baseline comparisons, or details on simulation of spectral shifts (e.g., band selection or wavelength perturbations), which prevents verification that the data support the claim of sensor-flexible generalization.
[Methods (LESS Attention)] LESS Attention description (methods): The separable low-rank factorization is presented as retaining sufficient expressiveness for joint spatial-spectral interactions across arbitrary spectral configurations, but no analysis, ablation, or construction details demonstrate that the rank-r decomposition captures non-separable cross-terms dependent on precise wavelength sampling; this directly bears on whether the O(r N C) reduction preserves the sensor-flexible property without hidden adjustments.

minor comments (2)

[Notation] The notation for N (spatial tokens) and C (spectral channels) in the complexity statements should be defined explicitly on first use to improve readability.
[Experiments] Figure captions in the experimental section would benefit from additional detail on the exact parameters used to simulate cross-sensor spectral variability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important areas for strengthening the presentation of our claims and the justification of the LESS Attention mechanism. We address each major comment below and have made revisions to the manuscript to improve clarity and substantiation.

Referee: [Abstract] Abstract: The central claim of improved robustness under spectral shifts is asserted without any quantitative results, error bars, baseline comparisons, or details on simulation of spectral shifts (e.g., band selection or wavelength perturbations), which prevents verification that the data support the claim of sensor-flexible generalization.

Authors: We agree that the abstract would benefit from more concrete quantitative support to allow readers to immediately assess the strength of the claims. In the revised version, we have updated the abstract to include key performance metrics from the SpectralEarth cross-spectral generalization experiments, such as average accuracy improvements and standard deviations across multiple runs. We also reference the specific simulation protocol for spectral shifts (random band selection and wavelength perturbations) and note the in-distribution competitiveness relative to baselines. These details are now cross-referenced to the experimental section for full tables and error bars.
Revision: yes

Referee: [Methods (LESS Attention)] LESS Attention description (methods): The separable low-rank factorization is presented as retaining sufficient expressiveness for joint spatial-spectral interactions across arbitrary spectral configurations, but no analysis, ablation, or construction details demonstrate that the rank-r decomposition captures non-separable cross-terms dependent on precise wavelength sampling; this directly bears on whether the O(r N C) reduction preserves the sensor-flexible property without hidden adjustments.

Authors: This comment correctly identifies a gap in the theoretical and empirical justification. While the overall architecture and experimental results support the sensor-flexible property, the original manuscript did not provide a dedicated analysis of how the rank-r separable factorization approximates non-separable cross-terms that depend on exact wavelength sampling. We have revised the methods section to include an expanded mathematical construction of the low-rank factorization, showing the separable spatial and spectral components and their approximation to joint interactions. We have also added an ablation study varying the rank r under different spectral configurations, along with discussion of how the wavelength-aware positional encoding contributes to preserving flexibility without sensor-specific adjustments.
Revision: yes

Circularity Check

0 steps flagged

No circularity: new architecture with independent design choices

The paper introduces LESSViT as a novel sensor-flexible ViT variant built on explicitly defined components: LESS Attention (structured low-rank factorization reducing O(N²C²) to O(rNC)), channel-agnostic patch embedding, wavelength-aware positional encoding, and HyperMAE with decoupled spatial-spectral masking. These are presented as design decisions rather than quantities derived from or fitted to the target generalization results. No equations reduce the claimed robustness under spectral shifts to a self-referential fit, prior self-citation chain, or renamed empirical pattern. The central claims rest on the proposed architecture's construction and empirical evaluation on SpectralEarth, which are independent of the outputs they are meant to explain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that a low-rank separable factorization can adequately approximate full spatial-spectral attention for arbitrary channel counts and wavelength samplings. The paper introduces new architectural modules whose effectiveness is asserted via benchmark results whose details are not visible in the abstract.

free parameters (1)

rank r: The rank parameter in the low-rank approximation of spatial-spectral attention; its value controls the efficiency-expressiveness trade-off and must be selected for each model.

axioms (1)

Domain assumption: Low-rank factorization of joint spatial-spectral attention preserves the interactions necessary for hyperspectral representation learning.
Invoked when replacing full O(N²C²) attention with O(rNC) separable components.

invented entities (2)

LESS Attention (no independent evidence)
Purpose: Structured low-rank factorization for efficient joint spatial-spectral modeling.
New attention module introduced by the paper; no independent evidence provided in abstract.

HyperMAE (no independent evidence)
Purpose: Hyperspectral masked autoencoder with decoupled spatial-spectral masking and hierarchical channel sampling.
New pretraining method introduced by the paper; no independent evidence provided in abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

LESS Attention decomposes interactions into spatial-only and spectral-only components, whose coupling is captured through a low-rank composition... A := ∑_{i=1}^r A_C^i ⊗ A_S^i

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

wavelength-aware positional encoding... SSRoPE... 1D RoPE over the spectral dimension using wavelengths λ
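The passage above mentions a 1D RoPE over the spectral dimension indexed by wavelength λ. As a rough illustration of why that suits variable band sets, the sketch below applies a standard rotary rotation with the wavelength itself as the position, so attention scores depend on wavelength differences rather than on channel indices. Using micrometres directly as positions, the base of 10000, and the name `rope_rotate` are assumptions; the paper's exact SSRoPE definition is not given on this page.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive feature pairs by angles proportional to `position`.

    `position` is assumed to be a wavelength (e.g. in micrometres) rather
    than an integer channel index.
    """
    dim = len(vec)
    assert dim % 2 == 0, "feature dimension must be even"
    out = []
    for k in range(dim // 2):
        theta = position * base ** (-2.0 * k / dim)
        x, y = vec[2 * k], vec[2 * k + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key features for two bands.
query = [1.0, 0.5, -0.3, 2.0]
key = [0.2, -1.0, 0.7, 0.4]

# Same wavelength gap (0.41), different absolute wavelengths.
score_a = dot(rope_rotate(query, 0.45), rope_rotate(key, 0.86))
score_b = dot(rope_rotate(query, 1.45), rope_rotate(key, 1.86))
```

The two scores agree up to floating-point error, which is the relative-position property that lets a band at an unseen wavelength be placed consistently alongside known ones.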

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 3 internal anchors

[1] Fast Gaussian process estimation for large-scale in situ inference using convolutional neural networks
Yujia Bao, Srinivasan Sivanandan, and Theofanis Karaletsos. Channel vision transformers: An image is worth c x 16 x 16 words. arXiv preprint arXiv:2309.16108,

[2] doi: 10.3390/s18020441
ISSN 1424-8220. doi: 10.3390/s18020441. URL https://www.mdpi.com/1424-8220/18/2/441. Claire Boryan, Zhengwei Yang, Rick Mueller, and Mike Craig. Monitoring us agriculture: the us department of agriculture, national agricultural statistics service, cropland data layer program. Geocarto International, 26(5):341–358,

[3] Spectralearth: Training hyperspectral foundation models at scale. arXiv preprint arXiv:2408.08447,
Nassim Ait Ali Braham, Conrad M Albrecht, Julien Mairal, Jocelyn Chanussot, Yi Wang, and Xiao Xiang Zhu. Spectralearth: Training hyperspectral foundation models at scale. arXiv preprint arXiv:2408.08447,

[4] Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307,
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. arXiv preprint arXiv:2309.12307,

[5] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

[6] Corine land cover (clc) 2018, version 2020_20u1
European Environment Agency. Corine land cover (clc) 2018, version 2020_20u1. https://land.copernicus.eu/pan-european/corine-land-cover,

[7] What do vision transformers learn? a visual exploration
Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, and Tom Goldstein. What do vision transformers learn? a visual exploration. arXiv preprint arXiv:2212.06727,

[8] URL https://www.mdpi.com/2072-4292/7/7/8830
ISSN 2072-4292. doi: 10.3390/rs70708830. URL https://www.mdpi.com/2072-4292/7/7/8830. E Keith Hege, Dan O’Connell, William Johnson, Shridhar Basty, and Eustace L Dereniak. Hyperspectral imaging for astronomy and space surveillance. In Imaging Spectrometry IX, volume 5159, pages 380–391. SPIE,

[9] Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024
Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, et al. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130,

[10] Relora: High-rank training through low-rank updates. arXiv preprint arXiv:2307.05695, 2023
Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira, and Anna Rumshisky. Relora: High-rank training through low-rank updates. arXiv preprint arXiv:2307.05695,

[11] Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

[12] Pearlman, P
doi: 10.1109/TGRS.2003.815018. Weijieying Ren, Xinlong Li, Lei Wang, Tianxiang Zhao, and Wei Qin. Analyzing and reducing catastrophic forgetting in parameter efficient tuning. arXiv preprint arXiv:2402.18865,

[13] Eurocrops: A pan-european dataset for time series crop type classification. arXiv preprint arXiv:2106.08151,
Maja Schneider, Amelie Broszeit, and Marco Körner. Eurocrops: A pan-european dataset for time series crop type classification. arXiv preprint arXiv:2106.08151,

[14] RoFormer: Enhanced Transformer with Rotary Position Embedding
URL https://arxiv.org/abs/2104.09864. Di Wang, Meiqi Hu, Yao Jin, Yuchun Miao, Jiaqi Yang, Yichu Xu, Xiaolei Qin, Jiaqi Ma, Lingyu Sun, Chenxing Li, et al. Hypersigma: Hyperspectral intelligence comprehension foundation model. IEEE Transactions on Pattern Analysis and Machine Intelligence,

[15] Xiong, Y
Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired multimodal foundation model for earth observation. arXiv preprint arXiv:2403.15356,

[16] Sinklora: Enhanced efficiency and chat capabilities for long-context large language models. arXiv preprint arXiv:2406.05678,
Hengyu Zhang. Sinklora: Enhanced efficiency and chat capabilities for long-context large language models. arXiv preprint arXiv:2406.05678,

[17] Tensor product attention is all you need. arXiv preprint arXiv:2501.06425, 2025
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. Tensor product attention is all you need. arXiv preprint arXiv:2501.06425,

[18] Existing approaches can be broadly categorized into two groups based on how they handle the spectral dimension
A Additional Related Works. Hyperspectral Modeling. Most hyperspectral (HSI) and multi-spectral (MSI) models are developed for geospatial applications, where the core challenge lies in modeling joint spatial-spectral interactions. Existing approaches can be broadly categorized into two groups based on how they handle the spectral dimension. The first cla...

[19] We normalize data following Braham et al
optimizer with a weight decay of 5e-2. We normalize data following Braham et al. [2024]. For data augmentation, we apply the random horizontal flip with a probability of 50%. We do not resize the image to preserve the physical property of spatial resolution within the geospatial data. D Evaluation Details We evaluate our pretrained LESS ViT models and oth...
