FOCUS: Fused Observation of Channels for Unveiling Spectra

Anika Tabassum; Aristeidis Tsaris; John Lagergren; Larry M. York; Tianyang Wang; Xiao Wang; Xi Xiao

arXiv: 2507.14787 · v2 · submitted 2025-07-20 · cs.CV · cs.AI

Xi Xiao, Aristeidis Tsaris, Anika Tabassum, John Lagergren, Larry M. York, Tianyang Wang, Xiao Wang

Pith reviewed 2026-05-19 03:48 UTC · model grok-4.3

keywords: hyperspectral imaging · vision transformers · saliency maps · model interpretability · attention mechanisms · spectral prompts · frozen models

The pith

FOCUS enables frozen vision transformers to generate reliable spatial-spectral saliency maps for hyperspectral images in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FOCUS as a way to interpret vision transformers applied to hyperspectral images that contain hundreds of wavelength bands. It adds class-specific spectral prompts to steer attention toward meaningful wavelength groups and introduces a learnable sink token trained with an attraction loss to capture noisy attention instead. These additions let the model produce 3D saliency maps and spectral importance curves in a single forward pass. A sympathetic reader would care because the approach keeps the original model frozen, avoids gradient calculations, and aims to make high-dimensional image decisions more transparent for fields such as agriculture and environmental monitoring.

Core claim

FOCUS is the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. It introduces class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification.

What carries the argument

Class-specific spectral prompts paired with a learnable [SINK] token under an attraction loss, which together direct and stabilize attention flow in a frozen vision transformer processing hyperspectral data.
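To make the mechanism concrete, the sketch below reads saliency straight from prompt-to-patch attention on a toy token sequence. It is a minimal illustration under stated assumptions, not the authors' implementation: the token layout (`[CLS]`, two prompts, `[SINK]`, then a 2×2 patch grid), the single attention head, and the helper names `softmax` and `prompt_saliency` are all hypothetical. It yields one spatial map per prompt; how the paper expands these maps into a full per-band cube is not stated in the material summarized here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prompt_saliency(attn_logits, prompt_rows, patch_cols, height, width):
    """Build one height x width map per prompt from prompt-to-patch attention.

    attn_logits: square list-of-lists of pre-softmax scores (query x key).
    prompt_rows: indices of the prompt tokens, used as queries.
    patch_cols:  indices of the patch tokens, used as keys (row-major grid).
    """
    assert len(patch_cols) == height * width
    maps = []
    for q in prompt_rows:
        weights = softmax(attn_logits[q])        # attention over all keys
        flat = [weights[k] for k in patch_cols]  # keep the patch keys only
        maps.append([flat[r * width:(r + 1) * width] for r in range(height)])
    return maps  # indexed as [prompt][row][column]


# Hypothetical layout: 0 = [CLS], 1-2 = prompts, 3 = [SINK], 4-7 = patches.
n_tokens = 8
logits = [[0.0] * n_tokens for _ in range(n_tokens)]
logits[1][5] = 2.0  # prompt 1 attends strongly to one patch
logits[2][3] = 2.0  # prompt 2 sends most of its mass to [SINK]

cube = prompt_saliency(logits, prompt_rows=[1, 2],
                       patch_cols=[4, 5, 6, 7], height=2, width=2)
mass_prompt_1 = sum(sum(row) for row in cube[0])
mass_prompt_2 = sum(sum(row) for row in cube[1])
```

In this toy case the second prompt's map carries less total mass than the first because the `[SINK]` key has absorbed most of its attention. No gradient is computed at any point; the maps come from a single forward pass over the attention weights.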

Load-bearing premise

That class-specific spectral prompts and the attraction loss on the sink token will reliably guide attention to semantically meaningful wavelength groups in a single forward pass without any backbone modification or gradient computation.

What would settle it

A hyperspectral dataset where the saliency maps and spectral curves produced by FOCUS show no measurable overlap with expert-annotated important wavelengths for the same classes.
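One way to make this test measurable is a band-level intersection-over-union between the bands a method ranks highest and the bands an expert marks as important. The sketch below is only an assumption about how such a score could be computed; the paper's exact IoU protocol is not given in the abstract, and `top_k_bands`, `band_iou`, the choice of `k`, and the toy values are all hypothetical.

```python
def top_k_bands(importance, k):
    """Return the indices of the k bands with the highest importance."""
    ranked = sorted(range(len(importance)),
                    key=lambda b: importance[b], reverse=True)
    return set(ranked[:k])


def band_iou(predicted, annotated):
    """Intersection-over-union of two sets of band indices."""
    predicted, annotated = set(predicted), set(annotated)
    union = predicted | annotated
    if not union:
        return 0.0  # both sets empty: define the overlap as zero
    return len(predicted & annotated) / len(union)


# Toy spectral importance curve over 8 bands (hypothetical values).
curve = [0.05, 0.10, 0.40, 0.90, 0.80, 0.20, 0.05, 0.02]
expert_bands = {3, 4, 5}  # hypothetical expert annotation

salient = top_k_bands(curve, k=3)
score = band_iou(salient, expert_bands)
```

Here the three most salient bands are 2, 3, and 4, which share two bands with the annotation, so the score is 0.5. A score near zero across classes would correspond to the falsifying outcome described above.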

Figures

Figures reproduced from arXiv: 2507.14787 by Anika Tabassum, Aristeidis Tsaris, John Lagergren, Larry M. York, Tianyang Wang, Xiao Wang, Xi Xiao.

**Figure 2.** Overview of FOCUS. A hyperspectral image is embedded and processed by a frozen ViT. FOCUS injects class-specific spectral prompts, each corresponding to meaningful spectral groups (e.g., VIS, red-edge, NIR), and introduces a dedicated [SINK] token to absorb redundant attention. Saliency is computed via prompt-to-patch attention, yielding a 3D attribution cube $T \in \mathbb{R}^{H \times W \times C}$ for spatial–spectral interpretation…

**Figure 3.** Band-wise interpretability across spectral groups. Compared to baseline methods (Grad-CAM and Prompt-CAM), FOCUS produces significantly clearer and biologically grounded saliency patterns, especially in the red-edge and shortwave infrared (SWIR) regions (columns 4–5). The red-edge band (700–750 nm) is known to reflect changes in chlorophyll concentration and leaf pigment activity, critical for identifying …

**Figure 4.** [image: figures/full_fig_p007_4.png]

**Figure 5.** Inference efficiency comparison. FOCUS achieves over 95% runtime and 85% memory reduction compared to vanilla ViT and D-CHAG [36], while providing complete spatial–spectral interpretability.

**Figure 7.** Layer-wise sink attention mass. Comparison of attention mass routed to the [SINK] token across transformer layers. Our full method (+Sink+Lsink) most effectively drains nondiscriminative attention, reducing accumulation over depth.

Original abstract

Hyperspectral imaging (HSI) captures hundreds of narrow, contiguous wavelength bands, making it a powerful tool in biology, agriculture, and environmental monitoring. However, interpreting Vision Transformers (ViTs) in this setting remains largely unexplored due to two key challenges: (1) existing saliency methods struggle to capture meaningful spectral cues, often collapsing attention onto the class token, and (2) full-spectrum ViTs are computationally prohibitive for interpretability, given the high-dimensional nature of HSI data. We present FOCUS, the first framework that enables reliable and efficient spatial-spectral interpretability for frozen ViTs. FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention. Together, these designs make it possible to generate stable and interpretable 3D saliency maps and spectral importance curves in a single forward pass, without any gradient backpropagation or backbone modification. FOCUS improves band-level IoU by 15 percent, reduces attention collapse by over 40 percent, and produces saliency results that align closely with expert annotations. With less than 1 percent parameter overhead, our method makes high-resolution ViT interpretability practical for real-world hyperspectral applications, bridging a long-standing gap between black-box modeling and trustworthy HSI decision-making.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces FOCUS, a framework for spatial-spectral interpretability of frozen Vision Transformers on hyperspectral imaging data. It adds class-specific spectral prompts and a learnable [SINK] token with an attraction loss to produce 3D saliency maps and spectral importance curves in a single forward pass, without gradients or backbone modification. The work claims a 15% gain in band-level IoU, over 40% reduction in attention collapse, close alignment with expert annotations, and less than 1% parameter overhead.

Significance. If the faithfulness of the saliency maps to the unmodified frozen ViT holds and the quantitative gains are robustly supported, the approach could meaningfully advance practical interpretability for high-dimensional HSI applications in biology, agriculture, and environmental monitoring. The single-pass efficiency and minimal overhead would be a clear practical strength for real-world deployment.

major comments (1)

[Abstract] The central claim that FOCUS delivers 'reliable' interpretability of the frozen ViT is load-bearing for the contribution, yet the addition of class-specific spectral prompts and the [SINK] token modifies the input token sequence. This alteration can shift attention weights among the original patch and class tokens in a single forward pass, even without backbone changes or gradients. It is therefore unclear whether the reported 15% IoU improvement and 40% collapse reduction reflect intrinsic spectral cues of the original model or are driven by the auxiliary loss and prompts; a targeted faithfulness check (e.g., comparison of attention before/after insertion on the same frozen backbone) is required.

minor comments (1)

[Abstract] Quantitative claims are stated without reference to experimental setup, baselines, number of runs, statistical significance testing, or dataset details, which hinders immediate assessment of reproducibility and effect-size reliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the faithfulness of the interpretability claims below and are prepared to strengthen the paper with additional analysis.

Point-by-point responses

Referee: The central claim that FOCUS delivers 'reliable' interpretability of the frozen ViT is load-bearing for the contribution, yet the addition of class-specific spectral prompts and the [SINK] token modifies the input token sequence. This alteration can shift attention weights among the original patch and class tokens in a single forward pass, even without backbone changes or gradients. It is therefore unclear whether the reported 15% IoU improvement and 40% collapse reduction reflect intrinsic spectral cues of the original model or are driven by the auxiliary loss and prompts; a targeted faithfulness check (e.g., comparison of attention before/after insertion on the same frozen backbone) is required.

Authors: We agree that the insertion of class-specific spectral prompts and the learnable [SINK] token modifies the input token sequence, which can influence attention weight distributions even with a frozen backbone. This is a substantive point that merits explicit verification to confirm that the observed improvements in band-level IoU and attention collapse reduction primarily capture intrinsic spectral cues rather than being artifacts of the auxiliary components. In the revised manuscript we will add a targeted faithfulness experiment that directly compares attention maps (on the original patch and class tokens) produced by the identical frozen ViT with and without the prompts and [SINK] token. This analysis will quantify any shifts and provide quantitative evidence supporting the reliability of the reported gains.

Revision: yes
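The promised comparison could take roughly the following shape. This is a sketch under assumptions, not the authors' protocol: it restricts one [CLS] attention row to the patch keys, renormalizes it, and measures the shift with a Jensen-Shannon divergence. The metric, the helper names `renormalize` and `js_divergence`, and the toy attention rows are all hypothetical choices made for illustration.

```python
import math


def renormalize(weights, keep):
    """Restrict an attention row to the `keep` keys and rescale to sum to 1."""
    kept = [weights[k] for k in keep]
    total = sum(kept)
    return [w / total for w in kept]


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


# [CLS] attention row of the unmodified model: keys = [CLS, p1, p2, p3, p4].
baseline = [0.2, 0.4, 0.2, 0.1, 0.1]
# Same query after inserting one prompt and [SINK]:
# keys = [CLS, prompt, SINK, p1, p2, p3, p4].
preserved = [0.1, 0.1, 0.4, 0.2, 0.1, 0.05, 0.05]  # patch ranking unchanged
shifted = [0.1, 0.1, 0.4, 0.05, 0.05, 0.1, 0.2]    # patch ranking reversed

base_patch = renormalize(baseline, keep=[1, 2, 3, 4])
js_preserved = js_divergence(base_patch, renormalize(preserved, keep=[3, 4, 5, 6]))
js_shifted = js_divergence(base_patch, renormalize(shifted, keep=[3, 4, 5, 6]))
```

A divergence near zero would indicate that the added tokens only drain mass without reordering the patches, which is the outcome the faithfulness claim needs. A large value would support the referee's concern that the saliency reflects the auxiliary tokens rather than the original model.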

Circularity Check

0 steps flagged

No circularity: novel components introduced without reduction to inputs or self-citation chains

Full rationale

The paper introduces FOCUS via two explicitly new design elements—class-specific spectral prompts and a learnable [SINK] token with attraction loss—presented as additions that enable single-forward-pass saliency on frozen ViTs. No equations, derivations, or fitted parameters are shown that reduce by construction to prior outputs or self-citations. The reported gains (15% IoU, 40% collapse reduction) are framed as empirical outcomes rather than forced by definition or imported uniqueness theorems. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to identify or audit specific free parameters, axioms, or invented entities; full manuscript required for complete ledger.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: FOCUS introduces two core components: class-specific spectral prompts that guide attention toward semantically meaningful wavelength groups, and a learnable [SINK] token trained with an attraction loss to absorb noisy or redundant attention.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: $\mathcal{L}_{\text{sink}} = -\lambda \sum_{\ell} \frac{1}{|\mathcal{H}_{\text{aux}}|} \sum_{h \in \mathcal{H}_{\text{aux}}} \operatorname{mean}_{i} A_{\ell,h}[i, k_{\text{sink}}]$
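Read term by term, the sink loss quoted above averages, over query positions i, the attention each auxiliary head assigns to the [SINK] key, averages that over the auxiliary heads, sums over layers, and negates the result scaled by λ. The sketch below evaluates that expression on a toy attention tensor. It assumes post-softmax weights stored as nested lists and assumes the mean runs over every query row, which the passage does not specify; the function name and the toy values are hypothetical.

```python
def sink_attraction_loss(attn, aux_heads, sink_index, lam=1.0):
    """Evaluate the sink loss on attn[layer][head][query][key].

    attn holds post-softmax attention weights; aux_heads lists the head
    indices that are encouraged to route attention to the sink key.
    """
    total = 0.0
    for layer in attn:
        head_sum = 0.0
        for h in aux_heads:
            rows = layer[h]
            # mean over queries of the weight placed on the sink key
            head_sum += sum(row[sink_index] for row in rows) / len(rows)
        total += head_sum / len(aux_heads)
    return -lam * total


# Toy tensor: 2 layers, 2 heads, 2 queries, 3 keys; key 2 is the sink.
attn = [
    [  # layer 0
        [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8]],  # head 0 (auxiliary)
        [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]],  # head 1 (not auxiliary)
    ],
    [  # layer 1
        [[0.4, 0.4, 0.2], [0.3, 0.3, 0.4]],  # head 0 (auxiliary)
        [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],  # head 1 (not auxiliary)
    ],
]

loss = sink_attraction_loss(attn, aux_heads=[0], sink_index=2, lam=1.0)
```

Because of the leading minus sign, minimizing this value pushes more attention mass from the auxiliary heads onto the sink key, which matches the "attraction" wording. Heads outside the auxiliary set do not contribute.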

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG · 2026-04 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1] Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4190–4197, 2020.

[2] Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, Andrea Pilzer, and Rita Cucchiara. Learning to mask and permute visual tokens for vision transformer pre-training. Computer Vision and Image Understanding, 2025. arXiv:2306.07346.

[3] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847. IEEE, 2018.

[4] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, 2021.

[5] Hila Chefer, Idan Schwartz, and Lior Wolf. Optimizing relevance maps of vision transformers improves robustness. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[6] Chaofan Chen, Oscar Li, Chaofan Tao, Alina J. Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems (NeurIPS), pages 8821–8832, 2019.

[7] Zhao-Min Chen, Quan Cui, Borui Zhao, Renjie Song, Xiaoqin Zhang, and Osamu Yoshie. Sst: Spatial and semantic transformers for multi-label image recognition. IEEE Transactions on Image Processing, 31:2570–2583, 2022.

[8] Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G Campolongo, Daniel Rubenstein, Charles V Stewart, Anuj Karpatne, et al. Prompt-cam: A simpler interpretable transformer for fine-grained analysis. arXiv preprint arXiv:2501.09333, 2025.

[9] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023.

[10] Timothée Darcet, Amir Dezfouli, Maximilian Igl, André Barreto, John Quan, Vedanuj Misra, Jared Kaplan, George Tucker, Tom Schaul, Xavier Puig, and Emilio Parisotto. Vision transformers need registers. In Proceedings of the International Conference on Learning Representations (ICLR),

[12] Saurabh Desai and Harish G. Ramaswamy. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 972–980, 2020.

[13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[14] Heng Fan and Haibin Ling. Sanet: Structure-aware network for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 42–49, 2017.

[15] Mohsen Fayyaz, Soroush A. Koohpayegani, Farnoush R. Jafari, Sunando Sengupta, Hamid R. Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Juergen Gall. Adaptive token sampling for efficient vision transformers. arXiv preprint arXiv:2111.15667, 2021.

[16] Wenfeng Feng and Guoying Sun. Edit: Enhancing vision transformers by mitigating attention sink through an encoder–decoder architecture. arXiv preprint arXiv:2504.06738, 2025.

[17] Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. Axiom-based grad-cam: Towards accurate visualization and explanation of cnns. arXiv preprint arXiv:2008.02312, 2020.

[18] Gaowei Guo, Zhaoxu Li, Wei An, Yingqian Wang, Xu He, Yihang Luo, Qiang Ling, Miao Li, and Zaiping Lin. Sptrack: Spectral similarity prompt learning for hyperspectral object tracking. Remote Sensing, 16(16):2975, 2024.

[19] Yun He, Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, et al. Hyperprompt: Prompt-based task-conditioning of transformers. In International Conference on Machine Learning, pages 8678–8690. PMLR, 2022.

[20] Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. Spectralformer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2021.

[21] Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. Spectralformer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:5518615, 2022.

[22] Xixu Hu, Runkai Zheng, Jindong Wang, Cheuk Hang Leung, Qi Wu, and Xing Xie. Specformer: Guarding vision transformer robustness via maximum singular value penalization. arXiv preprint arXiv:2402.03317, 2024.

[23] Peng-Tao Jiang, Chang-Bin Zhang, Qibin Hou, Ming-Ming Cheng, and Yunchao Wei. Layercam: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing, 30:5875–5888, 2021.

[24] William Michael Laprade, Pawel Pieta, Svetlana Kutuzova, Jesper Cairo Westergaard, Mads Nielsen, Svend Christensen, and Anders Bjorholm Dahl. HyperLeaf2024 – a hyperspectral imaging dataset for classification and regression of wheat leaves. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (FGVC 11), 2024.

[25] Xingjian Li, Qiming Zhao, Neelesh Bisht, Mostofa Rafid Uddin, Jin Yu Kim, Bryan Zhang, and Min Xu. Diffcam: Data-driven saliency maps by capturing feature differences. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10327–10337, 2025.

[26] Ying Li, Haokui Zhang, and Qiang Shen. Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing, 9(1):67, 2017.

[27] Yanan Luo, Jie Zou, Chengfei Yao, Xiaosong Zhao, Tao Li, and Gang Bai. Hsi-cnn: A novel convolution neural network for hyperspectral image. In 2018 International Conference on Audio, Language and Image Processing (ICALIP), pages 464–469. IEEE, 2018.

[28] Mohammed Bany Muhammad and Mohammed Yeasin. Eigen-cam: Class activation map using principal components. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2020.

[29] Daniel Omeiza, Skyler Speakman, Celia Cintas, and Komminist Weldermariam. Smooth grad-cam++: An enhanced inference level visualization technique for deep convolutional neural network models. In Proceedings of the Intelligent Systems Conference (IntelliSys), 2019. arXiv:1908.01224.

[30] Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David E. Carlyn, Samuel Stevens, Kaiya L. Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, and Wei-Lun Chao. A simple interpretable transformer for fine-grained image classification and analysis. arXiv preprint arXiv:2311.04157,

[31] Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC), 2018.

[32] Bishwas Praveen and Vineetha Menon. Hyper-vit: A novel light-weighted visual transformer-based supervised classification framework for hyperspectral remote sensing applications. In 2022 12th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), pages 1–5. IEEE, 2022.

[33] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626,

[34] Yanpeng Sun and Zechao Li. Ssa: Semantic structure aware inference for weakly pixel-wise dense predictions without cost. arXiv preprint arXiv:2111.03392, 2021.

[36] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 32–42,

[37] Aristeidis Tsaris, Isaac Lyngaas, John Lagergren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, and Xiao Wang. Distributed cross-channel hierarchical aggregation for foundation models. arXiv preprint arXiv:2506.21411, 2025.

[38] Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 24–25, 2020.

[39] Hongjie Wang, Bhishma Dedhia, and Niraj K. Jha. Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

[40] Xiao Wang, Amit Sabne, Sherman Kisner, Anand Raghunathan, Charles Bouman, and Samuel Midkiff. High performance model based image reconstruction. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2016. Association for Computing Machinery.

[41] Guangxuan Xiao, Chunyuan Li, Lei Hou, Hao Ma, Ming Sun, Nan Du Wang, Qizhe Wu, Yue Zhang, Lichen Shou, Lei Zhou, Hongyu Zhang, Fei Wu, Haifeng Wang, and Jingren Zhou. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023.

[42] Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, and Min Xu. Visual instance-aware prompt tuning, 2025.

[43] Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, and Min Xu. Visual variational autoencoder prompt tuning, 2025.

[44] Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, and Douglas Creighton. Hsvit: Horizontally scalable vision transformer. arXiv preprint arXiv:2404.05196, 2024.

[45] Mengqi Xue, Qihan Huang, Haofei Zhang, Lechao Cheng, Jie Song, Minghui Wu, and Mingli Song. Protopformer: Concentrating on prototypical parts in vision transformers for interpretable image recognition. arXiv preprint arXiv:2208.10431, 2022.

[46] Jing Yao, Xiangyong Cao, Danfeng Hong, Xin Wu, Deyu Meng, Jocelyn Chanussot, and Zongben Xu. Semi-active convolutional neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2022.

[47] Jianping Zhang, Yizhan Huang, Weibin Wu, and Michael R. Lyu. Transferable adversarial attacks on vision transformers with token gradient regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

[48] Xuemei Zhang, Boris A. Vinatzer, and Song Li. Hyperspectral imaging analysis for early detection of tomato bacterial leaf spot disease. Scientific Reports, 14:27666, 2024.

[49] Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.

[1] [1]

Quantifying attention flow in transformers

Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), pages 4190–4197, 2020. 2, 5

work page 2020

[2] [2]

Learning to mask and permute visual tokens for vision transformer pre- training

Lorenzo Baraldi, Roberto Amoroso, Marcella Cornia, An- drea Pilzer, and Rita Cucchiara. Learning to mask and permute visual tokens for vision transformer pre- training. Computer Vision and Image Understanding, 2025. arXiv:2306.07346. 8

work page arXiv 2025

[3] [3]

Grad-cam++: General- ized gradient-based visual explanations for deep convolu- tional networks

Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-cam++: General- ized gradient-based visual explanations for deep convolu- tional networks. In 2018 IEEE winter conference on appli- cations of computer vision (WACV) , pages 839–847. IEEE,

work page 2018

[4] [4]

Transformer inter- pretability beyond attention visualization

Hila Chefer, Shir Gur, and Lior Wolf. Transformer inter- pretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 782–791, 2021. 2, 5

work page 2021

[5] [5]

Optimiz- ing relevance maps of vision transformers improves robust- ness

Hila Chefer, Idan Schwartz, and Lior Wolf. Optimiz- ing relevance maps of vision transformers improves robust- ness. In Advances in Neural Information Processing Systems (NeurIPS), 2022. 7, 8

work page 2022

[6] [6]

Barnett, Jonathan Su, and Cynthia Rudin

Chaofan Chen, Oscar Li, Chaofan Tao, Alina J. Barnett, Jonathan Su, and Cynthia Rudin. This looks like that: Deep learning for interpretable image recognition. In Advances in Neural Information Processing Systems (NeurIPS), pages 8821–8832, 2019. 5

work page 2019

[7] [7]

Sst: Spatial and semantic transformers for multi-label image recognition

Zhao-Min Chen, Quan Cui, Borui Zhao, Renjie Song, Xi- aoqin Zhang, and Osamu Yoshie. Sst: Spatial and semantic transformers for multi-label image recognition. IEEE Trans- actions on Image Processing, 31:2570–2583, 2022. 2

work page 2022

[8] [8]

Prompt-cam: A simpler interpretable transformer for fine-grained analysis

Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G Campolongo, Daniel Rubenstein, Charles V Stewart, Anuj Karpatne, et al. Prompt-cam: A simpler interpretable transformer for fine-grained analysis. arXiv preprint arXiv:2501.09333, 2025. 2, 3, 5, 6, 7

work page arXiv 2025

[9] [9]

Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023. 3


[10] [10]

Vision transformers need registers

Timothée Darcet, Amir Dezfouli, Maximilian Igl, André Barreto, John Quan, Vedanuj Misra, Jared Kaplan, George Tucker, Tom Schaul, Xavier Puig, and Emilio Parisotto. Vision transformers need registers. In Proceedings of the International Conference on Learning Representations (ICLR),


[11] [12]

Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization

Saurabh Desai and Harish G. Ramaswamy. Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pages 972–980, 2020. 5


[12] [13]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 3


[13] [14]

Sanet: Structure-aware network for visual tracking

Heng Fan and Haibin Ling. Sanet: Structure-aware network for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 42–49, 2017. 3, 5, 6, 7


[14] [15]

Adaptive token sampling for efficient vision transformers

Mohsen Fayyaz, Soroush A. Koohpayegani, Farnoush R. Jafari, Sunando Sengupta, Hamid R. Vaezi Joze, Eric Sommerlade, Hamed Pirsiavash, and Juergen Gall. Adaptive token sampling for efficient vision transformers. arXiv preprint arXiv:2111.15667, 2021. 3


[15] [16]

Edit: Enhancing vision transformers by mitigating attention sink through an encoder–decoder architecture

Wenfeng Feng and Guoying Sun. Edit: Enhancing vision transformers by mitigating attention sink through an encoder–decoder architecture. arXiv preprint arXiv:2504.06738, 2025. 3


[16] [17]

Axiom-based grad-cam: Towards accurate visualization and explanation of cnns

Ruigang Fu, Qingyong Hu, Xiaohu Dong, Yulan Guo, Yinghui Gao, and Biao Li. Axiom-based grad-cam: Towards accurate visualization and explanation of cnns. arXiv preprint arXiv:2008.02312, 2020. 2, 5


[17] [18]

Sptrack: Spectral similarity prompt learning for hyperspectral object tracking

Gaowei Guo, Zhaoxu Li, Wei An, Yingqian Wang, Xu He, Yihang Luo, Qiang Ling, Miao Li, and Zaiping Lin. Sptrack: Spectral similarity prompt learning for hyperspectral object tracking. Remote Sensing, 16(16):2975, 2024. 3


[18] [19]

Hyperprompt: Prompt-based task-conditioning of transformers

Yun He, Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, et al. Hyperprompt: Prompt-based task-conditioning of transformers. In International conference on machine learning, pages 8678–8690. PMLR, 2022. 3, 5, 6, 7


[19] [20]

Spectralformer: Rethinking hyperspectral image classification with transformers

Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. Spectralformer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2021. 2


[20] [21]

Spectralformer: Rethinking hyperspectral image classification with transformers

Danfeng Hong, Zhu Han, Jing Yao, Lianru Gao, Bing Zhang, Antonio Plaza, and Jocelyn Chanussot. Spectralformer: Rethinking hyperspectral image classification with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:5518615, 2022. 3


[21] [22]

Specformer: Guarding vision transformer robustness via maximum singular value penalization

Xixu Hu, Runkai Zheng, Jindong Wang, Cheuk Hang Leung, Qi Wu, and Xing Xie. Specformer: Guarding vision transformer robustness via maximum singular value penalization. arXiv preprint arXiv:2402.03317, 2024. 3


[22] [23]

Layercam: Exploring hierarchical class activation maps for localization

Peng-Tao Jiang, Chang-Bin Zhang, Qibin Hou, Ming-Ming Cheng, and Yunchao Wei. Layercam: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing, 30:5875–5888, 2021. 2, 5


[23] [24]

HyperLeaf2024 – a hyperspectral imaging dataset for classification and regression of wheat leaves

William Michael Laprade, Pawel Pieta, Svetlana Kutuzova, Jesper Cairo Westergaard, Mads Nielsen, Svend Christensen, and Anders Bjorholm Dahl. HyperLeaf2024 – a hyperspectral imaging dataset for classification and regression of wheat leaves. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (FGVC 11), 2024. 5


[24] [25]

Diffcam: Data-driven saliency maps by capturing feature differences

Xingjian Li, Qiming Zhao, Neelesh Bisht, Mostofa Rafid Uddin, Jin Yu Kim, Bryan Zhang, and Min Xu. Diffcam: Data-driven saliency maps by capturing feature differences. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10327–10337, 2025. 2, 5, 6, 7


[25] [26]

Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network

Ying Li, Haokui Zhang, and Qiang Shen. Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing, 9(1):67, 2017. 2


[26] [27]

Hsi-cnn: A novel convolution neural network for hyperspectral image

Yanan Luo, Jie Zou, Chengfei Yao, Xiaosong Zhao, Tao Li, and Gang Bai. Hsi-cnn: A novel convolution neural network for hyperspectral image. In 2018 International Conference on Audio, Language and Image Processing (ICALIP), pages 464–469. IEEE, 2018. 2


[27] [28]

Eigen-cam: Class activation map using principal components

Mohammed Bany Muhammad and Mohammed Yeasin. Eigen-cam: Class activation map using principal components. In 2020 international joint conference on neural networks (IJCNN), pages 1–7. IEEE, 2020. 2, 5


[28] [29]

Smooth grad-cam++: An enhanced inference level visualization technique for deep convolutional neural network models

Daniel Omeiza, Skyler Speakman, Celia Cintas, and Komminist Weldermariam. Smooth grad-cam++: An enhanced inference level visualization technique for deep convolutional neural network models. In Proceedings of the Intelligent Systems Conference (IntelliSys), 2019. arXiv:1908.01224. 2


[29] [30]

A simple interpretable transformer for fine-grained image classification and analysis

Dipanjyoti Paul, Arpita Chowdhury, Xinqi Xiong, Feng-Ju Chang, David E. Carlyn, Samuel Stevens, Kaiya L. Provost, Anuj Karpatne, Bryan Carstens, Daniel Rubenstein, Charles Stewart, Tanya Berger-Wolf, Yu Su, and Wei-Lun Chao. A simple interpretable transformer for fine-grained image classification and analysis. arXiv preprint arXiv:2311.04157,


[30] [31]

Rise: Randomized input sampling for explanation of black-box models

Vitali Petsiuk, Abir Das, and Kate Saenko. Rise: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference (BMVC), 2018. 5


[31] [32]

Hyper-vit: A novel light-weighted visual transformer-based supervised classification framework for hyperspectral remote sensing applications

Bishwas Praveen and Vineetha Menon. Hyper-vit: A novel light-weighted visual transformer-based supervised classification framework for hyperspectral remote sensing applications. In 2022 12th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), pages 1–5. IEEE, 2022. 2


[32] [33]

Grad-cam: Visual explanations from deep networks via gradient-based localization

Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626,


[33] [34]

Ssa: Semantic structure aware inference for weakly pixel-wise dense predictions without cost

Yanpeng Sun and Zechao Li. Ssa: Semantic structure aware inference for weakly pixel-wise dense predictions without cost. arXiv preprint arXiv:2111.03392, 2021. 3, 5


[34] [36]

Going deeper with image transformers

Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 32–42,


[35] [37]

Distributed cross-channel hierarchical aggregation for foundation models

Aristeidis Tsaris, Isaac Lyngaas, John Lagergren, Mohamed Wahib, Larry York, Prasanna Balaprakash, Dan Lu, Feiyi Wang, and Xiao Wang. Distributed cross-channel hierarchical aggregation for foundation models. arXiv preprint arXiv:2506.21411, 2025. 8


[36] [38]

Score-cam: Score-weighted visual explanations for convolutional neural networks

Haofan Wang, Zifan Wang, Mengnan Du, Fan Yang, Zijian Zhang, Sirui Ding, Piotr Mardziel, and Xia Hu. Score-cam: Score-weighted visual explanations for convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 24–25, 2020. 2, 5


[37] [39]

Hongjie Wang, Bhishma Dedhia, and Niraj K. Jha. Zero-tprune: Zero-shot token pruning through leveraging of the attention graph in pre-trained transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 3


[38] [40]

High performance model based image reconstruction

Xiao Wang, Amit Sabne, Sherman Kisner, Anand Raghunathan, Charles Bouman, and Samuel Midkiff. High performance model based image reconstruction. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, NY, USA, 2016. Association for Computing Machinery. 3


[39] [41]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Chunyuan Li, Lei Hou, Hao Ma, Ming Sun, Nan Du Wang, Qizhe Wu, Yue Zhang, Lichen Shou, Lei Zhou, Hongyu Zhang, Fei Wu, Haifeng Wang, and Jingren Zhou. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023. 2, 3, 6, 7


[40] [42]

Visual instance-aware prompt tuning

Xi Xiao, Yunbei Zhang, Xingjian Li, Tianyang Wang, Xiao Wang, Yuxiang Wei, Jihun Hamm, and Min Xu. Visual instance-aware prompt tuning, 2025. 2


[41] [43]

Visual variational autoencoder prompt tuning

Xi Xiao, Yunbei Zhang, Yanshuh Li, Xingjian Li, Tianyang Wang, Jihun Hamm, Xiao Wang, and Min Xu. Visual variational autoencoder prompt tuning, 2025. 2


[42] [44]

Hsvit: Horizontally scalable vision transformer

Chenhao Xu, Chang-Tsun Li, Chee Peng Lim, and Douglas Creighton. Hsvit: Horizontally scalable vision transformer. arXiv preprint arXiv:2404.05196, 2024. 2


[43] [45]

Protopformer: Concentrating on prototypical parts in vision transformers for interpretable image recognition

Mengqi Xue, Qihan Huang, Haofei Zhang, Lechao Cheng, Jie Song, Minghui Wu, and Mingli Song. Protopformer: Concentrating on prototypical parts in vision transformers for interpretable image recognition. arXiv preprint arXiv:2208.10431, 2022. 5


[44] [46]

Semi-active convolutional neural networks for hyperspectral image classification

Jing Yao, Xiangyong Cao, Danfeng Hong, Xin Wu, Deyu Meng, Jocelyn Chanussot, and Zongben Xu. Semi-active convolutional neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing, 60:1–15, 2022. 3


[45] [47]

Jianping Zhang, Yizhan Huang, Weibin Wu, and Michael R. Lyu. Transferable adversarial attacks on vision transformers with token gradient regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. 2, 5


[46] [48]

Hyperspectral imaging analysis for early detection of tomato bacterial leaf spot disease

Xuemei Zhang, Boris A. Vinatzer, and Song Li. Hyperspectral imaging analysis for early detection of tomato bacterial leaf spot disease. Scientific Reports, 14:27666, 2024. 5


[47] [49]

Deepvit: Towards deeper vision transformer

Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021. 7
