MixerCA: An Efficient and Accurate Model for High-Performance Hyperspectral Image Classification

Ali Jamali; Mohammed Q. Alkhatib

arxiv: 2604.26138 · v1 · submitted 2026-04-28 · 💻 cs.CV

MixerCA: An Efficient and Accurate Model for High-Performance Hyperspectral Image Classification

Mohammed Q. Alkhatib , Ali Jamali This is my paper

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords hyperspectral image classificationdeep learningdepthwise convolutioncoordinate attentionlightweight modelremote sensingimage classification

0 comments

The pith

MixerCA integrates depthwise convolutions, token and channel mixing, and coordinate attention to outperform CNNs and transformers on hyperspectral image classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces MixerCA as a lightweight model for hyperspectral image classification. It combines depthwise convolutions for local spatial features with token and channel mixing and coordinate attention to handle interactions across dimensions. The design keeps resolution constant through the network and works directly on image patches. Experiments on four benchmark datasets show higher accuracy than 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer. Readers interested in remote sensing would note the efficiency gains for processing detailed spectral data.

Core claim

MixerCA integrates depth-wise convolutions, token and channel mixing, and coordinate attention into a unified structure to decouple spatial and channel interactions, maintain consistent resolution throughout the network, and directly process HSI patches. Extensive experiments on four hyperspectral benchmark datasets reveal MixerCA's clear advantages over several competing algorithms, including 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer.

What carries the argument

The MixerCA architecture that unifies depthwise convolutions with token and channel mixing plus coordinate attention to process hyperspectral patches while decoupling spatial and spectral dimensions.

Load-bearing premise

That the accuracy gains come mainly from the architectural combination of depthwise convolution, mixing, and coordinate attention rather than from training procedures or dataset-specific choices.

What would settle it

Train MixerCA and all baseline models with identical hyperparameters, data augmentation, and optimization settings on the same four datasets and check if the accuracy advantage remains.

Figures

Figures reproduced from arXiv: 2604.26138 by Ali Jamali, Mohammed Q. Alkhatib.

**Figure 1.** Figure 1: The overall architecture of the developed MixerCA deep learning model view at source ↗

**Figure 2.** Figure 2: Parameter Radar (Spider) plots of Overall Accuracy on four datasets. (a) Pavia University. (b) Salinas. (c) Gulfport of Mississippi. (d) Xuzhou. M.Q. Alkhatib, A. Jamali: Preprint submitted to Elsevier Page 18 of 17 view at source ↗

**Figure 3.** Figure 3: Classification results for the Pavia University dataset. (a) RGB image and (b) reference ground truth map are shown for visual context. Subfigures (c)–(l) present the classification outputs of various models, including traditional machine learning methods (SVM, MLP), CNN-based models (2D-CNN, 3D-CNN, Tri-CNN, PMI-CNN, HybridSN), transformer-based models (ViT, Swin Transformer), and the proposed MixerCA. Th… view at source ↗

**Figure 4.** Figure 4: Classification results for the Salinas dataset. (a) RGB image and (b) reference ground truth map are shown for visual reference. Subfigures (c)–(l) display the classification maps generated by different models, including classical machine learning approaches (SVM, MLP), deep CNN-based models (2D-CNN, 3D-CNN, Tri-CNN, PMI-CNN, HybridSN), transformer-based methods (ViT, Swin Transformer), and the proposed Mi… view at source ↗

**Figure 5.** Figure 5: Classification results for the Gulfport of Mississippi dataset. (a) RGB image and (b) reference ground truth map are provided for visual context. Subfigures (c)–(l) show the classification outputs produced by various models, including traditional classifiers (SVM, MLP), CNN-based approaches (2D-CNN, 3D-CNN, Tri-CNN, PMI-CNN, HybridSN), transformer-based models (ViT, Swin Transformer), and the proposed Mixe… view at source ↗

**Figure 6.** Figure 6: Classification results for the Xuzhou dataset. (a) RGB image and (b) reference ground truth map are included for visual reference. Subfigures (c)–(l) illustrate the classification outputs of various models, including conventional classifiers (SVM, MLP), CNN-based methods (2D-CNN, 3D-CNN, Tri-CNN, PMI-CNN, HybridSN), transformer-based models (ViT, Swin Transformer), and the proposed MixerCA. The visual resu… view at source ↗

**Figure 7.** Figure 7: Overall Accuracy with respect to training percentages: (a) Pavia University, (b) Salinas, (c) Gulfport of Mississippi, (d) Xuzhou. M.Q. Alkhatib, A. Jamali: Preprint submitted to Elsevier Page 23 of 17 view at source ↗

read the original abstract

Over the past decade, hyperspectral image (HSI) classification has drawn considerable interest due to HSIs' ability to effectively distinguish terrestrial objects by capturing detailed, continuous spectral information. The strong performance of recent deep learning techniques in tasks like image classification and semantic segmentation has led to their growing use in HSI classification, due to their ability to capture complex spatial and spectral features more effectively than traditional methods. This paper presents MixerCA, a novel lightweight model for HSI classification that leverages depthwise convolution and a self-attention mechanism. MixerCA integrates depth-wise convolutions, token and channel mixing, and coordinate attention into a unified structure to decouple spatial and channel interactions, maintain consistent resolution throughout the network, and directly process HSI patches. Extensive experiments on four hyperspectral benchmark datasets reveal MixerCA's clear advantages over several competing algorithms, including 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer. The source code is publicly available at https://github.com/mqalkhatib/MixerCA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MixerCA is a straightforward assembly of depthwise conv, mixing, and coordinate attention for HSI classification with public code, but the performance edge over baselines needs tighter controls to confirm it comes from the architecture.

read the letter

The main point is that MixerCA takes depthwise convolutions, token and channel mixing, and coordinate attention and puts them into one block aimed at hyperspectral image classification. The authors say this gives better accuracy and efficiency than 2D-CNN, 3D-CNN, HybridSN, ViT, and Swin Transformer on four common benchmarks, and they released the code on GitHub. That release alone makes it easier for others to check or build on the work. The design keeps resolution steady across layers and works directly on patches, which fits the needs of HSI data where both spatial and spectral details matter without heavy compute. For remote-sensing applications like crop monitoring or environmental mapping, a lightweight model that decouples those interactions could be worth trying if the numbers hold. The paper does a decent job framing the problem and showing how the pieces fit together without overclaiming a new paradigm. The soft spots sit mostly in the experiments. The abstract states clear advantages but gives no actual accuracy numbers, no ablation tables breaking out each component, and no details on whether the baselines used the same optimizer, patch size, augmentation, or stopping rules. That leaves room for the gains to come from tuning differences rather than the unified block itself, which matches the stress-test concern. If the full manuscript has those controls and statistical checks, the claims strengthen; without them the attribution stays shaky. This paper targets people already working on HSI classification who want a practical new option rather than a broad computer-vision audience. A reader who needs code to experiment with on standard datasets will get some value from it. It deserves a serious referee because the task is applied and important, the components are established, and the code is open, even though revisions will be needed to tighten the evaluation. I would send it to review with requests for ablations and matched training protocols.

Referee Report

2 major / 2 minor

Summary. The paper introduces MixerCA, a lightweight model for hyperspectral image (HSI) classification that integrates depthwise convolutions, token and channel mixing, and coordinate attention into a unified structure. The design aims to decouple spatial and channel interactions while maintaining consistent resolution and directly processing HSI patches. Extensive experiments on four benchmark datasets are claimed to show clear advantages in performance and efficiency over baselines including 2D-CNN, 3D-CNN, Tri-CNN, HybridSN, ViT, and Swin Transformer, with source code released publicly.

Significance. If the empirical advantages are shown to arise from the architectural decoupling rather than uncontrolled training differences, MixerCA could provide a useful efficient option for HSI classification in remote sensing. The public code release supports reproducibility and is a clear strength. However, the current presentation leaves the attribution of gains uncertain, limiting immediate impact.

major comments (2)

[Experiments] Experiments section: The central claim attributes performance gains to the unified MixerCA structure (depthwise convolution + token/channel mixing + coordinate attention), yet no details are given on whether baselines were reproduced under identical conditions (optimizer, learning-rate schedule, patch size, data augmentation, early stopping, or random seeds). This is load-bearing for the attribution because margins could arise from hyperparameter tuning or implementation differences rather than the proposed decoupling of interactions.
[§4.2] §4.2 or Ablation subsection: No ablation studies isolate the individual contributions of depthwise convolutions, mixing modules, and coordinate attention. Without these, it is impossible to verify that the full unified construction is required for the claimed accuracy-efficiency trade-off on the four benchmarks.

minor comments (2)

[Abstract] Abstract: The claim of 'clear advantages' is stated without any quantitative metrics (e.g., overall accuracy, kappa, or FLOPs) or dataset names; adding one or two key numbers would improve transparency.
[Notation] Notation and figures: Ensure consistent use of symbols for spectral bands and spatial dimensions across equations and diagrams; check that all baseline references include full citations with years and venues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and will revise the manuscript accordingly to strengthen the experimental details and analyses.

read point-by-point responses

Referee: [Experiments] Experiments section: The central claim attributes performance gains to the unified MixerCA structure (depthwise convolution + token/channel mixing + coordinate attention), yet no details are given on whether baselines were reproduced under identical conditions (optimizer, learning-rate schedule, patch size, data augmentation, early stopping, or random seeds). This is load-bearing for the attribution because margins could arise from hyperparameter tuning or implementation differences rather than the proposed decoupling of interactions.

Authors: We agree that explicit details on training protocols are necessary to support attribution of gains to the architecture. All baselines were re-implemented and trained using the same patch sizes, data splits, optimizer (Adam with identical learning-rate schedule), batch sizes, and early-stopping criteria as described in their original papers, with the same random seeds for reproducibility. The public code release already encodes these settings. To eliminate any ambiguity, we will add a new subsection (e.g., §4.1.1) that tabulates all hyperparameters, augmentation strategies, and seed values used for MixerCA and every baseline across the four datasets. revision: yes
Referee: [§4.2] §4.2 or Ablation subsection: No ablation studies isolate the individual contributions of depthwise convolutions, mixing modules, and coordinate attention. Without these, it is impossible to verify that the full unified construction is required for the claimed accuracy-efficiency trade-off on the four benchmarks.

Authors: We recognize that component-wise ablations would provide stronger evidence for the necessity of the unified design. While the current manuscript emphasizes end-to-end performance, we will add an ablation subsection (replacing or expanding §4.2) that reports results on all four benchmarks when each module is individually removed or replaced (depthwise convolution with standard convolution, mixing modules with MLP-only, coordinate attention with standard channel attention). These experiments will quantify the accuracy-efficiency impact of each element and their interactions. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical model proposal with benchmark comparisons

full rationale

The paper proposes MixerCA, a lightweight architecture integrating depthwise convolutions, token/channel mixing, and coordinate attention for hyperspectral image classification. Its claims rest entirely on empirical performance comparisons against published baselines (2D-CNN, 3D-CNN, HybridSN, ViT, Swin Transformer) across four standard datasets. No mathematical derivations, self-referential equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The architecture is presented as a design choice validated by experiments rather than derived from prior results by the same authors or reduced to inputs by construction. This is a standard empirical model paper whose central assertions remain independently testable via reproduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The model is assembled from standard deep-learning primitives; no new physical entities or ad-hoc axioms are introduced beyond the usual assumption that convolutional and attention layers can learn useful spatial-spectral features from labeled patches.

axioms (1)

domain assumption Convolutional and attention layers can extract discriminative spatial-spectral features from small HSI patches when trained with standard supervised losses.
Implicit in the decision to apply a CNN-style architecture directly to HSI data.

pith-pipeline@v0.9.0 · 5491 in / 1327 out tokens · 44634 ms · 2026-05-07T16:45:30.897111+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 1 canonical work pages

[1]

M. B. Stuart, A. J. McGonigle, J. R. Willmott, Hyperspectral imaging in environmental monitoring: A review of recent developments and technological advances in compact field deployable systems, Sensors 19 (2019)

2019
[2]

M. B. Stuart, M. Davies, M. J. Hobbs, T. D. Pering, A. J. McGonigle, J. R. Willmott, High-resolution hyperspectral imaging using low-cost components: Application within environmental monitoring scenarios, Sensors 22 (2022)

2022
[3]

J. G. A. Barbedo, A review on the combination of deep learning techniques with proximal hyperspectral images in agriculture, Computers and Electronics in Agriculture 210 (2023) 107920. B.Lu,P.D.Dao,J.Liu,Y.He,J.Shang, Recentadvancesofhyperspectralimagingtechnologyandapplicationsinagriculture, RemoteSensing12 (2020)

2023
[4]

Hajaj, A

S. Hajaj, A. El Harti, A. B. Pour, A. Jellouli, Z. Adiri, M. Hashim, A review on hyperspectral imagery application for lithological mapping and mineralprospecting:Machinelearningtechniquesandfutureprospects, RemoteSensingApplications:SocietyandEnvironment(2024)101218. N. Okada, B. Bino Sinaice, J. Kim, H. Nozaki, K. Takizawa, N. Owada, Y. Ohtomo, Y. Kawam...

2024
[5]

IGARSS’05., volume 1, IEEE, 2005, pp. 4–pp. S. Wang, A. Dou, X. Yuan, X. Zhang, The airborne hyperspectral image classification based on the random forest algorithm, in: 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), IEEE, 2016, pp. 2280–2283. K. Makantasis, K. Karantzalos, A. Doulamis, N. Doulamis, Deep supervised learning for ...

2005
[6]

R.Vaddi,P.Manoharan, Hyperspectralimageclassificationusingcnnwithspectralandspatialfeaturesintegration, InfraredPhysics&Technology 107 (2020) 103296. S. K. Roy, G. Krishna, S. R. Dubey, B. B. Chaudhuri, Hybridsn: Exploring 3-d–2-d cnn feature hierarchy for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 17 (2019) 277–281. L....

2020
[7]

X.Yang,X.Zhang,Y.Ye,R.Y.Lau,S.Lu,X.Li,X.Huang, Synergistic2d/3dconvolutionalneuralnetworkforhyperspectralimageclassification, Remote Sensing 12 (2020)

H.Zhong,L.Li,J.Ren,W.Wu,R.Wang,Hyperspectralimageclassificationviaparallelmulti-inputmechanism-basedconvolutionalneuralnetwork, Multimedia Tools and Applications (2022) 1–26. X.Yang,X.Zhang,Y.Ye,R.Y.Lau,S.Lu,X.Li,X.Huang, Synergistic2d/3dconvolutionalneuralnetworkforhyperspectralimageclassification, Remote Sensing 12 (2020)

2022
[8]

H. Gao, Y. Yang, C. Li, L. Gao, B. Zhang, Multiscale residual network with mixed depthwise convolution for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 3396–3408. Z.Ye,C.Li,Q.Liu,L.Bai,J.E.Fowler, Computationallylightweighthyperspectralimageclassificationusingamultiscaledepthwiseconvolutional network wit...

2020
[9]

B.Cui,X.-M.Dong,Q.Zhan,J.Peng,W.Sun, Litedepthwisenet:Alightweightnetworkforhyperspectralimageclassification, IEEETransactions on Geoscience and Remote Sensing 60 (2021) 1–15. X. T. Nguyen, G. S. Tran, Hyperspectral image classification using an encoder-decoder model with depthwise separable convolution, squeeze and excitation blocks, Earth Science Inform...

2021
[10]

Z. Xue, X. Yu, B. Liu, X. Tan, X. Wei, Hresnetam: Hierarchical residual network with attention mechanism for hyperspectral image classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021) 3566–3580. C. Shi, D. Liao, T. Zhang, L. Wang, Hyperspectral image classification based on 3d coordination attention mech...

2021
[11]

J.Wang,J.Sun,E.Zhang,T.Zhang,K.Yu,J.Peng, Hyperspectralimageclassificationviadeepnetworkwithattentionmechanismandmultigroup strategy, Expert Systems with Applications 224 (2023) 119904. M.Q. Alkhatib, A. Jamali:Preprint submitted to ElsevierPage 16 of 17 MixerCA: An Efficient Model for HSI Classification W. Liao, F. Wang, H. Zhao, Hyperspectral image clas...

2023
[12]

URL:https://arxiv.org/abs/1704.04861.arXiv:1704.04861. S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block attention module, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19. J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 2...

work page doi:10.1109/lgrs.2019.2918719 2018
[13]

D. Hong, J. Hu, J. Yao, J. Chanussot, X. X. Zhu, Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model, ISPRS Journal of Photogrammetry and Remote Sensing 178 (2021) 68–80. Y. Zhong, X. Hu, C. Luo, X. Wang, J. Zhao, L. Zhang, Whu-hi: Uav-borne hyperspectral with high spatial resolution...

2021

[1] [1]

M. B. Stuart, A. J. McGonigle, J. R. Willmott, Hyperspectral imaging in environmental monitoring: A review of recent developments and technological advances in compact field deployable systems, Sensors 19 (2019)

2019

[2] [2]

M. B. Stuart, M. Davies, M. J. Hobbs, T. D. Pering, A. J. McGonigle, J. R. Willmott, High-resolution hyperspectral imaging using low-cost components: Application within environmental monitoring scenarios, Sensors 22 (2022)

2022

[3] [3]

J. G. A. Barbedo, A review on the combination of deep learning techniques with proximal hyperspectral images in agriculture, Computers and Electronics in Agriculture 210 (2023) 107920. B.Lu,P.D.Dao,J.Liu,Y.He,J.Shang, Recentadvancesofhyperspectralimagingtechnologyandapplicationsinagriculture, RemoteSensing12 (2020)

2023

[4] [4]

Hajaj, A

S. Hajaj, A. El Harti, A. B. Pour, A. Jellouli, Z. Adiri, M. Hashim, A review on hyperspectral imagery application for lithological mapping and mineralprospecting:Machinelearningtechniquesandfutureprospects, RemoteSensingApplications:SocietyandEnvironment(2024)101218. N. Okada, B. Bino Sinaice, J. Kim, H. Nozaki, K. Takizawa, N. Owada, Y. Ohtomo, Y. Kawam...

2024

[5] [5]

IGARSS’05., volume 1, IEEE, 2005, pp. 4–pp. S. Wang, A. Dou, X. Yuan, X. Zhang, The airborne hyperspectral image classification based on the random forest algorithm, in: 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), IEEE, 2016, pp. 2280–2283. K. Makantasis, K. Karantzalos, A. Doulamis, N. Doulamis, Deep supervised learning for ...

2005

[6] [6]

R.Vaddi,P.Manoharan, Hyperspectralimageclassificationusingcnnwithspectralandspatialfeaturesintegration, InfraredPhysics&Technology 107 (2020) 103296. S. K. Roy, G. Krishna, S. R. Dubey, B. B. Chaudhuri, Hybridsn: Exploring 3-d–2-d cnn feature hierarchy for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 17 (2019) 277–281. L....

2020

[7] [7]

X.Yang,X.Zhang,Y.Ye,R.Y.Lau,S.Lu,X.Li,X.Huang, Synergistic2d/3dconvolutionalneuralnetworkforhyperspectralimageclassification, Remote Sensing 12 (2020)

H.Zhong,L.Li,J.Ren,W.Wu,R.Wang,Hyperspectralimageclassificationviaparallelmulti-inputmechanism-basedconvolutionalneuralnetwork, Multimedia Tools and Applications (2022) 1–26. X.Yang,X.Zhang,Y.Ye,R.Y.Lau,S.Lu,X.Li,X.Huang, Synergistic2d/3dconvolutionalneuralnetworkforhyperspectralimageclassification, Remote Sensing 12 (2020)

2022

[8] [8]

H. Gao, Y. Yang, C. Li, L. Gao, B. Zhang, Multiscale residual network with mixed depthwise convolution for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 59 (2020) 3396–3408. Z.Ye,C.Li,Q.Liu,L.Bai,J.E.Fowler, Computationallylightweighthyperspectralimageclassificationusingamultiscaledepthwiseconvolutional network wit...

2020

[9] [9]

B.Cui,X.-M.Dong,Q.Zhan,J.Peng,W.Sun, Litedepthwisenet:Alightweightnetworkforhyperspectralimageclassification, IEEETransactions on Geoscience and Remote Sensing 60 (2021) 1–15. X. T. Nguyen, G. S. Tran, Hyperspectral image classification using an encoder-decoder model with depthwise separable convolution, squeeze and excitation blocks, Earth Science Inform...

2021

[10] [10]

Z. Xue, X. Yu, B. Liu, X. Tan, X. Wei, Hresnetam: Hierarchical residual network with attention mechanism for hyperspectral image classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (2021) 3566–3580. C. Shi, D. Liao, T. Zhang, L. Wang, Hyperspectral image classification based on 3d coordination attention mech...

2021

[11] [11]

J.Wang,J.Sun,E.Zhang,T.Zhang,K.Yu,J.Peng, Hyperspectralimageclassificationviadeepnetworkwithattentionmechanismandmultigroup strategy, Expert Systems with Applications 224 (2023) 119904. M.Q. Alkhatib, A. Jamali:Preprint submitted to ElsevierPage 16 of 17 MixerCA: An Efficient Model for HSI Classification W. Liao, F. Wang, H. Zhao, Hyperspectral image clas...

2023

[12] [12]

URL:https://arxiv.org/abs/1704.04861.arXiv:1704.04861. S. Woo, J. Park, J.-Y. Lee, I. S. Kweon, Cbam: Convolutional block attention module, in: Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19. J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2018) 2...

work page doi:10.1109/lgrs.2019.2918719 2018

[13] [13]

D. Hong, J. Hu, J. Yao, J. Chanussot, X. X. Zhu, Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model, ISPRS Journal of Photogrammetry and Remote Sensing 178 (2021) 68–80. Y. Zhong, X. Hu, C. Luo, X. Wang, J. Zhao, L. Zhang, Whu-hi: Uav-borne hyperspectral with high spatial resolution...

2021