Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers
Pith reviewed 2026-05-24 00:31 UTC · model grok-4.3
The pith
DCT initialization of attention projections raises classification accuracy in Vision Transformers while frequency truncation cuts computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Initializing self-attention projection matrices with Discrete Cosine Transform (DCT) coefficients yields higher classification accuracy on CIFAR-10 and ImageNet-1K than random initialization. Separately, truncating the high-frequency DCT components of input patches before projection reduces the dimensionality of the query, key, and value matrices, which gives substantial computational savings on Swin Transformer models at comparable performance.
What carries the argument
DCT-based decorrelated attention, which replaces random initialization of QKV projections with DCT coefficients and removes high-frequency DCT components from input patches to lower projection dimension.
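A minimal sketch of what a DCT-based starting point for the projections could look like. It builds the orthonormal DCT-II matrix and copies it into three weight matrices. The helper names (`dct_matrix`, `init_qkv_with_dct`) are hypothetical, and the choice to give all three projections the same unscaled matrix is an assumption; the abstract does not specify the paper's exact scheme. Pure Python is used to avoid dependencies; a tensor library would be the practical choice.

```python
import math

def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for k in range(n):
        # Row 0 is the constant (DC) basis vector and needs a smaller scale.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def init_qkv_with_dct(dim):
    """Hypothetical helper: give W_q, W_k and W_v their own copy of the DCT matrix.

    Assumption: the paper may scale or assign the coefficients differently.
    """
    base = dct_matrix(dim)
    return {name: [row[:] for row in base] for name in ("W_q", "W_k", "W_v")}

def max_deviation_from_identity(m):
    """Largest entry of |M M^T - I|; near zero when the rows are orthonormal."""
    n = len(m)
    worst = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(m[i], m[j]))
            worst = max(worst, abs(dot - (1.0 if i == j else 0.0)))
    return worst

weights = init_qkv_with_dct(8)
deviation = max_deviation_from_identity(weights["W_q"])
print(f"max |W W^T - I| = {deviation:.2e}")
```

The orthonormality check shows one property such an initialization has that a random Gaussian draw lacks: the projection starts out as a norm-preserving change of basis.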
If this is right
- DCT initialization raises accuracy on both CIFAR-10 and ImageNet-1K benchmarks compared with standard random initialization.
- DCT truncation lowers the dimension of QKV projections and thereby reduces overall computational overhead in Swin Transformer models.
- Performance after compression remains comparable to the original Swin Transformer on the tested image-classification tasks.
- The two DCT methods can be used together or separately without architectural changes to the transformer blocks.
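The dimensionality argument in the list above can be sketched with a small calculation. The snippet transforms one token into the DCT domain, keeps only the lowest-frequency coefficients, and counts multiply-accumulates for the QKV projections before and after. The token width of 96 and window of 49 tokens match Swin-T's first stage, but the keep ratio of one half, the helper names, and the assumption that the projection output width shrinks along with the input are illustrative choices, not figures from the paper.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, v in enumerate(x)))
    return out

def truncate_high_freq(x, keep):
    """Keep the `keep` lowest-frequency DCT coefficients and drop the rest."""
    return dct_ii(x)[:keep]

def qkv_macs(num_tokens, in_dim, out_dim):
    """Multiply-accumulates for three dense projections (Q, K and V)."""
    return 3 * num_tokens * in_dim * out_dim

# Made-up smooth token of width 96; real tokens would come from patch embeddings.
token = [math.sin(0.1 * i) + 0.5 for i in range(96)]
compact = truncate_high_freq(token, keep=48)

full_cost = qkv_macs(num_tokens=49, in_dim=96, out_dim=96)
reduced_cost = qkv_macs(num_tokens=49, in_dim=48, out_dim=48)
print(len(token), "->", len(compact), "features per token")
print("QKV MACs per window:", full_cost, "->", reduced_cost)
```

Under these assumptions, halving the feature width cuts the projection cost by a factor of four. The DCT itself adds cost that this count ignores.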
Where Pith is reading between the lines
- The same DCT initialization and truncation steps could be tested on other Vision Transformer families such as DeiT or the original ViT to check whether the gains transfer.
- If high-frequency truncation works because natural images are low-frequency dominant, the method may need adjustment for tasks where fine detail matters, such as medical imaging or satellite analysis.
- Replacing learned linear projections entirely with fixed DCT bases might further reduce parameter count, though that extension is not tested here.
Load-bearing premise
High-frequency DCT coefficients of input patches typically correspond to noise that can be truncated without reducing classification accuracy.
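One way to probe this premise is to measure how much signal energy truncation discards. The sketch below compares a smooth signal with an alternating, fine-detail one. Both signals are synthetic and the function name `high_freq_energy_fraction` is hypothetical; a real check would run on image patches and track accuracy, not energy alone.

```python
import math

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i, v in enumerate(x)))
    return out

def high_freq_energy_fraction(x, keep):
    """Fraction of total energy in the coefficients that truncation would drop."""
    coeffs = dct_ii(x)
    total = sum(c * c for c in coeffs)
    dropped = sum(c * c for c in coeffs[keep:])
    return dropped / total if total else 0.0

n = 64
# Smooth signal: a single low-frequency cosine (the k = 2 DCT basis shape).
smooth = [math.cos(math.pi * (2 * i + 1) * 2 / (2 * n)) for i in range(n)]
# Fine-detail signal: values alternate sign at every sample.
detailed = [(-1.0) ** i for i in range(n)]

smooth_loss = high_freq_energy_fraction(smooth, keep=32)
detail_loss = high_freq_energy_fraction(detailed, keep=32)
print(f"smooth signal loses   {smooth_loss:.3f} of its energy")
print(f"detailed signal loses {detail_loss:.3f} of its energy")
```

The smooth signal loses essentially nothing while the alternating one loses most of its energy. This is the case the premise has to rule out for the target data.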
What would settle it
Retraining the compressed model on a dataset where high-frequency image details determine correct labels and measuring a clear accuracy drop relative to the uncompressed baseline would falsify the compression claim.
Original abstract
Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by introducing a simple yet effective DCT-based initialization strategy for self-attention, where projection weights are initialized using DCT coefficients. This structure-preserving approach consistently improves classification accuracy on the CIFAR-10 and ImageNet-1K benchmarks. Second, we propose a DCT-based attention compression technique that exploits the decorrelation properties of the frequency domain. By observing that high-frequency DCT coefficients typically correspond to noise, we truncate high-frequency components of the input patches, thereby reducing the dimensionality of the query, key, and value projections without sacrificing accuracy. Experiments on Swin Transformer models demonstrate that the proposed compression method achieves a substantial reduction in computational overhead while maintaining comparable performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two DCT-based techniques for Vision Transformers: a structured initialization of the query/key/value projection weights using DCT coefficients, claimed to improve classification accuracy on CIFAR-10 and ImageNet-1K, and an attention compression method that truncates high-frequency DCT coefficients of input patches before projection (justified as removing noise) to reduce QKV dimensionality and computational cost while preserving performance on Swin Transformer models.
Significance. If the empirical claims are substantiated, the initialization approach would supply a non-random, frequency-domain starting point for attention weights that could aid convergence without additional hyperparameters. The compression technique, if the high-frequency truncation reliably discards only non-discriminative content, would constitute a lightweight, architecture-agnostic way to lower attention FLOPs.
major comments (1)
- [Abstract] The claim that high-frequency DCT coefficients 'typically correspond to noise' and can be truncated 'without sacrificing accuracy' is load-bearing for the second contribution, yet the provided text supplies no controlled verification (e.g., reconstruction error on retained vs. discarded frequencies, class-conditional frequency saliency, or comparison against random truncation of equal dimensionality) on the ImageNet-1K or CIFAR-10 data.
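Two of the controls named in this comment, reconstruction error and a random-truncation baseline of equal dimensionality, can be sketched as follows. The test signal is synthetic, built so that its DCT spectrum decays as 1/(k+1), and every name here is hypothetical. The requested experiment would use ImageNet-1K or CIFAR-10 patches and report accuracy as well.

```python
import math
import random

def _basis(k, i, n):
    """Entry (k, i) of the orthonormal DCT-II matrix."""
    scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    return scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))

def dct_ii(x):
    n = len(x)
    return [sum(x[i] * _basis(k, i, n) for i in range(n)) for k in range(n)]

def idct_ii(c):
    """Inverse of dct_ii (the transpose, since the matrix is orthonormal)."""
    n = len(c)
    return [sum(c[k] * _basis(k, i, n) for k in range(n)) for i in range(n)]

def reconstruction_error(x, kept):
    """Squared error after zeroing every DCT coefficient whose index is not in `kept`."""
    coeffs = dct_ii(x)
    masked = [c if k in kept else 0.0 for k, c in enumerate(coeffs)]
    approx = idct_ii(masked)
    return sum((a - b) ** 2 for a, b in zip(x, approx))

n, keep = 32, 16
# Synthetic signal whose k-th DCT coefficient is 1 / (k + 1).
signal = idct_ii([1.0 / (k + 1) for k in range(n)])

low_pass = set(range(keep))                      # the proposed truncation
rng = random.Random(0)                           # fixed seed for repeatability
random_subset = set(rng.sample(range(n), keep))  # equal-size random baseline

low_err = reconstruction_error(signal, low_pass)
rand_err = reconstruction_error(signal, random_subset)
print(f"low-pass error: {low_err:.4f}   random-subset error: {rand_err:.4f}")
```

For a spectrum that decays with frequency, keeping the lowest indices is the best subset of that size, so the low-pass error cannot exceed the random baseline. Whether real patch spectra behave this way is what the control is meant to test.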
minor comments (1)
- [Abstract] Quantitative results (accuracy deltas, FLOPs reduction factors, baselines, number of runs, error bars) are referred to but not reported, which prevents any assessment of effect size or statistical reliability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the single major comment below and agree that additional analysis will strengthen the manuscript.
- Referee: [Abstract] The claim that high-frequency DCT coefficients 'typically correspond to noise' and can be truncated 'without sacrificing accuracy' is load-bearing for the second contribution, yet the provided text supplies no controlled verification (e.g., reconstruction error on retained vs. discarded frequencies, class-conditional frequency saliency, or comparison against random truncation of equal dimensionality) on the ImageNet-1K or CIFAR-10 data.
Authors: We appreciate the referee's observation. The manuscript reports that the proposed truncation maintains comparable accuracy on Swin Transformer models for the cited benchmarks, but we agree that the abstract's claim would be better supported by the specific controlled verifications mentioned (reconstruction error, class-conditional analysis, or random-truncation baselines). We will add these experiments and corresponding discussion in the revised manuscript.
revision: yes
Circularity Check
No circularity: empirical proposals validated on external benchmarks
The paper presents two empirical methods (DCT-based weight initialization and high-frequency truncation for attention compression) justified by the observation that high-frequency DCT coefficients typically correspond to noise. These are tested directly on CIFAR-10 and ImageNet-1K with Swin Transformers, showing accuracy gains and compute reduction. No equations, derivations, or predictions are defined in terms of fitted parameters from the same data; no self-citations form a load-bearing chain; the central claims rest on external benchmark results rather than reducing to inputs by construction. This is the standard case of a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High-frequency DCT coefficients of image patches correspond to noise and can be truncated without sacrificing accuracy.