Discrete Cosine Transform Based Decorrelated Attention for Vision Transformers

Ahmet Enis Cetin; Emadeldeen Hamdan; Hongyi Pan; Ulas Bagci; Xin Zhu

arXiv: 2405.13901 · v6 · pith:3M7MSWIZnew · submitted 2024-05-22 · 💻 cs.CV · cs.LG · eess.SP


Pith reviewed 2026-05-24 00:31 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.SP

keywords Vision Transformer · Discrete Cosine Transform · Self-attention · Model initialization · Attention compression · Image classification · Swin Transformer


The pith

DCT initialization of attention projections raises classification accuracy in Vision Transformers while frequency truncation cuts computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes initializing the query, key, and value projection weights in self-attention layers with coefficients drawn from the Discrete Cosine Transform rather than random values. This change improves accuracy on CIFAR-10 and ImageNet-1K. It also shows that high-frequency DCT coefficients of input patches can be dropped to shrink the dimension of those projections, lowering the cost of attention in Swin Transformer models while keeping accuracy nearly the same. A reader would care because random initialization of attention weights is known to be slow and unstable, and attention itself dominates the compute budget in large vision models.

Core claim

Initializing self-attention projection matrices with Discrete Cosine Transform coefficients produces higher classification accuracy on CIFAR-10 and ImageNet-1K, and truncating high-frequency DCT components of input patches before projection reduces the dimensionality of query, key, and value matrices, yielding substantial computational savings on Swin Transformer models with comparable performance.

What carries the argument

DCT-based decorrelated attention, which replaces random initialization of QKV projections with DCT coefficients and removes high-frequency DCT components from input patches to lower projection dimension.
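
As a rough illustration of the initialization idea, the sketch below builds an orthonormal DCT-II matrix and copies it into square Q, K, and V weight matrices. The helper names (`dct_matrix`, `init_qkv_with_dct`, `max_orthogonality_error`) are hypothetical, and the choice to start all three projections from the same unscaled DCT matrix is an assumption made here; this page does not state the exact scheme used in the paper.

```python
import math

def dct_matrix(n: int) -> list[list[float]]:
    """Orthonormal DCT-II matrix; row k holds the k-th frequency basis vector."""
    matrix = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        matrix.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for i in range(n)])
    return matrix

def init_qkv_with_dct(dim: int) -> dict[str, list[list[float]]]:
    """Assumption: Q, K, and V weights each start as a separate copy of the DCT matrix."""
    base = dct_matrix(dim)
    return {name: [row[:] for row in base] for name in ("q", "k", "v")}

def max_orthogonality_error(w: list[list[float]]) -> float:
    """Largest entry-wise deviation of W W^T from the identity matrix."""
    n = len(w)
    worst = 0.0
    for a in range(n):
        for b in range(n):
            dot = sum(w[a][t] * w[b][t] for t in range(n))
            worst = max(worst, abs(dot - (1.0 if a == b else 0.0)))
    return worst

weights = init_qkv_with_dct(8)
# The rows are orthonormal, so this prints a value at floating-point noise level.
print(max_orthogonality_error(weights["q"]))
```

In a real model these values would be copied into the projection layers before training starts; the weights remain trainable, so the DCT only fixes the starting point.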

If this is right

DCT initialization raises accuracy on both CIFAR-10 and ImageNet-1K benchmarks compared with standard random initialization.
DCT truncation lowers the dimension of QKV projections and thereby reduces overall computational overhead in Swin Transformer models.
Performance after compression remains comparable to the original Swin Transformer on the tested image-classification tasks.
The two DCT methods can be used together or separately without architectural changes to the transformer blocks.
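
To give a feel for how a smaller projection dimension translates into savings, here is a back-of-the-envelope multiply-accumulate count. The layout of the compressed layer (a dense truncated DCT, `kept × kept` Q/K/V projections, and a `kept → dim` output projection) is an assumption for illustration, not the paper's accounting, and the function names are hypothetical. Softmax is left out of both counts. The sizes (49 tokens, 96 channels, half of the coefficients kept) are illustrative values only.

```python
def attention_macs(tokens: int, dim: int) -> int:
    """Multiply-accumulates for one single-head attention layer, Softmax excluded."""
    qkv = 3 * tokens * dim * dim        # Q, K, V projections
    scores = tokens * tokens * dim      # Q K^T
    mixing = tokens * tokens * dim      # attention weights times V
    out = tokens * dim * dim            # output projection
    return qkv + scores + mixing + out

def compressed_attention_macs(tokens: int, dim: int, kept: int) -> int:
    """Same count when only `kept` low-frequency DCT coefficients are projected.

    Assumed layout (not taken from the paper): truncated DCT as a dense
    dim -> kept multiply, kept x kept Q/K/V projections, kept -> dim output.
    """
    dct = tokens * dim * kept
    qkv = 3 * tokens * kept * kept
    scores = tokens * tokens * kept
    mixing = tokens * tokens * kept
    out = tokens * kept * dim
    return dct + qkv + scores + mixing + out

baseline = attention_macs(tokens=49, dim=96)
compressed = compressed_attention_macs(tokens=49, dim=96, kept=48)
print(baseline, compressed, round(compressed / baseline, 2))  # 2267328 1020768 0.45
```

Under these assumptions the quadratic `dim * dim` terms shrink fastest, which is why the saving exceeds the 50% reduction in kept coefficients even after paying for the DCT itself.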

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same DCT initialization and truncation steps could be tested on other Vision Transformer families such as DeiT to check whether the gains transfer.
If high-frequency truncation works because natural images are low-frequency dominant, the method may need adjustment for tasks where fine detail matters, such as medical imaging or satellite analysis.
Replacing learned linear projections entirely with fixed DCT bases might further reduce parameter count, though that extension is not tested here.

Load-bearing premise

High-frequency DCT coefficients of input patches typically correspond to noise that can be truncated without reducing classification accuracy.

What would settle it

Retraining the compressed model on a dataset where high-frequency image details determine correct labels and measuring a clear accuracy drop relative to the uncompressed baseline would falsify the compression claim.
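
The failure mode described above can be made concrete with a toy case. In the sketch below, two 8-sample "patches" differ only along the highest-frequency DCT basis vector, so a label that depends on that difference becomes unrecoverable once the upper coefficients are dropped. The patches, the value of `kept`, and the helper names are all made up for illustration; nothing here reproduces an experiment from the paper.

```python
import math

def dct_matrix(n: int) -> list[list[float]]:
    """Orthonormal DCT-II matrix; row k holds the k-th frequency basis vector."""
    matrix = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        matrix.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for i in range(n)])
    return matrix

def dct(x: list[float], c: list[list[float]]) -> list[float]:
    """Forward DCT of a 1-D signal using the precomputed matrix c."""
    return [sum(c[k][i] * x[i] for i in range(len(x))) for k in range(len(c))]

n, kept, eps = 8, 4, 0.5
c = dct_matrix(n)
base = [float(i) for i in range(n)]   # smooth ramp shared by both classes
detail = c[n - 1]                     # highest-frequency basis vector
patch_a = [b + eps * d for b, d in zip(base, detail)]
patch_b = [b - eps * d for b, d in zip(base, detail)]

coeffs_a, coeffs_b = dct(patch_a, c), dct(patch_b, c)
full_gap = max(abs(a - b) for a, b in zip(coeffs_a, coeffs_b))
kept_gap = max(abs(a - b) for a, b in zip(coeffs_a[:kept], coeffs_b[:kept]))
# full_gap is about 2 * eps; kept_gap is at floating-point noise level.
print(full_gap, kept_gap)
```

After truncation the two patches are numerically indistinguishable, so any downstream layer sees the same input for both classes. Whether real datasets contain such cases is exactly what the proposed test would measure.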

Figures

Figures reproduced from arXiv: 2405.13901 by Ahmet Enis Cetin, Emadeldeen Hamdan, Hongyi Pan, Ulas Bagci, Xin Zhu.

**Figure 2.** Basis vectors of an 8 × 8 DCT matrix, where D_i refers to the i-th column. The DCT basis vectors provide a good approximation to the eigenvectors of the Toeplitz matrix with ρ = 0.9 (KLT). Computational overhead from the Softmax function is omitted because the Softmax function is based on exponential terms. Despite this, the Softmax function in the DCT-compressed attention requires less comput…

**Figure 3.** Frequency response of the DCT basis vectors. Each fre…

**Figure 4.** Ablation study on ImageNet-1K for (a) initializing multiple attention weights with the DCT matrix and (b) removing DCT from…
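
The Figure 2 caption states that the DCT basis approximates the eigenvectors of a Toeplitz matrix with ρ = 0.9. A minimal numerical check of that statement, assuming the Toeplitz matrix in question is the first-order covariance with entries ρ^|i−j|, is to transform the covariance into the DCT domain and measure how much of its squared mass sits off the diagonal. All helper names are hypothetical. Note that the matrix below stores basis vectors as rows, whereas the caption indexes them as columns.

```python
import math

def dct_matrix(n: int) -> list[list[float]]:
    """Orthonormal DCT-II matrix; row k holds the k-th frequency basis vector."""
    matrix = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        matrix.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for i in range(n)])
    return matrix

def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Plain dense matrix product."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def off_diagonal_share(m: list[list[float]]) -> float:
    """Fraction of the squared Frobenius norm that lies off the diagonal."""
    total = sum(v * v for row in m for v in row)
    diag = sum(m[i][i] ** 2 for i in range(len(m)))
    return (total - diag) / total

n, rho = 8, 0.9
c = dct_matrix(n)
r = [[rho ** abs(i - j) for j in range(n)] for i in range(n)]  # assumed covariance
c_t = [list(col) for col in zip(*c)]
t = matmul(matmul(c, r), c_t)        # covariance expressed in the DCT domain

share_before = off_diagonal_share(r)
share_after = off_diagonal_share(t)
print(f"{share_before:.3f} -> {share_after:.3f}")
```

A perfect eigenbasis (the KLT) would drive the second number to exactly zero; the DCT gets close without depending on the data, which is the sense of "decorrelated" in the title.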

Original abstract

Self-attention is central to the success of Transformer architectures; however, learning the query, key, and value projections from random initialization remains challenging and computationally expensive. In this paper, we propose two complementary methods that leverage the Discrete Cosine Transform (DCT) to enhance the efficiency and performance of Vision Transformers. First, we address the initialization problem by introducing a simple yet effective DCT-based initialization strategy for self-attention, where projection weights are initialized using DCT coefficients. This structure-preserving approach consistently improves classification accuracy on the CIFAR-10 and ImageNet-1K benchmarks. Second, we propose a DCT-based attention compression technique that exploits the decorrelation properties of the frequency domain. By observing that high-frequency DCT coefficients typically correspond to noise, we truncate high-frequency components of the input patches, thereby reducing the dimensionality of the query, key, and value projections without sacrificing accuracy. Experiments on Swin Transformer models demonstrate that the proposed compression method achieves a substantial reduction in computational overhead while maintaining comparable performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DCT init for attention looks like a simple practical tweak worth testing, but the compression claim hinges on an unexamined assumption that high-frequency patch coefficients are just noise.


The main things here are a DCT-based initialization for the QKV projection weights in self-attention, which the authors report lifts accuracy on CIFAR-10 and ImageNet-1K, plus a compression step that drops high-frequency DCT coefficients from input patches before projection to shrink the matrices and cut compute on Swin Transformers while holding performance roughly steady. Both moves are presented as direct uses of the decorrelation property of the DCT. The initialization part is the cleaner of the two because it is a drop-in change that does not alter the forward pass. The compression part is more aggressive and rests on the premise that high frequencies can be treated as noise without losing discriminative signal.

The paper does a reasonable job of keeping the proposals simple and linking them to classical signal processing rather than inventing new machinery. If the numbers hold in the full experiments, the initialization strategy could be an easy thing for others to try when training ViTs from scratch. The compression idea is less routine and applies the frequency view inside the attention block in a way that is not a standard extension of prior pruning or low-rank work.

The soft spots are concentrated on the compression side. The abstract states that high-frequency coefficients typically correspond to noise but supplies no supporting checks such as controlled ablations, reconstruction quality, or comparison against random truncation at the same dimensionality. Without those, it is possible the observed parity is tied to the specific datasets or to Swin’s hierarchical structure rather than a general property of DCT truncation. The abstract also gives no implementation details, error bars, or statistical tests, so the size of the claimed gains and savings cannot be judged from what is shown. This leaves the soundness low until the full paper is examined.

The work is aimed at people already tuning Vision Transformers for efficiency or training stability. A reader who cares about frequency-domain methods or initialization schemes could get value from trying the initialization piece even if the compression needs more validation. It deserves a serious referee because the application of DCT inside attention is new enough and the efficiency angle is relevant, though the authors would need to add direct evidence for the noise assumption and fuller experimental reporting.

Referee Report

1 major / 1 minor

Summary. The paper proposes two DCT-based techniques for Vision Transformers: a structured initialization of the query/key/value projection weights using DCT coefficients, claimed to improve classification accuracy on CIFAR-10 and ImageNet-1K, and an attention compression method that truncates high-frequency DCT coefficients of input patches before projection (justified as removing noise) to reduce QKV dimensionality and computational cost while preserving performance on Swin Transformer models.

Significance. If the empirical claims are substantiated, the initialization approach would supply a non-random, frequency-domain starting point for attention weights that could aid convergence without additional hyperparameters. The compression technique, if the high-frequency truncation reliably discards only non-discriminative content, would constitute a lightweight, architecture-agnostic way to lower attention FLOPs.

major comments (1)

[Abstract] Abstract: the compression claim that high-frequency DCT coefficients 'typically correspond to noise' and can be truncated 'without sacrificing accuracy' is load-bearing for the second contribution, yet the provided text supplies no controlled verification (e.g., reconstruction error on retained vs. discarded frequencies, class-conditional frequency saliency, or comparison against random truncation of equal dimensionality) on the ImageNet-1K or CIFAR-10 data.
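
One of the baselines named in this comment, truncation by a random mask of equal dimensionality, can be sketched analytically under a simple signal model. The code below assumes zero-mean inputs with covariance ρ^|i−j| (ρ = 0.9, length 8), in which case the expected squared reconstruction error of any mask equals the summed variance of the discarded DCT coefficients. The model, the parameter values, and the helper names are assumptions for illustration; the check says nothing about ImageNet-1K or CIFAR-10, and nothing about classification accuracy.

```python
import math

def dct_matrix(n: int) -> list[list[float]]:
    """Orthonormal DCT-II matrix; row k holds the k-th frequency basis vector."""
    matrix = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        matrix.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for i in range(n)])
    return matrix

def dct_variances(n: int, rho: float) -> list[float]:
    """Variance of each DCT coefficient when the input covariance is rho**|i-j|."""
    c = dct_matrix(n)
    return [sum(c[k][i] * rho ** abs(i - j) * c[k][j]
                for i in range(n) for j in range(n))
            for k in range(n)]

n, kept, rho = 8, 4, 0.9
var = dct_variances(n, rho)

# Expected squared error when the `kept` lowest frequencies are retained.
lowpass_error = sum(var[kept:])
# Average over all masks that keep `kept` coefficients: each coefficient is
# discarded with probability (n - kept) / n.
random_error = sum(var) * (n - kept) / n

print(f"low-pass: {lowpass_error:.3f}  random mask (mean): {random_error:.3f}")
```

A large gap between the two numbers on real patch statistics would support the ordering assumption behind low-pass truncation; the accuracy question raised by the referee would still need the retraining experiments.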

minor comments (1)

[Abstract] Abstract: quantitative results (accuracy deltas, FLOPs reduction factors, baselines, number of runs, error bars) are referenced but not reported, preventing assessment of effect size or statistical reliability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below and agree that additional analysis will strengthen the manuscript.


Referee: [Abstract] Abstract: the compression claim that high-frequency DCT coefficients 'typically correspond to noise' and can be truncated 'without sacrificing accuracy' is load-bearing for the second contribution, yet the provided text supplies no controlled verification (e.g., reconstruction error on retained vs. discarded frequencies, class-conditional frequency saliency, or comparison against random truncation of equal dimensionality) on the ImageNet-1K or CIFAR-10 data.

Authors: We appreciate the referee's observation. The manuscript reports that the proposed truncation maintains comparable accuracy on Swin Transformer models for the cited benchmarks, but we agree that the abstract's claim would be better supported by the specific controlled verifications mentioned (reconstruction error, class-conditional analysis, or random-truncation baselines). We will add these experiments and corresponding discussion in the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical proposals validated on external benchmarks


The paper presents two empirical methods (DCT-based weight initialization and high-frequency truncation for attention compression) justified by the observation that high-frequency DCT coefficients typically correspond to noise. These are tested directly on CIFAR-10 and ImageNet-1K with Swin Transformers, showing accuracy gains and compute reduction. No equations, derivations, or predictions are defined in terms of fitted parameters from the same data; no self-citations form a load-bearing chain; the central claims rest on external benchmark results rather than reducing to inputs by construction. This is the standard case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claims rest primarily on a domain assumption about the noise content of high-frequency DCT coefficients and on the utility of DCT for providing structured initialization; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: High-frequency DCT coefficients of image patches correspond to noise and can be truncated without sacrificing accuracy
Explicitly invoked as the justification for the compression method in the abstract.



[1] [1]

Discrete cosine transform.IEEE trans- actions on Computers, 100(1):90–93,

[Ahmedet al., 1974 ] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform.IEEE trans- actions on Computers, 100(1):90–93,

work page 1974

[2] [2]

Toeplitz approximation to empirical correlation matrix of asset returns: A signal processing perspective

[Akansu and Torun, 2012] Ali N Akansu and Mustafa U Torun. Toeplitz approximation to empirical correlation matrix of asset returns: A signal processing perspective. IEEE Journal of Selected Topics in Signal Processing, 6(4):319–326,

work page 2012

[3] [3]

Hydra at- tention: Efficient attention with many heads

[Bolyaet al., 2022 ] Daniel Bolya, Cheng-Yang Fu, Xiao- liang Dai, Peizhao Zhang, and Judy Hoffman. Hydra at- tention: Efficient attention with many heads. InEuropean Conference on Computer Vision, pages 35–49. Springer,

work page 2022

[4] [4]

Reminder of the first paper on transfer learning in neural networks, 1976.Infor- matica, 44(3),

[Bozinovski, 2020] Stevo Bozinovski. Reminder of the first paper on transfer learning in neural networks, 1976.Infor- matica, 44(3),

work page 2020

[5] [5]

Fourier image transformer

[Buchholz and Jug, 2022] Tim-Oliver Buchholz and Florian Jug. Fourier image transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1846–1854,

work page 2022

[6] [6]

Cascade r-cnn: Delving into high quality object detection

[Cai and Vasconcelos, 2018] Zhaowei Cai and Nuno Vas- concelos. Cascade r-cnn: Delving into high quality object detection. InProceedings of the IEEE conference on com- puter vision and pattern recognition, pages 6154–6162,

work page 2018

[7] [7]

A fast computational algorithm for the discrete cosine transform.IEEE Transactions on communications, 25(9):1004–1009,

[Chenet al., 1977 ] Wen-Hsiung Chen, CH Smith, and Sam Fralick. A fast computational algorithm for the discrete cosine transform.IEEE Transactions on communications, 25(9):1004–1009,

work page 1977

[8] [8]

Fast fourier convolution.Advances in Neural Information Pro- cessing Systems, 33:4479–4488,

[Chiet al., 2020 ] Lu Chi, Borui Jiang, and Yadong Mu. Fast fourier convolution.Advances in Neural Information Pro- cessing Systems, 33:4479–4488,

work page 2020

[9] [9]

AutoAugment: Learning Augmentation Policies from Data

[Cubuket al., 2018 ] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation policies from data.arXiv preprint arXiv:1805.09501,

work page internal anchor Pith review Pith/arXiv arXiv 2018

[10] [10]

Karhunen-loeve transform.The transform and data compression hand- book, 1(1-34):29,

[Dony and others, 2001] R Dony et al. Karhunen-loeve transform.The transform and data compression hand- book, 1(1-34):29,

work page 2001

[11] [11]

An image is worth 16x16 words: Transformers for image recognition at scale

[Dosovitskiyet al., 2020 ] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Min- derer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Repre- sentations,

work page 2020

[12] [12]

Mpeg-4 natural video coding–an overview.Sig- nal Processing: Image Communication, 15(4-5):365–385,

[Ebrahimi and Horne, 2000] Touradj Ebrahimi and Caspar Horne. Mpeg-4 natural video coding–an overview.Sig- nal Processing: Image Communication, 15(4-5):365–385,

work page 2000

[Glorot and Bengio, 2010] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010.

[Go and Tachibana, 2023] Mocho Go and Hideyuki Tachibana. gswin: Gated mlp vision model with hierarchical structure of shifted window. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[Ji et al., 2022] Xiaozhong Ji, Boyuan Jiang, Donghao Luo, Guangpin Tao, Wenqing Chu, Zhifeng Xie, Chengjie Wang, and Ying Tai. Colorformer: Image colorization via color memory assisted hybrid-attention transformer. In European Conference on Computer Vision, pages 20–36. Springer, 2022.

[Li et al., 2023] Xinyu Li, Yanyi Zhang, Jianbo Yuan, Hanlin Lu, and Yibo Zhu. Discrete cosin transformer: Image modeling from frequency domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5468–5478, 2023.

[Lin et al., 2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.

[Liu et al., 2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.

[Martens and others, 2010] James Martens et al. Deep learning via hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.

[Masood and Chandra, 2012] Sarfaraz Masood and Pravin Chandra. Training neural network with zero weight initialization. In Proceedings of the CUBE International Information Technology Conference, pages 235–239, 2012.

[Mehta and Rastegari, 2021] Sachin Mehta and Mohammad Rastegari. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In International Conference on Learning Representations, 2021.

[Pan et al., 2023] Hongyi Pan, Xin Zhu, Salih Furkan Atici, and Ahmet Cetin. A hybrid quantum-classical approach based on the hadamard transform for the convolutional layer. In International Conference on Machine Learning, pages 26891–26903. PMLR, 2023.

[Paszke et al., 2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.

[Patro et al., 2025] Badri N Patro, Vinay P Namboodiri, and Vijay S Agneeswaran. Spectformer: Frequency and attention is what you need in a vision transformer. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 9543–9554. IEEE, 2025.

[Radünz et al., 2022] Anabeth P Radünz, Fábio M Bayer, and Renato J Cintra. Low-complexity rounded klt approximation for image compression. Journal of Real-Time Image Processing, pages 1–11, 2022.

[Sandler et al., 2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.

[Saxe et al., 2013] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[Scribano et al., 2023] Carmelo Scribano, Giorgia Franchini, Marco Prato, and Marko Bertogna. Dct-former: Efficient self-attention with discrete cosine transform. Journal of Scientific Computing, 94(3):67, 2023.

[Setyawan et al., 2025] Novendra Setyawan, Chi-Chia Sun, Mao-Hsiu Hsu, Wen-Kai Kuo, and Jun-Wei Hsieh. Microvit: a vision transformer with low complexity self attention for edge device. In 2025 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5. IEEE, 2025.

[Shen et al., 2021] Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021.

[Testa and Magli, 2016] Matteo Testa and Enrico Magli. Compressive estimation and imaging based on autoregressive models. IEEE Transactions on Image Processing, 25(11):5077–5087, 2016.

[Trockman and Kolter, 2023] Asher Trockman and J Zico Kolter. Mimetic initialization of self-attention layers. arXiv preprint arXiv:2305.09828, 2023.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.

[Wallace, 1991] Gregory K Wallace. The jpeg still picture compression standard. Communications of the ACM, 34(4):30–44, 1991.

[Weiss et al., 2016] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):1–40, 2016.

[Xu et al., 2023] Zhiqiu Xu, Yanjie Chen, Kirill Vishniakov, Yida Yin, Zhiqiang Shen, Trevor Darrell, Lingjie Liu, and Zhuang Liu. Initializing models with larger ones. In The Twelfth International Conference on Learning Representations, 2023.

[Yu et al., 2022] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10819–10829, 2022.

[Yun et al., 2019] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019.

[Zhang et al., 2018] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.

[Zhang et al., 2019] Biao Zhang, Ivan Titov, and Rico Sennrich. Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 898–909, 2019.

[Zhao et al., 2022] Jiawei Zhao, Florian Tobias Schaefer, and Anima Anandkumar. Zero initialization: Initializing neural networks with only zeros and ones. Transactions on Machine Learning Research, 2022.

[Zheng et al., 2024] Jianqiao Zheng, Xueqian Li, and Simon Lucey. Structured initialization for attention in vision transformers. arXiv preprint arXiv:2404.01139, 2024.

[Zhong et al., 2020] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13001–13008, 2020.