arxiv: 2510.14393 · v1 · submitted 2025-10-16 · 💻 cs.AR · cs.LG

Low Power Vision Transformer Accelerator with Hardware-Aware Pruning and Optimized Dataflow

Pith reviewed 2026-05-18 06:41 UTC · model grok-4.3

classification 💻 cs.AR cs.LG
keywords vision transformer · hardware accelerator · dynamic pruning · dataflow optimization · low power · algorithm hardware co-design · feed forward network · energy efficiency

The pith

Dynamic token pruning and FFN2 pruning cut vision transformer operations by 61.5 percent and FFN2 weights by 59.3 percent while holding accuracy loss below 2 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that in vision transformers with short token lengths the feed-forward network (FFN), not self-attention, is the main computational cost. It shows this cost can be lowered through hardware-friendly dynamic token pruning, replacement of GELU with ReLU, and dynamic pruning of the second FFN layer (FFN2). These steps are paired with a row-wise dataflow that removes the need for data transposition and supports the resulting sparsity with little added hardware. If the approach holds, accelerators for vision tasks can reach high throughput and energy efficiency on modest silicon area without complex extra logic.
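
A back-of-the-envelope count of multiply-accumulates (MACs) shows why the FFN dominates at short token lengths. The sketch below assumes the standard DeiT-small shape (197 tokens, embedding width 384, FFN expansion ratio 4), the model named in the Figure 1 caption; these dimensions come from the DeiT architecture rather than from this page, and the function name is illustrative.

```python
def encoder_layer_macs(n_tokens: int, d_model: int, ffn_ratio: int = 4) -> dict:
    """Multiply-accumulate counts for one transformer encoder layer.

    Biases, softmax, LayerNorm and activations are ignored.
    """
    # Q, K, V and output projections: four (n x d) @ (d x d) products.
    projections = 4 * n_tokens * d_model * d_model
    # Q @ K^T and (attention weights) @ V: two products quadratic in n.
    # Summed over heads, the total does not depend on the head count.
    attention = 2 * n_tokens * n_tokens * d_model
    # FFN1 (d -> ratio*d) and FFN2 (ratio*d -> d).
    ffn = 2 * n_tokens * d_model * (ffn_ratio * d_model)
    return {"projections": projections, "attention": attention, "ffn": ffn}


# Assumed DeiT-small shape: 196 patches + 1 class token, width 384.
macs = encoder_layer_macs(n_tokens=197, d_model=384)
total = sum(macs.values())
for name, count in macs.items():
    print(f"{name:12s} {count / 1e6:8.1f} MMACs  {100 * count / total:5.1f}%")
```

Under these assumptions the FFN accounts for roughly 61% of the layer's MACs and the quadratic attention products for under 8%. With an expansion ratio of 4, the quadratic term only overtakes the FFN once the token count exceeds four times the embedding width.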

Core claim

The paper claims that algorithm-hardware co-design using hardware-friendly dynamic token pruning, GELU-to-ReLU substitution, and dynamic FFN2 pruning reduces total operations by 61.5 percent and FFN2 weights by 59.3 percent with less than 2 percent accuracy loss. The accompanying hardware uses row-wise dataflow with output-oriented access to eliminate transposition and handles the dynamic operations with minimal area overhead. Implemented in TSMC 28 nm CMOS, the design uses 496.4 K gates and 232 KB SRAM to deliver 1024 GOPS at 1 GHz, 2.31 TOPS/W energy efficiency, and 858.61 GOPS/mm² area efficiency.
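
The reported figures can be cross-checked against each other with simple arithmetic. The sketch below derives the power and area implied by the headline numbers; it assumes that both efficiency ratios are computed from the peak throughput and that one MAC counts as two operations, neither of which this page states.

```python
# Headline figures as reported in the abstract.
peak_gops = 1024.0              # GOPS at 1 GHz
clock_ghz = 1.0
energy_eff_tops_per_w = 2.31
area_eff_gops_per_mm2 = 858.61

# Operations issued per clock cycle at peak.
ops_per_cycle = peak_gops / clock_ghz
# Assumption: 1 MAC = 2 operations (multiply + add).
macs_per_cycle = ops_per_cycle / 2

# Implied power and area, assuming both ratios use the peak throughput.
implied_power_w = (peak_gops / 1000.0) / energy_eff_tops_per_w
implied_area_mm2 = peak_gops / area_eff_gops_per_mm2

print(f"ops/cycle:     {ops_per_cycle:.0f}")
print(f"MACs/cycle:    {macs_per_cycle:.0f}")
print(f"implied power: {implied_power_w * 1000:.0f} mW")
print(f"implied area:  {implied_area_mm2:.2f} mm^2")
```

This gives about 443 mW and 1.19 mm², the latter agreeing with the truncated "1.19m…" in the text captured with Figure 22. The 512 MACs per cycle would be consistent with eight 8x8 PE groups, although that group count is inferred here from the Figure 9 and Figure 14 captions rather than stated on this page.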

What carries the argument

Row-wise dataflow with output-oriented access that supports dynamic pruning operations without data transposition.
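
To make the transposition point concrete, the sketch below computes the attention scores Q·Kᵀ by reading Q and K row by row, so that each output element is a dot product of two stored rows and Kᵀ is never materialized. This is only a software illustration of the general idea; the paper's actual PE mapping and access schedule are not described on this page, and the function name is hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def scores_row_wise(q: Matrix, k: Matrix) -> Matrix:
    """Compute S = Q @ K^T without forming K^T.

    S[i][j] is the dot product of row i of Q and row j of K, so both
    operands are read in the same row-major order they are stored in.
    """
    scores = []
    for q_row in q:                      # one output row per query token
        out_row = []
        for k_row in k:                  # one output element per key token
            out_row.append(sum(a * b for a, b in zip(q_row, k_row)))
        scores.append(out_row)
    return scores


q = [[1.0, 2.0], [0.0, 1.0]]
k = [[3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(scores_row_wise(q, k))
```

The loop order is output-oriented in the sense that each output element is finished before the next one starts, which is one plausible reading of the phrase used above.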

Load-bearing premise

The dynamic token pruning and FFN2 pruning keep accuracy loss under 2 percent on the target vision tasks without extra hardware logic or retraining that would erase the reported efficiency gains.

What would settle it

Running the pruned model on a standard vision dataset such as ImageNet and checking whether top-1 accuracy drops more than 2 percent compared with the unpruned baseline.

Figures

Figures reproduced from arXiv: 2510.14393 by Ching-Lin Hsiung, Tian-Sheuan Chang.

Figure 1: Distribution analysis of the DeiT-small model
Figure 3: Visualization of the token pruning process
Figure 5: Distribution of FFN post-activation matrix values
Figure 6: Histogram of FFN post-activation accumulation along dimensions
Figure 8: The overview of hardware architecture
Figure 9: Detailed diagram of a PE Group (8x8)
Figure 12: Data access flow of FFN2
    Text captured with the caption: 2) FFN: In the FFN computation, the FFN1 output needs to be stored for the following FFN2. To avoid a larger intermediate buffer, we propose an interleaved FFN computing order. In this order, once FFN1 computations are completed, the results are passed through the ReLU activation function before entering the FFN2 pruning module, which determines whether the computed elements shoul…
Figure 10: Data access flow of fully connected operations
Figure 11: Detailed mapping of the fully connected operations for a PE group
Figure 14: Detailed mapping of Q × Kᵀ of (a) PE Group 0 and (b) PE Group 7 at the same cycle
    Text captured with the caption: … the weights to be mapped across all PE Groups in a single cycle. This efficient mapping, as shown in …
Figure 16: Block diagram of the FFN2 pruning module
Figure 15: Block diagram of token pruning module
Figure 19 illustrates a 56.4% reduction in total data fetch requirements when applying token pruning with ρ = 0.5. This significantly reduces computational overhead while improving hardware efficiency.
Figure 20: External memory access for FFN2 weight
Figure 18: Computation comparison with FACT [6]
Figure 21: External memory access with dynamic pruning
Figure 22: FFN2 weight skip ratio across layers
    Text captured with the caption: … the accuracy loss remains under 2%. The proposed pruning methods introduce minimal area overhead while achieving a reduction of 59.3% in external memory access for input tokens. The accelerator demonstrates notable improvements in speed and efficiency, achieving 2.31 TOPS/W and 858 GOPS/mm². With a gate count of 496.5K and a 232KB SRAM buffer, the chip occupies 1.19m…
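
As a rough illustration of dynamic token pruning with a ratio ρ, such as the ρ = 0.5 mentioned with Figure 19, the sketch below keeps the class token plus the top-scoring fraction of patch tokens. It assumes that ρ is the fraction of patch tokens kept and that tokens are ranked by a per-token importance score such as class-token attention; this page specifies neither, so the scoring rule, the function name, and the rounding are all placeholders.

```python
import math
from typing import List, Sequence


def prune_tokens(tokens: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 rho: float) -> List[Sequence[float]]:
    """Keep token 0 (class token) and the top rho fraction of the rest.

    tokens[0] is assumed to be the class token; scores[i] rates patch
    token tokens[i + 1]. Surviving tokens keep their original order.
    """
    if len(scores) != len(tokens) - 1:
        raise ValueError("need one score per patch token")
    n_keep = math.ceil(rho * len(scores))
    # Indices of the n_keep highest scores (ties broken by position).
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return [tokens[0]] + [tokens[i + 1] for i in kept]


tokens = [[0.0], [1.0], [2.0], [3.0], [4.0]]   # class token + 4 patches
scores = [0.1, 0.7, 0.05, 0.15]                # hypothetical importance
pruned = prune_tokens(tokens, scores, rho=0.5)
print(pruned)
```

A full sort is used here only for clarity; a hardware design would more likely use a threshold or a dedicated sorting engine, and the choice made in the paper is not visible from this page.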

Original abstract

Current transformer accelerators primarily focus on optimizing self-attention due to its quadratic complexity. However, this focus is less relevant for vision transformers with short token lengths, where the Feed-Forward Network (FFN) tends to be the dominant computational bottleneck. This paper presents a low power Vision Transformer accelerator, optimized through algorithm-hardware co-design. The model complexity is reduced using hardware-friendly dynamic token pruning without introducing complex mechanisms. Sparsity is further improved by replacing GELU with ReLU activations and employing dynamic FFN2 pruning, achieving a 61.5% reduction in operations and a 59.3% reduction in FFN2 weights, with an accuracy loss of less than 2%. The hardware adopts a row-wise dataflow with output-oriented data access to eliminate data transposition, and supports dynamic operations with minimal area overhead. Implemented in TSMC's 28nm CMOS technology, our design occupies 496.4K gates and includes a 232KB SRAM buffer, achieving a peak throughput of 1024 GOPS at 1GHz, with an energy efficiency of 2.31 TOPS/W and an area efficiency of 858.61 GOPS/mm².
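
The interaction between ReLU sparsity and FFN2 pruning described in the abstract can be sketched in a few lines. The version below assumes one particular pruning rule: a hidden channel is skipped, together with its FFN2 weight row, when its post-ReLU activation summed over all tokens does not exceed a threshold. That rule is a guess loosely suggested by the Figure 6 caption, not the paper's stated criterion, and all names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def relu(x: float) -> float:
    return x if x > 0.0 else 0.0


def ffn2_with_pruning(hidden: Matrix, w2: Matrix,
                      threshold: float) -> Tuple[Matrix, List[int]]:
    """Apply ReLU, then FFN2 while skipping low-activity hidden channels.

    hidden: FFN1 output, shape (tokens, d_hidden).
    w2:     FFN2 weights, shape (d_hidden, d_out).
    A hidden channel is skipped when its post-ReLU sum over all tokens
    is <= threshold (assumed criterion); its w2 row is then never read.
    Returns the output matrix and the list of channels actually used.
    """
    act = [[relu(v) for v in row] for row in hidden]
    d_hidden, d_out = len(w2), len(w2[0])
    channel_sum = [sum(row[c] for row in act) for c in range(d_hidden)]
    used = [c for c in range(d_hidden) if channel_sum[c] > threshold]
    out = [[0.0] * d_out for _ in act]
    for c in used:                        # fetch only the surviving rows
        for t, row in enumerate(act):
            if row[c] == 0.0:             # ReLU zero: no MAC needed
                continue
            for o in range(d_out):
                out[t][o] += row[c] * w2[c][o]
    return out, used


hidden = [[1.0, -2.0, 0.1], [3.0, -1.0, -0.5]]
w2 = [[1.0, 0.0], [5.0, 5.0], [2.0, 2.0]]
out, used = ffn2_with_pruning(hidden, w2, threshold=0.5)
print(out, used)
```

With a threshold of 0 only channels that are zero for every token are dropped, so the result equals the dense FFN2 output; a positive threshold trades some accuracy for fewer weight fetches. GELU would not produce exact zeros, which is why the ReLU substitution matters for this kind of skipping.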

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an algorithm-hardware co-design for a low-power Vision Transformer accelerator focused on short-token ViTs, where FFN is the dominant bottleneck. It applies hardware-friendly dynamic token pruning, replaces GELU with ReLU, and uses dynamic FFN2 pruning to reduce operations by 61.5% and FFN2 weights by 59.3% with <2% accuracy loss. The hardware employs row-wise dataflow to eliminate transposition and supports dynamic sparsity with minimal overhead. Implemented in TSMC 28nm, it reports 496.4K gates, 232KB SRAM, 1024 GOPS at 1 GHz, 2.31 TOPS/W energy efficiency, and 858.61 GOPS/mm² area efficiency.

Significance. If the accuracy claims hold on standard datasets without hidden retraining costs or excessive control logic overhead, the work offers a practical contribution to efficient ViT inference on edge hardware by targeting the FFN rather than attention. The concrete hardware metrics and co-design optimizations provide measurable efficiency gains that could inform future accelerator designs.

major comments (2)
  1. [Abstract] Abstract: The headline claims of 61.5% operation reduction, 59.3% FFN2 weight reduction, and less than 2% accuracy loss are presented without any reference to the datasets (e.g., ImageNet or CIFAR), the baseline ViT model and its original top-1 accuracy, the pruned accuracy numbers, or the pruning-rate schedule. Because these claims are load-bearing for the efficiency results, their absence leaves the central co-design claim unverifiable.
  2. [Hardware Architecture] Hardware implementation description: The assertion that dynamic operations incur 'minimal area overhead' lacks a quantitative breakdown (e.g., percentage of total gates or SRAM attributed to pruning control logic versus baseline accelerator), which is necessary to confirm that the reported 496.4K gates and efficiencies are not offset by the added dynamic support.
minor comments (2)
  1. [Abstract] The abstract and introduction could explicitly state the ViT variant and token length (e.g., 16 or 32 tokens) used for the short-token experiments to contextualize the FFN dominance claim.
  2. [Evaluation] A comparison table against prior ViT accelerators (e.g., reporting energy efficiency and area efficiency) would strengthen the hardware results section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify opportunities to strengthen the verifiability of our claims and the transparency of our hardware overhead analysis. We address each point below and indicate the corresponding revisions.

  1. Referee: [Abstract] Abstract: The headline claims of 61.5% operation reduction, 59.3% FFN2 weight reduction, and less than 2% accuracy loss are presented without any reference to the datasets (e.g., ImageNet or CIFAR), the baseline ViT model and its original top-1 accuracy, the pruned accuracy numbers, or the pruning-rate schedule. Because these claims are load-bearing for the efficiency results, their absence leaves the central co-design claim unverifiable.

    Authors: We agree that the abstract would be more self-contained with explicit references to the evaluation dataset, baseline model, and accuracy figures. While these details appear in the experimental results section, we will revise the abstract to concisely include the dataset (ImageNet), the baseline short-token ViT model, the original and pruned top-1 accuracies, and a note on the dynamic pruning schedule to improve immediate verifiability of the co-design claims. revision: yes

  2. Referee: [Hardware Architecture] Hardware implementation description: The assertion that dynamic operations incur 'minimal area overhead' lacks a quantitative breakdown (e.g., percentage of total gates or SRAM attributed to pruning control logic versus baseline accelerator), which is necessary to confirm that the reported 496.4K gates and efficiencies are not offset by the added dynamic support.

    Authors: We acknowledge that an explicit quantitative breakdown would strengthen the 'minimal area overhead' statement. Our synthesis data show that the additional control logic for dynamic token pruning, ReLU replacement, and FFN2 pruning contributes only a modest fraction of the total gate count. In the revised manuscript we will add a table or paragraph providing the gate-count and SRAM breakdown, separating the dynamic support overhead from the baseline accelerator components. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hardware results are self-contained

The paper describes an algorithm-hardware co-design for a Vision Transformer accelerator, reporting concrete implementation metrics such as gate count, SRAM size, throughput, energy efficiency, and area efficiency in TSMC 28nm, along with measured reductions from dynamic token pruning, GELU-to-ReLU replacement, and dynamic FFN2 pruning. No equations, first-principles derivations, fitted parameters, or predictions are presented that reduce to their own inputs by construction. The central claims rest on reported post-implementation outcomes rather than any self-definitional logic, self-citation chains, or renamed known results, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design relies on standard CMOS process assumptions and the premise that pruning thresholds can be chosen to keep accuracy loss under 2% without introducing new unverified mechanisms.

axioms (1)
  • domain assumption: Standard 28nm CMOS process characteristics and SRAM behavior hold as reported by the foundry.
    Invoked when stating gate count, SRAM size, and power numbers.

pith-pipeline@v0.9.0 · 5744 in / 1237 out tokens · 31699 ms · 2026-05-18T06:41:06.673615+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning, vol. 139, July 2021, pp. 10347–10357.

  2. [2] T. J. Ham, S. J. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H. Park, S. Lee, K. Park, J. W. Lee, and D.-K. Jeong, “A³: Accelerating attention mechanisms in neural networks with approximation,” in IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 328–341.

  3. [3] T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, and J. W. Lee, “ELSA: hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks,” in Proceedings of the 48th Annual International Symposium on Computer Architecture, 2021, pp. 692–705.

  4. [4] Y. Wang, Y. Qin, D. Deng, J. Wei, Y. Zhou, Y. Fan, T. Chen, H. Sun, L. Liu, S. Wei, and S. Yin, “A 28nm 27.5TOPS/W approximate-computing-based transformer processor with asymptotic sparsity speculating and out-of-order computing,” in IEEE International Solid-State Circuits Conference (ISSCC), vol. 65, 2022, pp. 1–3.

  5. [5] H. Wang, Z. Zhang, and S. Han, “SpAtten: Efficient sparse attention architecture with cascade token and head pruning,” in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 97–110.

  6. [6] Y. Qin, Y. Wang, D. Deng, Z. Zhao, X. Yang, L. Liu, S. Wei, Y. Hu, and S. Yin, “FACT: FFN-attention co-optimized transformer architecture with eager correlation prediction,” in Proceedings of the 50th Annual International Symposium on Computer Architecture, ser. ISCA ’23. Association for Computing Machinery, 2023.

  7. [7] G. Wang, S. Cai, W. Li, D. Lyu, and G. He, “Bsvit: A bit-serial vision transformer accelerator exploiting dynamic patch and weight bit-group quantization,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 71, no. 9, pp. 4064–4077, 2024.

  8. [8] Y. Xu, Z. Zhang, M. Zhang, K. Sheng, K. Li, W. Dong, L. Zhang, C. Xu, and X. Sun, “Evo-ViT: Slow-fast token evolution for dynamic vision transformer,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, pp. 2964–2972, Jun. 2022.

  9. [9] Y. Liang, C. Ge, Z. Tong, Y. Song, J. Wang, and P. Xie, “Not all patches are what you need: Expediting vision transformers via token reorganizations,” in International Conference on Learning Representations, 2022.

  10. [10] H. Yin, A. Vahdat, J. M. Alvarez, A. Mallya, J. Kautz, and P. Molchanov, “A-ViT: adaptive tokens for efficient vision transformer,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10799–10808.

  11. [11] M. Fayyaz, S. Abbasi Kouhpayegani, F. Rezaei Jafari, E. Sommerlade, H. R. Vaezi Joze, H. Pirsiavash, and J. Gall, “Adaptive token sampling for efficient vision transformers,” European Conference on Computer Vision (ECCV), 2022.

  12. [12] Y. Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “DynamicViT: Efficient vision transformers with dynamic token sparsification,” Advances in Neural Information Processing Systems (NeurIPS), 2021.

  13. [13] H. He, J. Cai, J. Liu, Z. Pan, J. Zhang, D. Tao, and B. Zhuang, “Pruning self-attentions into convolutional layers in single path,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 3910–3922, 2024.

  14. [14] P. Dong, M. Sun, A. Lu, Y. Xie, K. Liu, Z. Kong, X. Meng, Z. Li, X. Lin, Z. Fang, and Y. Wang, “HeatViT: hardware-efficient adaptive token pruning for vision transformers,” in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 442–455.

  15. [15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” The International Conference on Learning Representations, 2021.

  16. [16] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10012–10022.

  17. [17] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, F.-F. Tay, J. Feng, and S. Yan, “Tokens-to-token vit: Training vision transformers from scratch on imagenet,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 558–567.

  18. [18] H. Touvron, M. Cord, A. Sablayrolles, G. Synnaeve, and H. Jégou, “Going deeper with image transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 32–42.

  19. [19] S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, “ViTA: A vision transformer inference accelerator for edge applications,” in IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5.

  20. [20] S. Ghosh, S. Dasgupta, and S. Saha Ray, “A comparison-free hardware sorting engine,” in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2019, pp. 586–591.

  21. [21] S. Saha Ray and S. Ghosh, “K-degree parallel comparison-free hardware sorter for complete sorting,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 5, pp. 1438–1449, 2023.

  22. [22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: a large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

Ching-Lin Hsiung received the M.S. degree in electronics engineering from the National Yang Ming Chiao Tung University, Hsinchu, Taiwan, in 2024. He is curren...