arxiv: 2606.31938 · v1 · pith:ZVWISGPBnew · submitted 2026-06-30 · 💻 cs.AR · cs.CV · cs.DC · cs.LG

FlexViT: A Flexible FPGA-based Accelerator for Edge Vision Transformers

Pith reviewed 2026-07-01 02:11 UTC · model grok-4.3

classification 💻 cs.AR · cs.CV · cs.DC · cs.LG
keywords FPGA accelerator · Vision Transformers · edge inference · INT8 GEMM · reconfigurable dataflow · hardware-software co-design · im2col transformation

The pith

FlexViT maps fully connected and convolutional layers in Vision Transformers to one reconfigurable INT8 GEMM engine on FPGA for edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FlexViT, a reconfigurable accelerator built on the SECDA-TFLite framework that runs hybrid Vision Transformer models on resource-constrained edge FPGAs. Its hardware-software co-design maps both layer types onto a unified GEMM engine via runtime im2col, switches data reuse modes on the fly, and tiles computations so that all accumulations finish without writing partial sums off-chip. The design targets the PYNQ-Z2 board and reports measured speedups of up to 2.74x on accelerated layers and up to 1.40x end-to-end versus CPU-only execution. A sympathetic reader would care because edge ViT deployments struggle with the mix of layer shapes and the resulting memory and compute waste; a single flexible engine could reduce that friction.

Core claim

FlexViT is a reconfigurable FPGA accelerator that maps both fully connected and convolutional layers onto a unified high-throughput INT8 GEMM engine using runtime im2col transformation, a dual-mode dataflow that switches between input and weight reuse by reconfiguring the compute array at runtime, and a depth-first tiling strategy that completes accumulation in a single pass to eliminate off-chip partial-sum transfers.
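As a rough illustration of how a convolution can be routed through the same GEMM path as a fully connected layer, the sketch below unrolls image patches with im2col and multiplies them by flattened kernels. It is a plain-Python toy that assumes a single input channel, no padding, and unbounded integer accumulation. The function names (`im2col`, `gemm_int`, `conv2d_via_gemm`) are hypothetical and are not taken from the FlexViT code or from SECDA-TFLite.

```python
def im2col(image, kh, kw, stride=1):
    """Unroll a single-channel 2-D image (list of rows) into a patch matrix.

    Each row of the result is one flattened kh x kw patch.
    """
    h, w = len(image), len(image[0])
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    patches = []
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patches.append([image[r + di][c + dj]
                            for di in range(kh) for dj in range(kw)])
    return patches, out_h, out_w


def gemm_int(a, b):
    """Integer GEMM: a is M x K, b is K x N. Python ints stand in for a wide accumulator."""
    k, n = len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def conv2d_via_gemm(image, kernels, stride=1):
    """Apply a list of kh x kw kernels to the image using im2col followed by one GEMM."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    patches, out_h, out_w = im2col(image, kh, kw, stride)
    # Flatten each kernel, then transpose so the weight matrix is (kh*kw) x num_kernels.
    flat = [[ker[di][dj] for di in range(kh) for dj in range(kw)] for ker in kernels]
    weights = [[flat[n][p] for n in range(len(kernels))] for p in range(kh * kw)]
    # Result has one row per output pixel and one column per kernel.
    return gemm_int(patches, weights), out_h, out_w
```

For a 3x3 image and two 2x2 kernels, the patch matrix is 4x4 and the weight matrix is 4x2, so the whole convolution reduces to a single matrix product. A fully connected layer needs no unrolling and can call the GEMM routine directly, which is the sense in which one engine can serve both layer types.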

What carries the argument

Dual-mode dataflow that dynamically switches between input and weight reuse by reconfiguring the compute array at runtime, paired with the unified INT8 GEMM engine and depth-first tiling.
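One way to picture the single-pass accumulation is a tiled GEMM in which each output tile walks the entire reduction dimension before anything is written back. The sketch below is an interpretation of "depth-first tiling" under that assumption, not the paper's implementation. The tile sizes, the `acc` buffer standing in for on-chip memory, and both function names are hypothetical.

```python
def gemm_depth_first(a, b, tile_m=2, tile_n=2, tile_k=2):
    """Tiled GEMM that finishes the full K reduction for a tile before storing it.

    Returns the product and the number of output elements written back,
    which models off-chip output traffic.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    output_writes = 0
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            i1, j1 = min(i0 + tile_m, m), min(j0 + tile_n, n)
            # acc models an on-chip accumulator for one output tile.
            acc = [[0] * (j1 - j0) for _ in range(i1 - i0)]
            for k0 in range(0, k, tile_k):  # depth loop runs to completion here
                k1 = min(k0 + tile_k, k)
                for i in range(i0, i1):
                    for j in range(j0, j1):
                        for p in range(k0, k1):
                            acc[i - i0][j - j0] += a[i][p] * b[p][j]
            # Each output element is written exactly once, after accumulation ends.
            for i in range(i0, i1):
                for j in range(j0, j1):
                    c[i][j] = acc[i - i0][j - j0]
                    output_writes += 1
    return c, output_writes


def writes_if_k_outermost(m, n, k, tile_k):
    """Output writes for a hypothetical schedule that stores partial sums after every K slice."""
    k_slices = -(-k // tile_k)  # ceiling division
    return m * n * k_slices
```

With a 3x5 by 5x4 product and `tile_k=2`, the depth-first order writes 12 output elements, whereas the hypothetical K-outermost schedule would write 36 because every partial sum is stored three times. The comparison only counts writes; it says nothing about the buffer sizes or bandwidth of the actual design.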

If this is right

  • Accelerator-executed layers achieve up to 2.74x speedup over CPU execution.
  • End-to-end inference achieves up to 1.40x speedup compared to CPU-only execution.
  • Memory bandwidth demand drops because accumulation finishes in one pass with no off-chip partial sums.
  • Diverse layer shapes are supported through runtime reconfiguration of the same compute array.
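The gap between the layer-level and end-to-end figures can be related through Amdahl's law, which the paper's reference list includes. The sketch below inverts the law to estimate what share of CPU-only runtime the offloaded layers would have to occupy for both numbers to hold at once. It assumes the two "up to" figures come from the same model and run, which the abstract does not state, so the result is illustrative only. The function names are hypothetical.

```python
def end_to_end_speedup(fraction, layer_speedup):
    """Amdahl's law: overall speedup when `fraction` of runtime is sped up by `layer_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / layer_speedup)


def accelerated_fraction(layer_speedup, overall_speedup):
    """Invert Amdahl's law to recover the accelerated share of baseline runtime."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / layer_speedup)


# Assumption: both reported peak speedups describe the same workload.
fraction = accelerated_fraction(layer_speedup=2.74, overall_speedup=1.40)
print(f"implied accelerated share of CPU-only runtime: {fraction:.2f}")
```

Under that assumption the implied share is about 0.45, meaning a little under half of the CPU-only runtime would sit in layers that the accelerator handles. The remainder stays on the CPU and caps the end-to-end gain regardless of how fast the GEMM engine becomes.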

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The single unified engine could reduce the need for separate hardware modules when new ViT variants appear with different layer mixes.
  • Depth-first tiling may become more valuable on FPGAs with smaller on-chip buffers than the PYNQ-Z2.
  • The approach could be tested on other hybrid networks that combine dense and convolutional stages beyond Vision Transformers.

Load-bearing premise

The runtime im2col transformation and dual-mode reconfiguration incur sufficiently low overhead that the reported speedups remain net positive across the evaluated ViT models on the target PYNQ-Z2 platform.

What would settle it

Measure end-to-end wall-clock time for the same ViT models on the PYNQ-Z2 platform with the accelerator enabled versus disabled, including all reconfiguration and im2col costs. Because the paper claims speedups of up to 1.40x, the net-benefit claim does not hold if the best-performing model falls clearly short of 1.40x, or if any evaluated model runs slower with the accelerator than without it.
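A measurement of this kind could be scripted along the following lines. This is a minimal sketch, assuming the two execution paths are exposed as zero-argument Python callables; `run_inference`, the repeat counts, and the 1.40 threshold default are illustrative choices, not part of the FlexViT or SECDA-TFLite tooling.

```python
import statistics
import time


def median_latency(run_inference, repeats=10, warmup=2):
    """Median wall-clock seconds of run_inference() after a few warm-up calls."""
    for _ in range(warmup):
        run_inference()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def net_speedup(cpu_seconds, accel_seconds):
    """End-to-end speedup; accel_seconds should include im2col and reconfiguration time."""
    if cpu_seconds <= 0 or accel_seconds <= 0:
        raise ValueError("latencies must be positive")
    return cpu_seconds / accel_seconds


def claim_holds(cpu_seconds, accel_seconds, threshold=1.40):
    """True when the measured speedup meets the chosen threshold."""
    return net_speedup(cpu_seconds, accel_seconds) >= threshold
```

Using the median over several runs rather than a single timing reduces sensitivity to scheduler noise on the board's CPU. Timing the whole inference call, rather than only the GEMM kernels, is what keeps the transformation and reconfiguration overheads inside the comparison.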

Figures

Figures reproduced from arXiv: 2606.31938 by Hubert Dymarkowski, José Cano, Jude Haris, Rappy Saha, Xingjian Fu.

Figure 1: Vision Transformer Architecture.
B. Hybrid Vision Transformers. The Hybrid family of ViT models merges a different set of computational patterns. These models, such as MobileViT [16] and EfficientViT [17], were designed specifically to maximize efficiency for on-device inference. They achieve this by combining two types of operations: i) convolutions to extract local, low-level features; and ii) lightweight…

Figure 2: FlexViT architecture and computational flow.

Figure 3: FlexViT resource utilization on PYNQ-Z2 board.
Original abstract

Deploying Vision Transformer (ViT) models on edge platforms remains challenging due to their high computational demands and the architectural heterogeneity of modern hybrid ViT models, which incorporate both fully connected and convolutional layers. This heterogeneity leads to significant variation in tensor shapes, requiring flexible and efficient FPGA-based acceleration. In this paper, we present FlexViT, a reconfigurable FPGA accelerator for efficient ViT inference on resource-constrained edge devices. Built on the SECDA-TFLite framework, FlexViT employs a hardware-software co-design approach that maps both fully connected and convolutional layers onto a unified high-throughput INT8 GEMM engine using a runtime im2col transformation. To efficiently support diverse layer configurations, we propose a dual-mode dataflow that dynamically switches between input and weight reuse by reconfiguring the compute array at runtime. We further introduce a depth-first tiling strategy that completes accumulation in a single pass, eliminating off-chip partial-sum transfers and reducing memory bandwidth requirements. We implement FlexViT on a PYNQ-Z2 FPGA and evaluate it across a representative set of ViT models. FlexViT achieves up to 2.74x speedup on accelerator-executed layers, translating into up to 1.40x end-to-end speedup compared to CPU-only execution. The code is available at: https://github.com/gicLAB/FlexViT

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents FlexViT, a reconfigurable FPGA accelerator for Vision Transformer inference on edge devices built on the SECDA-TFLite framework. It maps both fully connected and convolutional layers to a unified high-throughput INT8 GEMM engine via runtime im2col transformation, introduces a dual-mode dataflow that switches between input and weight reuse through runtime reconfiguration of the compute array, and employs depth-first tiling to complete accumulations in a single pass without off-chip partial-sum transfers. The design is implemented on a PYNQ-Z2 platform and evaluated on representative ViT models, claiming up to 2.74× speedup on accelerator-executed layers and up to 1.40× end-to-end speedup versus CPU-only execution, with code released at https://github.com/gicLAB/FlexViT.

Significance. If the measured speedups hold under standard evaluation practices, the work provides a concrete demonstration of hardware-software co-design for handling architectural heterogeneity in hybrid ViT models on resource-constrained FPGAs. The open-source release and use of physical timing measurements on real hardware strengthen reproducibility and allow direct assessment of the unified GEMM plus depth-first tiling approach for edge deployment.

major comments (2)
  1. [Abstract] The central speedup claims (2.74× on accelerator layers, 1.40× end-to-end) are stated without specifying the exact ViT models evaluated, the CPU baseline configuration (processor, compiler flags, or library), the fraction of layers offloaded to the accelerator, or any statistical measures (multiple runs, error bars). These details are load-bearing for assessing whether the reported net speedups are robust and generalizable.
  2. [Evaluation (inferred from abstract claims)] The manuscript does not appear to include explicit measurements or analysis of the overheads from the runtime im2col transformation and dual-mode reconfiguration; without these, it is difficult to tell how much of the unified GEMM engine's gain these overheads consume, and whether the speedup stays net positive across models.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief table summarizing the key ViT models, their layer counts, and tensor-shape diversity to contextualize the heterogeneity challenge addressed by the dual-mode design.
  2. [Introduction/Approach] Notation for dataflow modes (input reuse vs. weight reuse) should be defined consistently when first introduced to aid readers unfamiliar with the SECDA-TFLite framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract] The central speedup claims (2.74× on accelerator layers, 1.40× end-to-end) are stated without specifying the exact ViT models evaluated, the CPU baseline configuration (processor, compiler flags, or library), the fraction of layers offloaded to the accelerator, or any statistical measures (multiple runs, error bars). These details are load-bearing for assessing whether the reported net speedups are robust and generalizable.

    Authors: We agree that the abstract would benefit from greater specificity. In the revision we will explicitly name the evaluated ViT models, state the CPU baseline (PYNQ-Z2 ARM Cortex-A9 with TFLite), note the fraction of layers offloaded, and clarify that reported timings are single-run hardware measurements on the target platform. If space allows we will also add a parenthetical reference to the evaluation section for statistical details. revision: yes

  2. Referee: [Evaluation (inferred from abstract claims)] The manuscript does not appear to include explicit measurements or analysis of the overheads from the runtime im2col transformation and dual-mode reconfiguration; without these, it is difficult to tell how much of the unified GEMM engine's gain these overheads consume, and whether the speedup stays net positive across models.

    Authors: The end-to-end speedups already incorporate the measured overhead of im2col and reconfiguration because all timings are taken on the complete SECDA-TFLite execution path. Nevertheless, we acknowledge that an explicit breakdown would strengthen the analysis. In the revised manuscript we will add a short table or paragraph in the evaluation section quantifying the reconfiguration and im2col latency relative to GEMM execution time for the evaluated models. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper reports measured speedups from physical FPGA implementation and timing on PYNQ-Z2 hardware using a unified GEMM engine, runtime im2col, dual-mode reconfiguration, and depth-first tiling. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the derivation of the core performance claims; results are externally falsifiable via the linked code and hardware runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Engineering implementation paper with no mathematical derivations or fitted constants; relies on standard FPGA design assumptions and the SECDA-TFLite framework.


Reference graph

Works this paper leans on

37 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,

    P. Gibson, J. Cano, E. Crowley, A. Storkey, and M. O’Boyle, “DLAS: A Conceptual Model for Across-Stack Deep Learning Acceleration,” ACM Transactions on Architecture and Code Optimization (TACO), 2025

  2. [2]

    ViTA: A vision transformer inference accelerator for edge applications,

    S. Nag, G. Datta, S. Kundu, N. Chandrachoodan, and P. A. Beerel, “ViTA: A vision transformer inference accelerator for edge applications,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS), 2023, pp. 1–5

  3. [3]

    An fpga-based reconfigurable accelerator for convolution-transformer hybrid efficientvit,

    H. Shao, H. Shi, W. Mao, and Z. Wang, “An fpga-based reconfigurable accelerator for convolution-transformer hybrid efficientvit,” in 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 2024, pp. 1–5

  4. [4]

    An energy-efficient fpga-based vision transformer accelerator via software-hardware co-design,

    J. Cao, J. Guo, W. Xiong, H. Luo, J. Wang, and J. Lai, “An energy-efficient fpga-based vision transformer accelerator via software-hardware co-design,” in 2025 IEEE 33rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2025, pp. 272–272

  5. [5]

    M2-vit: Accelerating hybrid vision transformers with two-level mixed quantization,

    Y. Liang, H. Shi, and Z. Wang, “M2-vit: Accelerating hybrid vision transformers with two-level mixed quantization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 33, no. 5, pp. 1492–1496, 2025

  6. [6]

    SECDA-TFLite: A toolkit for efficient development of FPGA-based DNN accelerators for edge inference,

    J. Haris, P. Gibson, J. Cano, N. Bohm Agostini, and D. Kaeli, “SECDA-TFLite: A toolkit for efficient development of FPGA-based DNN accelerators for edge inference,” Journal of Parallel and Distributed Computing, vol. 173, pp. 140–151, 2023

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” Jun. 2021, arXiv:2010.11929 [cs]. [Online]. Available: http://arxiv.org/abs/2010.11929

  8. [8]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, ser. NIPS’17. Curran Associates Inc., 2017, pp. 6000–6010

  9. [9]

    Training data-efficient image transformers & distillation through attention,

    H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training data-efficient image transformers & distillation through attention,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 10 347–10 357

  10. [10]

    Swin transformer: Hierarchical vision transformer using shifted windows,

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10 002

  11. [11]

    A Survey on Vision Transformer,

    K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, Z. Yang, Y. Zhang, and D. Tao, “A Survey on Vision Transformer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 87–110, Jan. 2023

  12. [12]

    CSWin Transformer: A General Vision Transformer Backbone With Cross-Shaped Windows,

    X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, “CSWin Transformer: A General Vision Transformer Backbone With Cross-Shaped Windows,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 124–12 134

  13. [13]

    ViViT: A Video Vision Transformer,

    A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “ViViT: A Video Vision Transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6836–6846

  14. [14]

    Devit: Decomposing vision transformers for collaborative inference in edge devices,

    G. Xu, Z. Hao, Y. Luo, H. Hu, J. An, and S. Mao, “Devit: Decomposing vision transformers for collaborative inference in edge devices,” IEEE Transactions on Mobile Computing, vol. 23, no. 5, pp. 5917–5932, May 2024

  15. [15]

    Imagenet: A large-scale hierarchical image database,

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255

  16. [16]

    Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,

    S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” in International Conference on Learning Representations, 2022

  17. [17]

    Efficientvit: Memory efficient vision transformer with cascaded group attention,

    X. Liu, H. Peng, N. Zheng, Y. Yang, H. Hu, and Y. Yuan, “Efficientvit: Memory efficient vision transformer with cascaded group attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 15 094–15 104

  18. [18]

    Edge computing: Vision and challenges,

    W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp. 637–646, 2016

  19. [19]

    SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference,

    J. Haris, P. Gibson, J. Cano, N. B. Agostini, and D. Kaeli, “SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference,” in SBAC-PAD, 2021, pp. 1–8

  20. [20]

    TUL Corporation, PYNQ-Z2 User Manual v1.0, TUL Corporation, May 2018

  21. [21]

    How to train your vit? data, augmentation, and regularization in vision transformers,

    A. P. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, and L. Beyer, “How to train your vit? data, augmentation, and regularization in vision transformers,” Transactions on Machine Learning Research, 2022

  22. [22]

    Validity of the single processor approach to achieving large scale computing capabilities,

    G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in Proceedings of the April 18-20, 1967, spring joint computer conference, 1967, pp. 483–485

  23. [23]

    (2025) USB Power Meter Digital Display - Voltage Current Amps Capacity Time Temperature Meter

    Makerfocus. (2025) USB Power Meter Digital Display - Voltage Current Amps Capacity Time Temperature Meter

  24. [24]

    ViA: A novel vision-transformer accelerator based on FPGA,

    T. Wang, L. Gong, C. Wang, Y. Yang, Y. Gao, X. Zhou, and H. Chen, “ViA: A novel vision-transformer accelerator based on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 11, pp. 4088–4099, 2022

  25. [25]

    Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization,

    Z. Li, M. Sun, A. Lu, H. Ma, G. Yuan, Y. Xie, H. Tang, Y. Li, M. Leeser, Z. Wang, X. Lin, and Z. Fang, “Auto-vit-acc: An fpga-aware automatic acceleration framework for vision transformer with mixed-scheme quantization,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL), 2022, pp. 109–116

  26. [26]

    Vaqf: Fully automatic software-hardware co-design framework for low-bit vision transformer,

    M. Sun, H. Ma, G. Kang, Y. Jiang, T. Chen, X. Ma, Z. Wang, and Y. Wang, “VAQF: Fully automatic software-hardware co-design framework for low-bit vision transformer,” 2022. [Online]. Available: https://arxiv.org/abs/2201.06618

  27. [27]

    Heatvit: Hardware-efficient adaptive token pruning for vision transformers,

    P. Dong, M. Sun, A. Lu, Y. Xie, K. Liu, Z. Kong, X. Meng, Z. Li, X. Lin, Z. Fang, and Y. Wang, “Heatvit: Hardware-efficient adaptive token pruning for vision transformers,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 442–455

  28. [28]

    Accelerating vit inference on fpga through static and dynamic pruning,

    D. Parikh, S. Li, B. Zhang, R. Kannan, C. Busart, and V. Prasanna, “Accelerating vit inference on fpga through static and dynamic pruning,” in 2024 IEEE 32nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2024, pp. 78–89

  29. [29]

    Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,

    H. You, Z. Sun, H. Shi, Z. Yu, Y. Zhao, Y. Zhang, C. Li, B. Li, and Y. Lin, “Vitcod: Vision transformer acceleration via dedicated algorithm and accelerator co-design,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 273–286

  30. [30]

    Token merging: Your vit but faster,

    D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman, “Token merging: Your vit but faster,” in The Eleventh International Conference on Learning Representations, 2023

  31. [31]

    Adaptiv: Sign-similarity based image-adaptive token merging for vision transformer acceleration,

    S. Yoo, H. Kim, and J.-Y. Kim, “Adaptiv: Sign-similarity based image-adaptive token merging for vision transformer acceleration,” in 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2024, pp. 64–77

  32. [32]

    DRViT: A dynamic redundancy-aware vision transformer accelerator via algorithm and architecture co-design on FPGA,

    X. Sun, Y. Zhang, Q. Wang, X. Zou, Y. Liu, Z. Zeng, and H. Zhuang, “DRViT: A dynamic redundancy-aware vision transformer accelerator via algorithm and architecture co-design on FPGA,” Journal of Parallel and Distributed Computing, vol. 199, p. 105042, 2025

  33. [33]

    Lightweight vision transformers for low energy edge inference,

    S. Nag, L. Liberty, A. Sivakumar, N. J. Yadwadkar, and L. K. John, “Lightweight vision transformers for low energy edge inference,” in Machine Learning for Computer Architecture and Systems 2024, 2024

  34. [34]

    An FPGA-based transformer accelerator using output block stationary dataflow for object recognition applications,

    Z. Zhao, R. Cao, K.-F. Un, W.-H. Yu, P.-I. Mak, and R. P. Martins, “An FPGA-based transformer accelerator using output block stationary dataflow for object recognition applications,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 70, no. 1, pp. 281–285, 2023

  35. [35]

    A 109-gops/w fpga-based vision transformer accelerator with weight-loop dataflow featuring data reusing and resource saving,

    Y. Zhang, L. Feng, H. Shan, and Z. Zhu, “A 109-gops/w fpga-based vision transformer accelerator with weight-loop dataflow featuring data reusing and resource saving,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 12, pp. 13 596–13 610, 2024

  36. [36]

    Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,

    K. Marino, P. Zhang, and V. K. Prasanna, “Me-vit: A single-load memory-efficient fpga accelerator for vision transformers,” in 2023 IEEE 30th International Conference on High Performance Computing, Data, and Analytics (HiPC), 2023, pp. 213–223

  37. [37]

    SWAT: An efficient swin transformer accelerator based on FPGA,

    Q. Dong, X. Xie, and Z. Wang, “SWAT: An efficient swin transformer accelerator based on FPGA,” in Proceedings of the 29th Asia and South Pacific Design Automation Conference (ASPDAC), 2024, pp. 515–520