ViM-Q: Scalable Algorithm-Hardware Co-Design for Vision Mamba Model Inference on FPGA

Patrick S. Y. Hung; Ray C. C. Cheung; Shengzhe Lyu; Weitao Xu; Yuhan She

arXiv: 2605.01935 · v1 · submitted 2026-05-03 · cs.AR · cs.CV · cs.LG

Shengzhe Lyu, Yuhan She, Patrick S. Y. Hung, Ray C. C. Cheung, Weitao Xu

Pith reviewed 2026-05-08 19:34 UTC · model grok-4.3

classification cs.AR · cs.CV · cs.LG

keywords quantization · models · co-design · fpga · inference · linear · vim-q · while

The pith

ViM-Q reports an average 4.96x speedup and a 59.8x energy-efficiency gain for low-batch ViM-tiny inference on an FPGA versus a quantized GPU baseline, using dynamic activation quantization, per-block APoT weights, and a pipelined SSM engine.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision Mamba (ViM) models process images with state space models (SSMs) whose cost scales linearly with sequence length, unlike the quadratic cost of Transformer attention. That efficiency is attractive for edge devices, but deploying ViM on FPGAs runs into three practical problems. Activations contain dynamic outliers that make static quantization ineffective; uniform quantization fails to capture the weight distribution at low bit-widths; and the associative scan that accelerates SSMs on GPUs has memory access patterns that are misaligned with the streaming dataflow FPGAs require.

ViM-Q addresses the quantization problems with a hardware-aware scheme: dynamic per-token quantization of activations, per-channel smoothing to reduce the impact of outliers, and a custom 4-bit per-block Additive Power-of-Two (APoT) format for weights. For the scan problem, it builds an FPGA accelerator with a linear engine whose lookup-table (LUT) unit replaces multiplications with shift-add operations, and a fine-grained pipelined SSM engine that parallelizes the state dimension while preserving the sequential recurrence. The hardware is configurable at runtime, so the same design can serve different ViM model dimensions and input resolutions.

Implemented on an AMD ZCU102 FPGA, the system reports an average 4.96x speedup and a 59.8x energy-efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low-batch inference on ViM-tiny. The work thus pairs model-level number-format changes with architecture-level streaming optimizations to make these models practical on resource-constrained hardware.
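The SSM engine's split between a parallel state dimension and a sequential token recurrence can be made concrete with a small sketch. It assumes the usual diagonal selective-SSM form for a single channel, h_t = a_t * h_{t-1} + b_t * x_t and y_t = c_t . h_t, and omits the skip and gating terms. The name `ssm_scan` and its argument layout are illustrative; the abstract does not describe the engine at this level of detail.

```python
from typing import List, Sequence


def ssm_scan(a_bar: Sequence[Sequence[float]],
             b_bar: Sequence[Sequence[float]],
             c: Sequence[Sequence[float]],
             x: Sequence[float]) -> List[float]:
    """Sequential diagonal SSM recurrence for one channel (illustrative only).

    a_bar[t][n], b_bar[t][n], c[t][n] are per-token, per-state coefficients;
    x[t] is the scalar input at token t. Returns y[t] for every token.
    """
    n_state = len(a_bar[0])
    h = [0.0] * n_state  # hidden state, carried from token to token
    y: List[float] = []
    for t in range(len(x)):  # sequential: token t needs h from token t-1
        # The n_state updates below do not depend on each other, so hardware
        # could unroll them into parallel lanes (assumed mapping).
        h = [a_bar[t][n] * h[n] + b_bar[t][n] * x[t] for n in range(n_state)]
        # Reduction across the state dimension, e.g. an adder tree in hardware.
        y.append(sum(c[t][n] * h[n] for n in range(n_state)))
    return y


# Toy input: 3 tokens, state dimension 2.
demo = ssm_scan(a_bar=[[0.5, 1.0]] * 3,
                b_bar=[[1.0, 1.0]] * 3,
                c=[[1.0, 1.0]] * 3,
                x=[1.0, 1.0, 1.0])
```

The outer loop is the dependency that cannot be removed without a scan-style reformulation; the inner list comprehension is the part that maps naturally onto parallel hardware lanes.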

Core claim

Implemented on an AMD ZCU102 FPGA, ViM-Q achieves an average 4.96x speedup and 59.8x energy efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low-batch inference on ViM-tiny.
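The two headline ratios can be related by simple arithmetic. If energy efficiency is taken to mean throughput per watt (an assumption; the abstract does not define the metric), dividing the efficiency gain by the speedup gives the implied ratio of average power draw. Both figures are averages, so the result is only indicative.

```python
def implied_power_ratio(speedup: float, efficiency_gain: float) -> float:
    """Return P_gpu / P_fpga, assuming efficiency = throughput / power.

    efficiency_gain = (T_fpga / P_fpga) / (T_gpu / P_gpu)
                    = speedup * (P_gpu / P_fpga)
    """
    return efficiency_gain / speedup


# Values quoted in the core claim above.
ratio = implied_power_ratio(speedup=4.96, efficiency_gain=59.8)
print(f"implied average power ratio (GPU / FPGA): {ratio:.1f}x")
```

Under that assumption the GPU baseline would draw roughly 12 times the power of the FPGA design, which means most of the reported efficiency gain comes from lower power rather than from the speedup.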

Load-bearing premise

The quantization scheme preserves sufficient model accuracy for the target use cases and the hardware design generalizes across the ViM family without major accuracy or performance loss; the abstract provides no accuracy metrics, ablation studies, or error bars to support this.
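A toy comparison shows the kind of evidence the premise would need. The sketch below measures round-trip error for one tensor-wide static scale versus a scale recomputed per token, on made-up activations where one token carries an outlier. The 8-bit width, the data, and all function names are assumptions for illustration; per-channel smoothing is omitted, and nothing here reflects measured ViM accuracy.

```python
from typing import List

Matrix = List[List[float]]  # rows are tokens, columns are channels


def fake_quant(row: List[float], scale: float, bits: int = 8) -> List[float]:
    """Quantize to signed integers with the given scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def mean_abs_error(acts: Matrix, scales: List[float], bits: int = 8) -> float:
    """Mean absolute round-trip error, using scales[t] for token t."""
    total, count = 0.0, 0
    for row, scale in zip(acts, scales):
        total += sum(abs(v - q) for v, q in zip(row, fake_quant(row, scale, bits)))
        count += len(row)
    return total / count


def static_scales(acts: Matrix, bits: int = 8) -> List[float]:
    """One scale for the whole tensor, as an offline calibration might fix it."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for row in acts for v in row)
    return [absmax / qmax or 1.0] * len(acts)


def per_token_scales(acts: Matrix, bits: int = 8) -> List[float]:
    """One scale per token, recomputed at runtime from that token's range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(abs(v) for v in row) / qmax or 1.0 for row in acts]


# Made-up activations: token 1 carries an outlier that dwarfs token 0.
acts = [[0.10, -0.20, 0.05, 0.15],
        [40.0, -3.00, 2.00, 1.00]]
err_static = mean_abs_error(acts, static_scales(acts))
err_dynamic = mean_abs_error(acts, per_token_scales(acts))
print(f"static: {err_static:.4f}  per-token: {err_dynamic:.4f}")
```

On this input the per-token scales give the lower error, because the small token no longer shares a step size with the outlier. Whether that translates into preserved task accuracy is exactly what the missing accuracy tables would have to show.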

Figures

Figures reproduced from arXiv: 2605.01935 by Patrick S. Y. Hung, Ray C. C. Cheung, Shengzhe Lyu, Weitao Xu, Yuhan She.

**Figure 1.** Architecture of the ViM encoder and the detailed dataflow

**Figure 2.** Input images and the corresponding input activation distri…

**Figure 4.** Unified linear engine design with LUT pre-computation. Key optimizations include: …

**Figure 5.** Latency breakdown of quantized versus FP16 linear layers

**Figure 6.** Data access misalignment between the token-major input …

**Figure 7.** Fine-grained pipelined SSM architecture.

**Figure 8.** Design space exploration of weight bit-widths

**Figure 10.** Physical floorplan of the implementation on ZCU102

**Figure 11.** Normalized throughput and energy efficiency across …

Original abstract

Vision Mamba (ViM) models offer a compelling efficiency advantage over Transformers by leveraging the linear complexity of State Space Models (SSMs), yet efficiently deploying them on FPGAs remains challenging. Linear layers struggle with dynamic activation outliers that render static quantization ineffective, while uniform quantization fails to capture the weight distribution at low bit-widths. Furthermore, while associative scan accelerates SSMs on GPUs, its memory access patterns are misaligned with the streaming dataflow required by FPGAs. To address these challenges, we present ViM-Q, a scalable algorithm-hardware co-design for end-to-end ViM inference on the edge. We introduce a hardware-aware quantization scheme combining dynamic per-token activation quantization and per-channel smoothing to mitigate outliers, alongside a custom 4-bit per-block Additive Power-of-Two (APoT) weight quantization. The models are deployed on a runtime-parameterizable FPGA accelerator featuring a linear engine employing a Lookup-Table (LUT) unit to replace multiplications with shift-add operations, and a fine-grained pipelined SSM engine that parallelizes the state dimension while preserving sequential recurrence. Crucially, the hardware supports runtime configuration, adapting to diverse dimensions and input resolutions across the ViM family. Implemented on an AMD ZCU102 FPGA, ViM-Q achieves an average 4.96x speedup and 59.8x energy efficiency gain over a quantized NVIDIA RTX 3090 GPU baseline for low-batch inference on ViM-tiny. This co-design shows a viable path for deploying ViM models on resource-constrained edge devices.
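The idea of replacing multiplications with shift-add operations can be sketched for additive power-of-two weights. The format below is hypothetical: one sign bit plus a 3-bit magnitude formed as the sum of two power-of-two terms, with one scale per weight block. The paper's actual 4-bit encoding, block size, and LUT organization are not given in the abstract, so `TERM_A`, `TERM_B`, and both function names are assumptions.

```python
from itertools import product
from typing import List, Optional, Tuple

# Hypothetical magnitude format: level = term_a + term_b, where each term is a
# power of two selected by a shift amount, or zero when the shift is None.
TERM_A: List[Optional[int]] = [None, 0, 2, 4]  # 0, 2^0, 2^2, 2^4
TERM_B: List[Optional[int]] = [None, 3]        # 0, 2^3

Code = Tuple[int, Optional[int], Optional[int]]  # (sign, shift_a, shift_b)


def level(shift_a: Optional[int], shift_b: Optional[int]) -> int:
    """Integer magnitude encoded by a pair of shifts."""
    a = 0 if shift_a is None else 1 << shift_a
    b = 0 if shift_b is None else 1 << shift_b
    return a + b


# All representable magnitudes: 0, 1, 4, 8, 9, 12, 16, 24.
LEVELS = sorted(((level(a, b), a, b) for a, b in product(TERM_A, TERM_B)),
                key=lambda entry: entry[0])
MAX_LEVEL = LEVELS[-1][0]


def quantize_block(weights: List[float]) -> Tuple[List[Code], float]:
    """Map one block of weights to shift codes plus a shared block scale."""
    absmax = max(abs(w) for w in weights)
    scale = absmax / MAX_LEVEL if absmax > 0 else 1.0
    codes: List[Code] = []
    for w in weights:
        target = abs(w) / scale
        _, a, b = min(LEVELS, key=lambda entry: abs(entry[0] - target))
        codes.append((-1 if w < 0 else 1, a, b))
    return codes, scale


def shift_add_dot(x_int: List[int], codes: List[Code]) -> int:
    """Integer dot product built from shifts and adds only, no multiplies."""
    acc = 0
    for x, (sign, a, b) in zip(x_int, codes):
        term = (0 if a is None else x << a) + (0 if b is None else x << b)
        acc += term if sign > 0 else -term
    return acc


codes, scale = quantize_block([2.4, -0.9, 0.1, 1.2])
acc = shift_add_dot([1, 2, -3, 4], codes)  # integer activations
result = acc * scale  # rescale once per block, after accumulation
```

Because every weight contributes at most two shifted copies of the activation, the per-element work reduces to shifts and additions, and the block scale is applied once after accumulation.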

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; it introduces no free parameters in a derivation, no mathematical axioms, and no new postulated entities such as particles or forces.


[1] [1]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”Advances in Neural Information Processing Systems (NeurIPS), vol. 30, 2017

work page 2017

[2] [2]

An image is worth 16x16 words: Transformers for image recognition at scale,

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” inProceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021

[3] [3]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 10 012–10 022

work page 2021

[4] [4]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inProceedings of the International Conference on Machine Learning (ICML), 2021, pp. 10 347–10 357

work page 2021

[5] [5]

Masked au- toencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked au- toencoders are scalable vision learners,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16 000–16 009

work page 2022

[6] [6]

Dynam- icvit: Efficient vision transformers with dynamic token sparsification,

Y . Rao, W. Zhao, B. Liu, J. Lu, J. Zhou, and C.-J. Hsieh, “Dynam- icvit: Efficient vision transformers with dynamic token sparsification,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 13 937–13 949, 2021

work page 2021

[7] [7]

BEit: BERT pre-training of image transformers,

H. Bao, L. Dong, S. Piao, and F. Wei, “BEit: BERT pre-training of image transformers,” inProceedings of the International Conference on Learning Representations (ICLR), 2022

work page 2022

[8] [8]

Taming transformers for high- resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high- resolution image synthesis,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12 873–12 883

work page 2021

[9] [9]

Transformers in vision: A survey,

S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,”ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022

work page 2022

[10] [10]

A survey on vision transformer,

K. Han, Y . Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y . Tang, A. Xiao, C. Xu, Y . Xuet al., “A survey on vision transformer,”IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 45, no. 1, pp. 87–110, 2022

work page 2022

[11] [11]

Mamba: Linear-time sequence modeling with selective state spaces,

A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” inConference on Language Modeling (COLM), 2024

work page 2024

[12] [12]

Transformers are ssms: generalized models and effi- cient algorithms through structured state space duality,

T. Dao and A. Gu, “Transformers are ssms: generalized models and effi- cient algorithms through structured state space duality,” inProceedings of the International Conference on Machine Learning (ICML), 2024, pp. 10 041–10 071

work page 2024

[13] [13]

Vision mamba: Efficient visual representation learning with bidirectional state space model,

L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” inProceedings of the International Conference on Ma- chine Learning (ICML), 2024

work page 2024

[14] [14]

Vmamba: Visual state space model,

Y . Liu, Y . Tian, Y . Zhao, H. Yu, L. Xie, Y . Wang, Q. Ye, J. Jiao, and Y . Liu, “Vmamba: Visual state space model,”Advances in Neural In- formation Processing Systems (NeurIPS), vol. 37, pp. 103 031–103 063, 2024

work page 2024

[15] [15]

Mambaout: Do we really need mamba for vision?

W. Yu and X. Wang, “Mambaout: Do we really need mamba for vision?” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 4484–4496

work page 2025

[16] [16]

Mambavision: A hybrid mamba- transformer vision backbone,

A. Hatamizadeh and J. Kautz, “Mambavision: A hybrid mamba- transformer vision backbone,” inProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 25 261–25 270

work page 2025

[17] [17]

Mobilemamba: Lightweight multi-receptive visual mamba network,

H. He, J. Zhang, Y . Cai, H. Chen, X. Hu, Z. Gan, Y . Wang, C. Wang, Y . Wu, and L. Xie, “Mobilemamba: Lightweight multi-receptive visual mamba network,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 4497– 4507

work page 2025

[18] [18]

Localmamba: Visual state space model with windowed selective scan,

T. Huang, X. Pei, S. You, F. Wang, C. Qian, and C. Xu, “Localmamba: Visual state space model with windowed selective scan,” inProceedings of the European Conference on Computer Vision (ECCV), 2024, pp. 12– 22

work page 2024

[19] [19]

Mamba in vision: A comprehensive survey of techniques and applications,

M. M. Rahman, A. A. Tutul, A. Nath, L. Laishram, S. K. Jung, and T. Hammond, “Mamba in vision: A comprehensive survey of techniques and applications,”arXiv preprint arXiv:2410.03105 (arXiv), 2024

work page arXiv 2024

[20] [20]

Vision mamba: A comprehensive survey and taxonomy,

X. Liu, C. Zhang, F. Huang, S. Xia, G. Wang, and L. Zhang, “Vision mamba: A comprehensive survey and taxonomy,”IEEE Transactions on Neural Networks and Learning Systems (TNNLS), 2025

work page 2025

[21] [21]

arXiv preprint arXiv:2502.07161 (2025)

F. Ibrahim, G. Liu, and G. Wang, “A survey on mamba architecture for vision applications,”arXiv preprint arXiv:2502.07161 (arXiv), 2025

work page arXiv 2025

[22] [22]

Ptq4vm: Post-training quantization for visual mamba,

Y . Cho, C. Lee, S. Kim, and E. Park, “Ptq4vm: Post-training quantization for visual mamba,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1176–1185

work page 2025

[23] [23]

Mamba-ptq: Outlier channels in recurrent large language models.arXiv preprint arXiv:2407.12397, 2024

A. Pierro and S. Abreu, “Mamba-ptq: Outlier channels in recurrent large language models,”arXiv preprint arXiv:2407.12397 (arXiv), 2024

work page arXiv 2024

[24] Z. Xu, Y. Yue, X. Hu, D. Yang, Z. Yuan, Z. Jiang, Z. Chen, JiangyongYu, XUCHEN, and S. Zhou, “Mambaquant: Quantizing the mamba family with variance aligned rotation methods,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[25] H.-Y. Chiang, C.-C. Chang, N. Frumkin, K.-C. Wu, and D. Marculescu, “Quamba: A post-training quantization recipe for selective state space models,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[26] H.-Y. Chiang, C.-C. Chang, N. Frumkin, K.-C. Wu, M. S. Abdelfattah, and D. Marculescu, “Quamba2: A robust and scalable post-training quantization framework for selective state space models,” in Proceedings of the International Conference on Machine Learning (ICML), 2025.

[27] B.-Y. Shi, Y.-C. Lo, A.-Y. Wu, and Y.-M. Tsai, “Post-training quantization for vision mamba with k-scaled quantization and reparameterization,” in Proceedings of the International Workshop on Machine Learning for Signal Processing (MLSP), 2025, pp. 1–6.

[28] J. Deng, S. Li, Z. Wang, K. Xu, H. Gu, and K. Huang, “Vim-vq: Efficient post-training vector quantization for visual mamba,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 24 518–24 527.

[29] A. Ramachandran, M. Lee, H. Xu, S. Kundu, and T. Krishna, “Ouromamba: A data-free quantization framework for vision mamba,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 21 177–21 186.

[30] Z. Mo, L. Wang, J. Wei, Z. Zeng, S. Cao, L. Ma, N. Jing, T. Cao, J. Xue, F. Yang et al., “Lut tensor core: A software-hardware co-design for lut-based low-bit llm inference,” in Proceedings of the International Symposium on Computer Architecture (ISCA), 2025, pp. 514–528.

[31] S. Dai, R. Venkatesan, M. Ren, B. Zimmer, W. Dally, and B. Khailany, “Vs-quant: Per-vector scaled quantization for accurate low-precision neural network inference,” Machine Learning and Systems (MLSys), vol. 3, pp. 873–884, 2021.

[32] L. Wang, L. Ma, S. Cao, Q. Zhang, J. Xue, Y. Shi, N. Zheng, Z. Miao, F. Yang, T. Cao et al., “Ladder: Enabling efficient low-precision deep learning computing through hardware-aware tensor transformation,” in Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024, pp. 307–323.

[33] J. Wei, S. Cao, T. Cao, L. Ma, L. Wang, Y. Zhang, and M. Yang, “T-mac: Cpu renaissance via table lookup for low-bit llm deployment on edge,” in Proceedings of the European Conference on Computer Systems (EuroSys), 2025, pp. 278–292.

[34] J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han, “Awq: Activation-aware weight quantization for on-device llm compression and acceleration,” Machine Learning and Systems (MLSys), vol. 6, pp. 87–100, 2024.

[35] J. Li, S. Huang, J. Xu, J. Liu, L. Ding, N. Xu, and G. Dai, “Marca: Mamba accelerator with reconfigurable architecture,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2024, pp. 1–9.

[36] R. Wei, S. Xu, L. Zhong, Z. Yang, Q. Guo, Y. Wang, R. Wang, and M. Li, “Lightmamba: Efficient mamba acceleration on fpga with quantization and hardware co-design,” in Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2025, pp. 1–7.

[37] K. Zhou, H. Jiao, W. Huang, and Y. Huang, “An efficient fpga-based hardware accelerator of fully quantized mamba-2,” in Proceedings of the IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2025, pp. 217–226.

[38] D. Yoon, G. Lee, J. Chang, Y. Lee, D. Lee, and M. Rhu, “Mamba-x: An end-to-end vision mamba accelerator for edge computing devices,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2025, pp. 1–9.

[39] L. Zhong, S. Xu, H. Wen, T. Xie, Q. Guo, Y. Wang, and M. Li, “Specmamba: Accelerating mamba inference on fpga with speculative decoding,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2025, pp. 1–9.

[40] A. Wang, H. Shao, S. Ma, and Z. Wang, “Fastmamba: A high-speed and efficient mamba accelerator on fpga with accurate quantization,” in Proceedings of the IEEE Computer Society Annual Symposium on VLSI (ISVLSI), vol. 1, 2025, pp. 1–6.

[41] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” in Proceedings of the International Conference on Machine Learning (ICML), 2023, pp. 38 087–38 099.

[42] M. Li, Y. Lin, Z. Zhang, T. Cai, J. Guo, X. Li, E. Xie, C. Meng, J.-Y. Zhu, and S. Han, “SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models,” in Proceedings of the International Conference on Learning Representations (ICLR), 2025.

[43] S.-E. Chang, Y. Li, M. Sun, R. Shi, H. K.-H. So, X. Qian, Y. Wang, and X. Lin, “Mix and match: A novel fpga-centric deep neural network quantization framework,” in Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 208–220.

[44] H. Shi, X. Cheng, W. Mao, and Z. Wang, “P²-vit: Power-of-two post-training quantization and acceleration for fully quantized vision transformer,” IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2024.

[45] Y. Li, X. Dong, and W. Wang, “Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2020.

[46] H. Yao, P. Li, J. Cao, X. Liu, C. Xie, and B. Wang, “Rapq: Rescuing accuracy for power-of-two low-bit post-training quantization,” in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2022, pp. 1573–1579.

[47] D. Przewlocka-Rus, S. S. Sarwar, H. E. Sumbul, Y. Li, and B. De Salvo, “Power-of-two quantization for low bitwidth and hardware compliant neural networks,” arXiv preprint arXiv:2203.05025, 2022.

[48] Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 27 168–27 183, 2022.

[49] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Llm.int8(): 8-bit matrix multiplication for transformers at scale,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 30 318–30 332, 2022.

[50] R. Sarkar, H. Liang, Z. Fan, Z. Wang, and C. Hao, “Edge-moe: Memory-efficient multi-task vision transformer architecture with task-level sparsity via mixture-of-experts,” in Proceedings of the IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2023, pp. 01–09.

[51] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “Finn: A framework for fast, scalable binarized neural network inference,” in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ISFPGA), 2017, pp. 65–74.

[52] E. Kabir, M. A. Kabir, A. R. Downey, J. D. Bakos, D. Andrews, and M. Huang, “Famous: Flexible accelerator for the attention mechanism of transformer on ultrascale+ fpgas,” in Proceedings of the International Conference on Field Programmable Technology (ICFPT), 2024, pp. 1–2.

[53] G. Li, S. Ye, C. Chen, Y. Wang, F. Yang, T. Cao, C. Liu, M. M. S. Aly, and M. Yang, “Lut-dla: Lookup table as efficient extreme low-bit deep learning accelerator,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 671–684.

[54] G. Park, H. Kwon, J. Kim, J. Bae, B. Park, D. Lee, and Y. Lee, “Figlut: An energy-efficient accelerator design for fp-int gemm using look-up tables,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1098–1111.

[55] G. Park, B. Park, M. Kim, S. Lee, J. Kim, B. Kwon, S. J. Kwon, B. Kim, Y. Lee, and D. Lee, “Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models,” in Proceedings of the International Conference on Learning Representations (ICLR), 2022.

[56] G. E. Blelloch, “Prefix sums and their applications,” 1990.