Integer-only Quantized Transformers for Embedded FPGA-based Time-series Forecasting in AIoT

Chao Qian; Gregor Schiele; Tianheng Ling

arxiv: 2407.11041 · v6 · submitted 2024-07-06 · 💻 cs.LG · cs.AI


Pith reviewed 2026-05-23 22:52 UTC · model grok-4.3


keywords: quantized transformers, FPGA acceleration, time-series forecasting, AIoT, integer quantization, embedded systems, quantization-aware training, on-device inference


The pith

4-bit integer-only quantized Transformers on embedded FPGAs match 8-bit accuracy while running up to 132 times faster and using 48 times less energy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build a hardware accelerator that runs Transformer models for time-series forecasting directly on small FPGAs used in AIoT devices. It applies integer-only quantization at 4 and 6 bits together with quantization-aware training so the models keep test loss within 0.63 percent of 8-bit versions reported elsewhere. A reader would care because this removes the need to send sensor data to distant servers and allows real-time prediction inside power-limited hardware. Full synthesis on a Spartan-7 FPGA supplies measured numbers for resources, latency, power, and energy.

Core claim

The authors design and implement an integer-only quantized Transformer accelerator on the Xilinx Spartan-7 XC7S15 FPGA. Combining 4-bit and 6-bit quantization with quantization-aware training yields models whose test loss rises by at most 0.63 percent relative to 8-bit baselines from prior work, while the 4-bit version runs up to 132.33 times faster and uses 48.19 times less energy. The work also records that lowering the bit width does not consistently cut latency or energy, so the optimization combinations have to be explored systematically.

What carries the argument

Integer-only quantization combined with quantization-aware training and a custom FPGA accelerator architecture that maps Transformer layers to on-chip resources.
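To make "integer-only" concrete, the following is a minimal sketch of a quantized linear layer in which inference uses only integer multiply, add, and shift. It is a generic illustration of the technique, not the authors' accelerator design: the function names, the symmetric (zero-point-free) scheme, the 16-bit fixed-point shift, and all scale values are assumptions introduced here.

```python
def int_range(bits):
    """Return (qmin, qmax) for a signed integer of the given bit width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def quantize(values, scale, bits):
    """Symmetric quantization: real value -> clamped signed integer."""
    qmin, qmax = int_range(bits)
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def fixed_point_multiplier(real_multiplier, shift=16):
    """Approximate a real rescale factor by an integer m, applied as (x * m) >> shift."""
    return round(real_multiplier * (1 << shift)), shift


def int_linear(q_x, q_w, q_b, m, shift, bits):
    """Integer-only linear layer: integer multiply-accumulate, then requantize.

    q_w is a list of weight rows; q_b holds biases quantized at scale s_x * s_w.
    """
    qmin, qmax = int_range(bits)
    outputs = []
    for row, bias in zip(q_w, q_b):
        acc = bias + sum(x * w for x, w in zip(q_x, row))  # wide accumulator
        y = (acc * m + (1 << (shift - 1))) >> shift        # rounding right shift
        outputs.append(max(qmin, min(qmax, y)))
    return outputs


# Hypothetical scales, chosen so the arithmetic is easy to check by hand.
s_x, s_w, s_y = 0.125, 0.25, 0.125
q_x = quantize([0.5, -0.25], s_x, bits=4)            # [4, -2]
q_w = [quantize([1.0, 0.5], s_w, bits=4)]            # [[4, 2]]
m, shift = fixed_point_multiplier(s_x * s_w / s_y)   # 0.25 -> (16384, 16)
q_y = int_linear(q_x, q_w, [0], m, shift, bits=4)    # [3]
print(q_y, [q * s_y for q in q_y])                   # real-valued result 0.375
```

The floating-point scales are only used offline to derive `m`; at inference time the layer needs nothing beyond integer arithmetic, which is what makes the scheme a natural fit for FPGA logic without floating-point units.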

If this is right

4-bit models increase test loss by only 0.63 percent yet deliver up to 132.33 times faster operation than 8-bit baselines.
Energy use drops by up to 48.19 times when the same 4-bit design is compared with prior 8-bit implementations.
Bit-width reduction alone does not guarantee lower latency or energy, so systematic exploration of all optimization choices is required.
A complete on-FPGA implementation demonstrates that Transformer inference is feasible inside embedded IoT devices.
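The point about bit width can be illustrated with a small sketch that treats energy per inference as power multiplied by latency and picks the best design point from a list of candidates. The `DesignPoint` type and every number below are placeholders invented for illustration; none of them are measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    """One candidate configuration (hypothetical fields for illustration)."""
    bits: int
    latency_ms: float
    power_mw: float

    @property
    def energy_uj(self) -> float:
        # mW * ms = microjoules
        return self.power_mw * self.latency_ms


def best_by_energy(points):
    """Return the candidate with the lowest energy per inference."""
    return min(points, key=lambda p: p.energy_uj)


# Placeholder values only, NOT results from the paper.
candidates = [
    DesignPoint(bits=8, latency_ms=2.0, power_mw=50.0),  # 100.0 uJ
    DesignPoint(bits=6, latency_ms=1.0, power_mw=60.0),  # 60.0 uJ
    DesignPoint(bits=4, latency_ms=1.5, power_mw=45.0),  # 67.5 uJ
]

best = best_by_energy(candidates)
print(best.bits, best.energy_uj)
```

With these made-up numbers the 6-bit point wins even though the 4-bit point draws the least power, which is the kind of outcome that makes comparing every candidate necessary rather than assuming the smallest bit width is best.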

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same integer quantization approach could be tested on other sequence models such as state-space or recurrent networks for the same hardware target.
Local inference on the device supports use cases where network connectivity is absent or data privacy rules forbid cloud transfer.
The released source code allows direct replication on additional FPGA boards or adaptation to different forecasting horizons.

Load-bearing premise

The accuracy, speed, and energy numbers obtained with the selected time-series datasets, quantization-aware training settings, and Spartan-7 synthesis parameters would also hold under other datasets or hardware configurations.

What would settle it

Running the 4-bit model on a new time-series dataset and recording a test-loss increase larger than 1 percent, or measuring an energy reduction on the same Spartan-7 board that falls short of the claimed 48-fold figure, would refute the performance claims.
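The check described above reduces to two ratios, which can be sketched as follows. The function names are hypothetical, the thresholds (1 percent and 48-fold) are taken from the paragraph above, and the sample inputs are illustrative values rather than measurements.

```python
def relative_increase_percent(baseline, candidate):
    """Percentage by which candidate exceeds baseline (e.g. test loss)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0


def reduction_factor(baseline, candidate):
    """How many times smaller candidate is than baseline (e.g. energy)."""
    if candidate <= 0:
        raise ValueError("candidate must be positive")
    return baseline / candidate


def claims_hold(loss_8bit, loss_4bit, energy_8bit, energy_4bit,
                max_loss_increase_pct=1.0, min_energy_factor=48.0):
    """True if neither refutation condition from the text is triggered."""
    loss_ok = relative_increase_percent(loss_8bit, loss_4bit) <= max_loss_increase_pct
    energy_ok = reduction_factor(energy_8bit, energy_4bit) >= min_energy_factor
    return loss_ok and energy_ok


# Illustrative inputs in arbitrary units, not measured values.
print(claims_hold(1.0, 1.0063, 48.19, 1.0))  # True
print(claims_hold(1.0, 1.02, 48.19, 1.0))    # False: loss rises by about 2 percent
```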

Figures

Figures reproduced from arXiv: 2407.11041 by Chao Qian, Gregor Schiele, Tianheng Ling.

**Figure 2.** PeMS dataset: RMSE variation. Change in RMSE (%) versus bitwidth (4, 6, 8), one series per combination of dmodel (8, 16, 32, 64) and n (6, 12, 18, 24).

**Figure 3.** AirU dataset: RMSE variation. Accompanying text from the paper: "The increase (at least 93.0%) in RMSE at 4-bit quantization is more pronounced on the PeMS dataset than on the AirU dataset, likely due to its univariate nature. In contrast, the multivariate nature of the AirU dataset provides enhanced [...]"

Original abstract

This paper presents the design of a hardware accelerator for Transformers, optimized for on-device time-series forecasting in AIoT systems. It integrates integer-only quantization and Quantization-Aware Training with optimized hardware designs to realize 6-bit and 4-bit quantized Transformer models, which achieved precision comparable to 8-bit quantized models from related research. Utilizing a complete implementation on an embedded FPGA (Xilinx Spartan-7 XC7S15), we examine the feasibility of deploying Transformer models on embedded IoT devices. This includes a thorough analysis of achievable precision, resource utilization, timing, power, and energy consumption for on-device inference. Our results indicate that while sufficient performance can be attained, the optimization process is not trivial. For instance, reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption, underscoring the necessity of systematically exploring various optimization combinations. Compared to an 8-bit quantized Transformer model in related studies, our 4-bit quantized Transformer model increases test loss by only 0.63%, operates up to 132.33x faster, and consumes 48.19x less energy. Relevant source code is provided in the accompanying GitHub repository: https://github.com/tianheng-ling/TinyTransformer4TS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a practical implementation of low-bit Transformers on cheap FPGAs with open code, though the standout performance numbers depend on comparisons to external 8-bit results.


The main thing to know is that the authors built and measured 4-bit and 6-bit integer-only Transformer models on a Spartan-7 FPGA for time-series tasks, with the code released on GitHub. That part is concrete and useful for edge deployment work. They integrate quantization-aware training with hardware designs optimized for the XC7S15 chip.

They handle the hardware side well by reporting resource utilization, timing, power, and energy consumption. They also examine how different quantization bitwidths and optimization combinations affect these metrics. One clear observation is that reducing the bitwidth does not consistently lead to lower latency or energy use, which highlights the need for careful tuning rather than assuming lower precision always helps.

The soft spot is the comparison to 8-bit models. The abstract states that their 4-bit model increases test loss by only 0.63% while being up to 132.33x faster and consuming 48.19x less energy than an 8-bit quantized Transformer from related studies. However, those baselines come from other papers, and without details on whether the models, datasets, or evaluation setups match, it is difficult to assess how fair the comparison is. The stress test note points out that the external baselines may differ in architecture or platform, which could invalidate the relative gains. If the full paper includes a direct comparison on the same hardware, that would strengthen the claims.

This paper is for people working on low-power IoT forecasting hardware using embedded FPGAs. Readers interested in FPGA synthesis results, quantization effects on real devices, and practical deployment metrics will get value from the measurements and the open code. It deserves a serious referee because it ships working code and detailed hardware numbers, even though the baseline comparisons could be tightened with more side-by-side data.

Referee Report

2 major / 1 minor

Summary. The manuscript describes the design and FPGA implementation (Xilinx Spartan-7 XC7S15) of integer-only 4-bit and 6-bit quantized Transformer models for on-device time-series forecasting in AIoT. Using quantization-aware training, the authors report that their 4-bit model increases test loss by only 0.63% relative to 8-bit quantized Transformers from prior work while delivering up to 132.33× speedup and 48.19× lower energy; source code is released on GitHub.

Significance. If the comparative claims can be substantiated with matched baselines, the result would show that low-bitwidth integer Transformers are deployable on low-cost embedded FPGAs with acceptable accuracy loss, providing concrete evidence for edge-AI time-series applications. The open-source repository is a clear strength for reproducibility.

major comments (2)

[Abstract / Results] Abstract and results section: the headline performance ratios (0.63 % test-loss delta, 132.33× latency, 48.19× energy) are obtained by comparing the authors’ 4-bit/6-bit Spartan-7 implementation against 8-bit Transformer numbers reported in unrelated studies. No table or subsection lists the reference papers, confirms identical forecasting tasks, sequence lengths, model depths/widths, or evaluation splits, or provides a re-implementation on XC7S15; without such verification the ratios are not interpretable.
[Experimental setup / Evaluation] Experimental setup: the manuscript supplies no information on the time-series datasets (size, number of series, train/test split), the exact baseline 8-bit implementations, or the measurement methodology (tools, averaging, error bars) used for latency and energy figures on the FPGA. These omissions prevent assessment of whether the reported deltas are statistically meaningful or representative.

minor comments (1)

[Abstract] The abstract states that “reducing the quantization bitwidth does not consistently result in decreased latency or energy consumption” but does not quantify this observation with a table of all bit-width / optimization combinations explored.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We agree that additional clarity on baselines and experimental details is needed to strengthen interpretability. Below we respond point-by-point and indicate planned revisions.


Referee: [Abstract / Results] Abstract and results section: the headline performance ratios (0.63 % test-loss delta, 132.33× latency, 48.19× energy) are obtained by comparing the authors’ 4-bit/6-bit Spartan-7 implementation against 8-bit Transformer numbers reported in unrelated studies. No table or subsection lists the reference papers, confirms identical forecasting tasks, sequence lengths, model depths/widths, or evaluation splits, or provides a re-implementation on XC7S15; without such verification the ratios are not interpretable.

Authors: We agree the comparisons are to published 8-bit results from related studies rather than matched re-implementations on the XC7S15. In revision we will add a dedicated subsection and table that (1) explicitly lists the reference papers and their reported 8-bit metrics, (2) notes the forecasting tasks, sequence lengths, and model configurations used in those works, and (3) states that the ratios are indicative rather than strictly matched. Because several prior works do not release code or target the same FPGA, a full re-implementation on XC7S15 is not feasible within the scope of this study; we will clarify this limitation explicitly. revision: partial
Referee: [Experimental setup / Evaluation] Experimental setup: the manuscript supplies no information on the time-series datasets (size, number of series, train/test split), the exact baseline 8-bit implementations, or the measurement methodology (tools, averaging, error bars) used for latency and energy figures on the FPGA. These omissions prevent assessment of whether the reported deltas are statistically meaningful or representative.

Authors: We will expand Section 4 (Experimental Setup) and the results section to include: (a) full dataset descriptions (number of series, lengths, train/test splits), (b) explicit references to the 8-bit baseline implementations and papers, and (c) measurement methodology details (Vivado power analysis settings, timing reports, number of inference runs averaged, and any error bars). These additions will allow readers to assess statistical meaningfulness. revision: yes
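As a rough sketch of what reporting "number of inference runs averaged, and any error bars" could look like, the helper below summarizes repeated latency or energy readings by their mean, sample standard deviation, and standard error. The function name and the sample readings are hypothetical; the paper's actual measurement procedure is not described in this review.

```python
import math
import statistics


def summarize_runs(samples):
    """Summarize repeated measurements of one quantity (e.g. latency in ms)."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)       # sample standard deviation (n - 1)
    sem = std / math.sqrt(len(samples))   # standard error of the mean
    return {"n": len(samples), "mean": mean, "std": std, "sem": sem}


# Hypothetical readings, for illustration only.
summary = summarize_runs([1.0, 2.0, 3.0])
print(f"{summary['mean']:.3f} +/- {summary['sem']:.3f} (n={summary['n']})")
```

Reporting the run count alongside the spread lets a reader judge whether a difference between two configurations is larger than the measurement noise.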

Circularity Check

0 steps flagged

No circularity; results are direct empirical measurements from implementation.


The paper presents an FPGA implementation of integer-only quantized Transformers for time-series forecasting. All reported metrics (test loss, latency, energy) are obtained from synthesis, timing analysis, and power measurement on the XC7S15 device after quantization-aware training. No equations, fitted parameters, or derivations are defined in terms of the target outputs. The comparisons to 8-bit baselines rely on external citations; whether they are valid is a question of experimental matching, not of circular self-definition or load-bearing self-citation. The derivation chain is therefore self-contained with respect to external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied engineering implementation paper; the abstract introduces no mathematical free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[1] L. Cheng, Y. Gu, Q. Liu, L. Yang, C. Liu, and Y. Wang, “Advancements in accelerating deep neural network inference on AIoT devices: a survey,” IEEE Transactions on Sustainable Computing, 2024.

[2] S. Liu, B. Guo, C. Fang, Z. Wang, S. Luo, Z. Zhou, and Z. Yu, “Enabling resource-efficient AIoT system with cross-level optimization: a survey,” IEEE Communications Surveys & Tutorials, 2023.

[3] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of Transformers,” AI Open, vol. 3, pp. 111–132, 2022.

[4] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: a survey,” ACM Computing Surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022.

[5] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun, “Transformers in time series: a survey,” in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023.

[6] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient Transformers: a survey,” ACM Computing Surveys, vol. 55, no. 6, pp. 1–28, 2022.

[7] T. Ling, C. Qian, L. Einhaus, and G. Schiele, “A study of quantisation-aware training on time series Transformer models for resource-constrained FPGAs,” arXiv preprint arXiv:2310.02654, 2023.

[8] T. Becnel, K. Kelly, and P.-E. Gaillardon, “Tiny time-series Transformers: realtime multi-target sensor inference at the edge,” in International Conference on Omni-layer Intelligent Systems. IEEE, 2022.

[9] Z. Liu, P. Yin, and Z. Ren, “An efficient FPGA-based accelerator for Swin Transformer,” arXiv preprint arXiv:2308.13922, 2023.

[10] T. Ling, J. Hoever, C. Qian, and G. Schiele, “Flowprecision: advancing FPGA-based real-time fluid flow estimation with linear quantization,” in 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops). IEEE Computer Society, 2024, pp. 733–738.

[11] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: a whitepaper,” arXiv preprint arXiv:1806.08342, 2018.

[12] J. R. Stevens, R. Venkatesan, S. Dai, B. Khailany, and A. Raghunathan, “Softermax: hardware/software co-design of an efficient Softmax for Transformers,” in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 469–474.

[13] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, “I-BERT: integer-only BERT quantization,” in International Conference on Machine Learning. PMLR, 2021, pp. 5506–5518.

[14] Z. Li and Q. Gu, “I-ViT: integer-only quantization for efficient vision Transformer inference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17065–17075.

[15] I. Vasyltsov and W. Chang, “Efficient Softmax approximation for deep neural networks with attention mechanism,” arXiv preprint arXiv:2111.10770, 2021.

[16] C. Qian, T. Ling, and G. Schiele, “Energy efficient LSTM accelerators for embedded FPGAs through parameterised architecture design,” in International Conference on Architecture of Computing Systems, 2023.

[17] J.-S. Chiang, E. Lai, J.-Y. Liao et al., “A radix-2 non-restoring 32-b/32-b ring divider with asynchronous control scheme,” Journal of Applied Science and Engineering, vol. 2, no. 1, pp. 37–43, 1999.

[18] C. Qian, T. Ling, and G. Schiele, “Enhancing energy-efficiency by solving the throughput bottleneck of LSTM cells for embedded FPGAs,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2022, pp. 594–605.

[19] S. D. Yamini, G. S. Mirishkar, A. K. Vuppala, and S. Purini, “Hardware accelerator for Transformer based end-to-end automatic speech recognition system,” in International Parallel and Distributed Processing Symposium Workshops. IEEE, 2023, pp. 93–100.

[20] I. Sobakinskikh, “Optimizing Transformer neural network for real-time outlier detection,” 2023.
