
arxiv: 2510.03516 · v3 · submitted 2025-10-03 · 📡 eess.SP

COMET: Co-Optimization of a CNN Model using Efficient-Hardware OBC Techniques

Pith reviewed 2026-05-18 09:52 UTC · model grok-4.3

classification 📡 eess.SP
keywords convolutional neural networks · FPGA acceleration · offset binary coding · hardware optimization · lookup table techniques · edge deployment · matrix multiplication core

The pith

COMET applies offset-binary coding separately to CNN inputs and weights to build four lookup-table methods that cut FPGA resource use while preserving nearly full accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COMET as a co-optimization framework that formulates CNN inference with offset-binary coding representations for inputs in one scheme and weights in another. This separation exploits bit-width asymmetry and lets the authors modify the shift-accumulate step by folding in an offset term with a pre-scaled bias. Symmetries between the two schemes are then used to derive four distinct lookup-table realizations for the core computation, which feed into a general matrix-multiplication engine built around the im2col transform. Tests on LeNet-5 and All-CNN-C show the resulting hardware design matches or improves on existing FPGA accelerators while keeping accuracy loss negligible. A sympathetic reader would care because the work directly targets the resource bottleneck that prevents powerful vision models from running on low-power edge hardware.

Core claim

COMET formulates CNN inference using OBC representations applied separately to inputs (Scheme A) and weights (Scheme B), enabling exploitation of bit-width asymmetry. The shift-accumulate operation is modified by incorporating an offset term with the pre-scaled bias. Leveraging symmetries in the two schemes, four look-up table techniques are introduced and combined into an OBC-GEMM core that accelerates CNN workloads on FPGA hardware, delivering improved efficiency and resource utilization compared with prior designs while incurring only negligible accuracy loss on the evaluated networks.
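The lookup-plus-shift-accumulate idea behind this claim can be sketched in a few lines. What follows is the textbook offset-binary form of distributed arithmetic applied to the inputs, not the paper's Scheme A or Scheme B as specified by the authors; the function names `build_obc_lut` and `obc_dot`, the half-size table layout, and the way the offset is folded into a doubled bias are all assumptions made for illustration.

```python
from itertools import product


def build_obc_lut(weights):
    """Half-size table (assumed layout): one entry per +/-1 sign pattern
    whose first sign is +1. The mirrored patterns are recovered by negation."""
    lut = {}
    for tail in product((1, -1), repeat=len(weights) - 1):
        signs = (1,) + tail
        lut[signs] = sum(w * s for w, s in zip(weights, signs))
    return lut


def obc_dot(weights, inputs, bits, bias=0):
    """Return sum(w * x) + bias for two's-complement integer inputs that fit
    in `bits` bits, using only table lookups, negations, shifts and adds."""
    lut = build_obc_lut(weights)
    # Offset term -sum(w) merged with the bias, which is pre-scaled by 2
    # because the accumulator holds twice the final result.
    acc = 2 * bias - sum(weights)
    for j in range(bits):
        # Map bit j of every input from {0, 1} to {-1, +1}.
        signs = tuple(2 * ((x >> j) & 1) - 1 for x in inputs)
        if signs[0] == 1:
            partial = lut[signs]
        else:
            partial = -lut[tuple(-s for s in signs)]
        # The sign bit carries negative weight in two's complement.
        scale = -(1 << j) if j == bits - 1 else (1 << j)
        acc += scale * partial
    return acc // 2  # acc is always even, so this division is exact
```

For example, `obc_dot([3, -5, 7, 2], [-8, 7, 3, -1], bits=4)` returns -40, the same value as the direct dot product. With K = 4 weights the table holds 8 entries instead of 16, which is the kind of symmetry saving the summary alludes to.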

What carries the argument

The four LUT techniques (parallel, shared, split, and hybrid) derived from symmetries between OBC Schemes A and B, which replace the standard shift-accumulate operation and power the OBC-GEMM core for im2col-based CNN acceleration.

If this is right

  • CNN inference runs with lower FPGA resource counts than state-of-the-art accelerators while accuracy stays nearly identical.
  • The same OBC-GEMM core scales to different network architectures without redesign of the underlying arithmetic.
  • Modern workloads become feasible on resource-constrained FPGAs through the im2col-based general matrix multiplication path.
  • Co-optimization of model representation and hardware mapping yields measurable gains in both speed and area.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same input-weight asymmetry and symmetry exploitation could be tested on other hardware fabrics such as ASICs or coarse-grained reconfigurable arrays.
  • Combining the OBC LUT methods with existing quantization or pruning pipelines might produce still larger resource savings.
  • The approach suggests a general template for trading arithmetic precision for table-based computation in any matrix-heavy workload.

Load-bearing premise

Symmetries between the offset-binary coding schemes for inputs and weights can be turned into four lookup-table methods that keep convolutional neural network accuracy essentially unchanged while cutting FPGA hardware resources.

What would settle it

A side-by-side FPGA implementation of a COMET-optimized LeNet-5 or All-CNN-C model that reports either noticeably higher LUT or DSP consumption than claimed or an accuracy drop larger than the reported negligible loss would refute the efficiency result.

Figures

Figures reproduced from arXiv: 2510.03516 by Boyang Chen, George Goussetis, João F. C. Mota, Jongeun Lee, Mathini Sellathurai, Mohd Tasleem Khan, Yuan Ding.

Figure 1: Illustration of the im2col transformation of a 2D convolution into a GEMM operation, assuming no padding in the input feature map and stride = 1.
    … with training factors. For example, replacing tanh activations with ReLU has been shown to improve performance in practice [20], and substituting large fully connected layers with global average pooling (GAP) can yield a more efficient design [21]. Distributed arit…
Figure 2: OBC-based IPC for K = 4 with (a) traditional LUT and SA unit [25], (b) hardware LUT (w/o pipelining) and CSFA-based SA unit [26].
    … a 2D convolution of $X$ with a filter tensor $\Theta \in \mathbb{R}^{N \times C \times K \times L}$ is a tensor $Y \in \mathbb{R}^{N \times (H+K) \times (W+L)}$, whose entries are given by

    $$Y_{n,h,w} = \sum_{c=1}^{C} \sum_{k=1}^{K} \sum_{l=1}^{L} X_{c,h+k,w+l}\,\Theta_{n,c,k,l} + \beta_n, \tag{1}$$

    where $n = 1, \ldots, N$, $h = 1, \ldots, H+K$, $w = 1, \ldots, W+L$, and $\beta_n \in \mathbb{R}$ represents a bias te…
Figure 3: System-level diagram of the proposed CNN accelerator. Note…
Figure 5: Top-level schematic of the proposed OBC-GEMM core.
Figure 6: LUT architectures for K = 4 based on the proposed techniques: (a) Parallel, (b) Shared, (c) Split, and (d) Hybrid.
    2) Shared LUT: As the name suggests, the shared LUT approach is derived from the parallel LUT by sharing the generated LUT contents to reduce the generated partial sums, as illustrated in…
Figure 7: LUT usage and power consumption versus clock frequency for different proposed techniques with…
Figure 8: FF usage versus clock frequency for different techniques for…
Figure 9: Accuracy of original and modified LeNet-5 under QAT with two sets…
Figure 10: LUTs/FFs usage and power of OBC-GEMM for different…
original abstract

Convolutional Neural Networks (CNNs) achieve remarkable accuracy in vision tasks, yet their computational complexity challenges low-power edge deployment. In this work, we present COMET, a framework of CNN models that employ efficient hardware offset-binary coding (OBC) techniques to enable co-optimization of performance and resource utilization. The approach formulates CNN inference using OBC representations applied separately to inputs (Scheme A) and weights (Scheme B), enabling exploitation of bit-width asymmetry. The shift-accumulate operation is modified by incorporating offset-term with the pre-scaled bias. Leveraging symmetries in Schemes A and B, we introduce four look-up table (LUT) techniques -- parallel, shared, split, and hybrid -- and evaluate their efficiency. Building on this foundation, we develop a general matrix multiplication core using the im2col transformation for efficient CNN acceleration. We consider LeNet-5 and All-CNN-C to demonstrate that the OBC-GEMM core efficiently supports modern workloads. Evaluation shows that COMET enables efficient FPGA deployment compared to state-of-the-art designs, with negligible accuracy loss, demonstrating its efficiency and scalability across diverse network architectures.
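The im2col step the abstract relies on can be illustrated with a minimal pure-Python sketch, assuming stride 1 and no padding as in the Figure 1 caption. The function names (`conv2d_direct`, `im2col`, `conv2d_gemm`), the nested-list tensor layout and the zero-based indexing are choices made here for illustration and are not taken from the paper; the output size is assumed to be (input height - K + 1) by (input width - L + 1).

```python
def conv2d_direct(x, theta, beta):
    """Reference convolution. x[c][i][j] is the input, theta[n][c][k][l] the
    filters, beta[n] the biases; stride 1, no padding."""
    C, K, L = len(x), len(theta[0][0]), len(theta[0][0][0])
    h_out = len(x[0]) - K + 1
    w_out = len(x[0][0]) - L + 1
    return [[[beta[n] + sum(x[c][h + k][w + l] * theta[n][c][k][l]
                            for c in range(C)
                            for k in range(K)
                            for l in range(L))
              for w in range(w_out)]
             for h in range(h_out)]
            for n in range(len(theta))]


def im2col(x, K, L):
    """Flatten every K x L receptive field (across all channels) into one row."""
    C = len(x)
    h_out = len(x[0]) - K + 1
    w_out = len(x[0][0]) - L + 1
    return [[x[c][h + k][w + l]
             for c in range(C) for k in range(K) for l in range(L)]
            for h in range(h_out) for w in range(w_out)]


def conv2d_gemm(x, theta, beta):
    """The same convolution written as one matrix product over the patches."""
    K, L = len(theta[0][0]), len(theta[0][0][0])
    w_out = len(x[0][0]) - L + 1
    patches = im2col(x, K, L)  # (h_out * w_out) rows, C*K*L columns
    # Flatten each filter in the same (c, k, l) order used by im2col.
    filters = [[v for ch in f for row in ch for v in row] for f in theta]
    out = []
    for n, f in enumerate(filters):
        flat = [beta[n] + sum(a * b for a, b in zip(f, p)) for p in patches]
        # Fold the flat output row back into an h_out x w_out feature map.
        out.append([flat[i:i + w_out] for i in range(0, len(flat), w_out)])
    return out
```

Each row-by-row product in `conv2d_gemm` is a plain inner product of length C*K*L, which is the operation a lookup-table inner-product unit would replace in a hardware GEMM core.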

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents COMET, a framework for co-optimizing CNN inference on FPGAs via offset-binary coding (OBC). It applies OBC separately to inputs (Scheme A) and weights (Scheme B) to exploit bit-width asymmetry, modifies the shift-accumulate operation by adding an offset term and pre-scaled bias, derives four LUT techniques (parallel, shared, split, hybrid) from symmetries in the OBC representations, and implements a general matrix-multiplication core using the standard im2col transformation. Evaluation on LeNet-5 and All-CNN-C is claimed to show efficient FPGA resource utilization compared with state-of-the-art designs while incurring negligible accuracy loss.

Significance. If the accuracy-preservation claim holds with rigorous error bounds, the work would supply a concrete, hardware-grounded method for reducing FPGA resource consumption in CNN accelerators by leveraging existing im2col and LUT primitives together with OBC symmetries. The explicit construction of four distinct LUT organizations and the integration into a reusable GEMM core constitute reusable engineering contributions that could be adopted by other FPGA CNN flows.

major comments (2)
  1. [Abstract and OBC Schemes section] The central claim that the four LUT techniques together with the modified shift-accumulate produce outputs that are mathematically equivalent (or bounded-error) to standard convolution is load-bearing for the entire efficiency argument, yet no derivation of the quantization error introduced by the offset term and pre-scaled bias is supplied. Without an explicit bound that is independent of bit-width and layer depth, the assertion of “negligible accuracy loss” on LeNet-5 and All-CNN-C cannot be verified from the given description.
  2. [Evaluation section] The experimental evaluation is described only at the level of the abstract; no tables, error bars, baseline comparisons (e.g., against plain im2col GEMM or prior OBC accelerators), or exclusion criteria for the reported accuracy figures are visible. This absence prevents confirmation that the claimed FPGA resource savings are achieved without accuracy degradation.
minor comments (2)
  1. [Hardware Implementation] Define the offset term and pre-scaled bias explicitly in the equations for the modified shift-accumulate operation; the current description leaves their scaling factors and bit-width handling ambiguous.
  2. [Results] Add a short table summarizing LUT, DSP, and BRAM counts for each of the four LUT organizations on the target FPGA device.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. We address each major comment below and describe the revisions that will be incorporated to strengthen the mathematical rigor and experimental presentation of the manuscript.

point-by-point responses
  1. Referee: [Abstract and OBC Schemes section] The central claim that the four LUT techniques together with the modified shift-accumulate produce outputs that are mathematically equivalent (or bounded-error) to standard convolution is load-bearing for the entire efficiency argument, yet no derivation of the quantization error introduced by the offset term and pre-scaled bias is supplied. Without an explicit bound that is independent of bit-width and layer depth, the assertion of “negligible accuracy loss” on LeNet-5 and All-CNN-C cannot be verified from the given description.

    Authors: We acknowledge that an explicit derivation of the quantization error arising from the offset term and pre-scaled bias is necessary to substantiate the equivalence claim. In the revised manuscript we will insert a new subsection that derives the error introduced by the modified shift-accumulate operation under both Scheme A and Scheme B. The derivation will establish that the output remains mathematically equivalent to standard convolution when the offset is correctly compensated, and will supply an error bound that is independent of bit-width and network depth for the fixed-point representations employed. This addition will directly support the “negligible accuracy loss” statement with verifiable bounds. revision: yes

  2. Referee: [Evaluation section] The experimental evaluation is described only at the level of the abstract; no tables, error bars, baseline comparisons (e.g., against plain im2col GEMM or prior OBC accelerators), or exclusion criteria for the reported accuracy figures are visible. This absence prevents confirmation that the claimed FPGA resource savings are achieved without accuracy degradation.

    Authors: We agree that the current evaluation section does not provide sufficient detail for independent verification. In the revision we will expand the experimental results with (i) comprehensive tables reporting LUT, DSP, BRAM, and power utilization for the four LUT organizations on both LeNet-5 and All-CNN-C, (ii) direct comparisons against a plain im2col GEMM baseline and representative prior OBC accelerators, (iii) accuracy figures accompanied by error bars obtained from multiple training/inference runs, and (iv) explicit statements of any data-exclusion criteria. These additions will allow readers to confirm that the reported resource savings are obtained without accuracy degradation. revision: yes

Circularity Check

0 steps flagged

No circularity: COMET proposes OBC schemes and LUT techniques with empirical validation

full rationale

The paper defines OBC representations for inputs (Scheme A) and weights (Scheme B), modifies the shift-accumulate operation with an offset term and pre-scaled bias, and then introduces four LUT techniques (parallel, shared, split, hybrid) by leveraging symmetries between the schemes. These are presented as design choices, implemented in an im2col-based GEMM core, and evaluated empirically on LeNet-5 and All-CNN-C for resource savings and accuracy. No equation reduces the reported efficiency or the negligible accuracy loss to quantities defined by the same fitted parameters or by self-referential inputs; the claims rest on hardware implementation results rather than on by-construction equivalence. The derivation is self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard hardware assumptions plus the novel OBC formulation; no new physical entities are postulated and only modest free parameters (bit-width selections) appear to be chosen for the target networks.

free parameters (1)
  • input and weight bit-widths
    Selected to exploit asymmetry between Schemes A and B; specific values are not stated in the abstract but must be chosen to achieve the reported resource-accuracy trade-off.
axioms (1)
  • domain assumption: OBC representations applied separately to inputs and weights preserve functional equivalence of CNN inference when the shift-accumulate is modified by an offset term and pre-scaled bias
    Invoked when the paper formulates CNN inference using OBC and states that the modified operation supports the LUT techniques.
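To show what this assumption amounts to, here is the textbook offset-binary identity from distributed arithmetic, written for B-bit two's-complement integers. The symbols $x_k$, $w_k$, $b_{k,j}$, $d_{k,j}$, $s_j$ and $\beta$ are notation chosen here for illustration and may not match the paper's own definitions of Schemes A and B.

```latex
\begin{aligned}
x_k &= -b_{k,B-1}\,2^{B-1} + \sum_{j=0}^{B-2} b_{k,j}\,2^{j},
\qquad d_{k,j} = 2b_{k,j} - 1 \in \{-1, +1\},\\[4pt]
2x_k &= \sum_{j=0}^{B-1} s_j\,2^{j}\,d_{k,j} - 1,
\qquad s_j = \begin{cases} -1, & j = B-1,\\ +1, & \text{otherwise,} \end{cases}\\[4pt]
\sum_{k=1}^{K} w_k x_k + \beta
&= \frac{1}{2}\Bigg[\sum_{j=0}^{B-1} s_j\,2^{j}
\underbrace{\sum_{k=1}^{K} w_k\,d_{k,j}}_{\text{LUT entry}}
+ \underbrace{\Big(2\beta - \sum_{k=1}^{K} w_k\Big)}_{\text{offset folded into the scaled bias}}\Bigg].
\end{aligned}
```

The identity is exact for integers, so under this reading the recoding itself introduces no error, and any accuracy loss would have to come from the quantization of inputs and weights. Negating every $d_{k,j}$ negates the LUT entry, so only $2^{K-1}$ of the $2^{K}$ sign patterns need to be stored.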



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear
    Passage: "Leveraging symmetries in Schemes A and B, we introduce four look-up table (LUT) techniques -- parallel, shared, split, and hybrid... The shift–accumulate operation is modified by incorporating the offset term with the pre-scaled bias."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 3 internal anchors

  [1] X. Zhao, L. Wang, Y. Zhang, X. Han, M. Deveci, and M. Parmar, “A review of convolutional neural networks in computer vision,” Artificial Intelligence Review, vol. 57, no. 4, p. 99, 2024.
  [2] R. Khanam, M. Hussain, R. Hill, and P. Allen, “A comprehensive review of convolutional neural networks for defect detection in industrial applications,” IEEE Access, 2024.
  [3] D. Kim, S. Jeong, and J.-Y. Kim, “Agamotto: A performance optimization framework for CNN accelerator with row stationary dataflow,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 6, pp. 2487–2496, 2023.
  [4] H. Hong, D. Choi, N. Kim, and H. Kim, “Mobile-x: dedicated FPGA implementation of the MobileNet accelerator optimizing depthwise separable convolution,” IEEE Transactions on Circuits and Systems II: Express Briefs, 2024.
  [5] Z. Zhao, Y. Chen, P. Feng, J. Li, G. Chen, R. Shen, and H. Lu, “A high-throughput FPGA accelerator for lightweight CNNs with balanced dataflow,” IEEE Transactions on Circuits and Systems I: Regular Papers, 2025.
  [6] Y. Hu, Y. Liu, and Z. Liu, “A survey on convolutional neural network accelerators: GPU, FPGA and ASIC,” in 2022 14th International Conference on Computer Research and Development (ICCRD). IEEE, 2022, pp. 100–107.
  [7] J. Fornt, P. Fontova-Musté, M. Caro, J. Abella, F. Moll, J. Altet, and C. Studer, “An energy-efficient GeMM-based convolution accelerator with on-the-fly im2col,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 11, pp. 1874–1878, 2023.
  [8] C. Guo, F. Xue, J. Leng, Y. Qiu, Y. Guan, W. Cui, Q. Chen, and M. Guo, “Accelerating sparse DNNs based on tiled GEMM,” IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275–1289, 2024.
  [9] S. Winograd, Arithmetic Complexity of Computations, ser. CBMS–NSF Regional Conference Series in Applied Mathematics. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics (SIAM), 1980, vol. 33.
  [10] B. Wu, T. Yu, K. Chen, and W. Liu, “Edge-side fine-grained sparse CNN accelerator with efficient dynamic pruning scheme,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 71, no. 3, pp. 1285–1298, 2024.
  [11] B. Li, H. Wang, X. Zhang, J. Ren, L. Liu, H. Sun, and N. Zheng, “Dynamic dataflow scheduling and computation mapping techniques for efficient depthwise separable convolution acceleration,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 8, pp. 3279–3292, 2021.
  [12] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
  [13] W. Jiang, H. Yu, and Y. Ha, “A high-throughput full-dataflow MobileNetV2 accelerator on edge FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 42, no. 5, pp. 1532–1545, 2022.
  [14] J. Knapheide, B. Stabernack, and M. Kuhnke, “A high throughput MobileNetV2 FPGA implementation based on a flexible architecture for depthwise separable convolution,” in 2020 30th International Conference on Field-Programmable Logic and Applications (FPL). IEEE, 2020, pp. 277–283.
  [15] Y. Huang, G. Fan, J. Mai, W. Jiang, J. Hu, and E. Yao, “A post-quantum encryption mechanism based on convolutional neural network accelerator,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 71, no. 8, pp. 3945–3949, 2024.
  [16] C. Yang, Y. Meng, K. Huo, J. Xi, and K. Mei, “A sparse CNN accelerator for eliminating redundant computations in intra- and inter-convolutional/pooling layers,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 30, no. 12, pp. 1902–1915, 2022.
  [17] D. Rongshi and T. Yongming, “Accelerator implementation of LeNet-5 convolution neural network based on FPGA with HLS,” in 2019 3rd International Conference on Circuits, System and Simulation (ICCSS). IEEE, 2019, pp. 64–67.
  [18] M. Kayed, A. Anter, and H. Mohamed, “Classification of garments from fashion MNIST dataset using CNN LeNet-5 architecture,” in 2020 International Conference on Innovative Trends in Communication and Computer Engineering (ITCE). IEEE, 2020, pp. 238–243.
  [19] A. Anderson, A. Vasudevan, C. Keane, and D. Gregg, “High-performance low-memory lowering: GEMM-based algorithms for DNN convolution,” in 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2020, pp. 99–106.
  [20] O. I. Berngardt, “Improving classification neural networks by using absolute activation function (MNIST/LeNET-5 example),” arXiv preprint arXiv:2304.11758, 2023.
  [21] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
  [22] S. A. White, “Applications of distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Magazine, vol. 6, no. 3, pp. 4–19, 2002.
  [23] M. T. Khan, M. A. Alhartomi, S. Alzahrani, R. A. Shaik, and R. Alsulami, “Two distributed arithmetic based high throughput architectures of non-pipelined LMS adaptive filters,” IEEE Access, vol. 10, pp. 76693–76706, 2022.
  [24] M. T. Khan and R. A. Shaik, “High-performance VLSI architecture of DLMS adaptive filter for fast-convergence and low-MSE,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 69, no. 4, pp. 2106–2110, 2022.
  [25] M. T. Khan and R. A. Shaik, “Optimal complexity architectures for pipelined distributed arithmetic-based LMS adaptive filter,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 66, no. 2, pp. 630–642, 2018.
  [26] K. P. Yalamarthy, S. Dhall, M. T. Khan, and R. A. Shaik, “Low-complexity distributed-arithmetic-based pipelined architecture for an LSTM network,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 2, pp. 329–338, 2019.
  [27] M. T. Khan, H. E. Yantır, K. N. Salama, and A. M. Eltawil, “Architectural trade-off analysis for accelerating LSTM network using Radix-r OBC scheme,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 70, no. 1, pp. 266–279, 2022.
  [28] M. A. Alhartomi, M. T. Khan, S. Alzahrani, A. Alzahmi, R. A. Shaik, J. Hazarika, R. Alsulami, A. Alotaibi, and M. Al-Harthi, “Low-area and low-power VLSI architectures for long short-term memory networks,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 4, pp. 1000–1014, 2023.
  [29] M. T. Khan and M. A. Alhartomi, “Digit-serial DA-based fixed-point RNNs: A unified approach for enhancing architectural efficiency,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
  [30] M. Panwar, J. Padmini, A. Acharyya, D. Biswas et al., “Modified distributed arithmetic based low complexity CNN architecture design methodology,” in 2017 European Conference on Circuit Theory and Design (ECCTD). IEEE, 2017, pp. 1–4.
  [31] J. Chen, W. Zhao, and Y. Ha, “Area-efficient distributed arithmetic optimization via heuristic decomposition and in-memroy computing,” in 2019 IEEE 13th International Conference on ASIC (ASICON). IEEE, 2019, pp. 1–4.
  [32] C. Chen, V. Romashchenko, M. Brutscheck, and I. Chmielewski, “Performance analysis and optimization of distributed arithmetic-based convolutional algorithms for FIR filters on FPGA,” in 2023 34th Irish Signals and Systems Conference (ISSC). IEEE, 2023, pp. 1–6.
  [33] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT Press, Cambridge, 2016, vol. 1, no. 2.
  [34] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
  [35] C. Kong and S. Lucey, “Take it in your stride: Do we need striding in CNNs?” arXiv preprint arXiv:1712.02502, 2017.
  [36] W. Liu, S. Fan, A. Khalid, C. Rafferty, and M. O’Neill, “Optimized schoolbook polynomial multiplication for compact lattice-based cryptography on FPGA,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 27, no. 10, pp. 2459–2463, 2019.
  [37] E. Youssef, H. A. Elsimary, M. A. El-Moursy, H. Mostafa, and A. Khattab, “Energy-efficient precision-scaled CNN implementation with dynamic partial reconfiguration,” IEEE Access, vol. 10, pp. 95571–95584, 2022.
  [38] M. Cho and Y. Kim, “FPGA-based convolutional neural network accelerator with resource-optimized approximate multiply-accumulate unit,” Electronics, vol. 10, no. 22, p. 2859, 2021.
  [39] T. Li, B. He, and Y. Zheng, “Research and implementation of high computational power for training and inference of convolutional neural networks,” Applied Sciences, vol. 13, no. 2, p. 1003, 2023.
  [40] M. Ji, Z. Al-Ars, P. Hofstee, Y. Chang, and B. Zhang, “FPQNet: Fully pipelined and quantized CNN for ultra-low latency image classification on FPGAs using OpenCAPI,” Electronics, vol. 12, no. 19, p. 4085, 2023.

Boyang Chen received his B.Eng. degree in Electronics from Heriot-Watt University, UK, and Xidian University, China, in 2025, through a joint u...