pith. machine review for the scientific record. sign in

arxiv: 2605.08231 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI· cs.AR

Recognition: 2 theorem links

· Lean Theorem

TRAM: Training Approximate Multiplier Structures for Low-Power AI Accelerators

Authors on Pith no claims yet

Pith reviewed 2026-05-12 00:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.AR
keywords approximate computingmultipliersAI acceleratorslow-power designjoint optimizationCNNvision transformerspower reduction
0
0 comments X

The pith

TRAM jointly optimizes approximate multiplier structures with AI model parameters to reduce power while limiting accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TRAM as a method that trains approximate multipliers together with the AI model instead of designing multipliers separately first. This co-optimization targets the power-hungry multiplier units inside neural networks to find structures that consume less energy for the specific computations the model performs. Experiments on CNNs with CIFAR-10 show up to 25 percent lower multiplier power than prior approximate multipliers, and similar gains appear for vision transformers on ImageNet. A sympathetic reader would care because multipliers dominate energy use in AI accelerators, and any reduction that keeps accuracy drops small could extend battery life or lower cooling costs in deployed systems. The approach treats the multiplier configuration as an additional trainable element during backpropagation.

Core claim

TRAM performs joint optimization of approximate multiplier structures and model weights during training, using a power estimation model to penalize high-power configurations while the loss function keeps accuracy high. This produces multiplier designs tailored to each layer's data patterns rather than generic approximations chosen in advance.

What carries the argument

The TRAM joint optimization loop, which alternates or combines updates to model parameters and to the selection or bit-width choices inside approximate multipliers using differentiable power proxies.

If this is right

  • AI training pipelines can incorporate multiplier structure search as a standard step without separate hardware design phases.
  • Power reduction scales with model size because the savings apply to every multiplication operation in the network.
  • The same framework can be reused across different model families by simply changing the training dataset and architecture.
  • Designers gain a direct knob to trade accuracy for power by adjusting the weight of the power penalty term.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware-aware training of this form may become routine for any edge device where power is the primary constraint.
  • Extending the method to other approximate units such as adders or activation functions would follow the same joint-optimization pattern.
  • If the power model can be made differentiable at the gate level, the approach could move from simulation to direct silicon optimization.
  • Mobile and IoT applications could see the largest practical impact because the reported percentage savings compound over millions of inferences.

Load-bearing premise

The power estimation model used inside training must closely match the actual power draw of the final hardware multiplier implementation.

What would settle it

Fabricate the TRAM-designed multipliers in silicon or on an FPGA and measure real power under the same workloads; if the measured savings fall below the simulated figures by more than the reported margin, the joint-optimization benefit does not hold.

Figures

Figures reproduced from arXiv: 2605.08231 by Chang Meng, Giovanni De Micheli, Hanyu Wang, Mingfei Yu, Wayne Burleson, Yuyang Ye.

Figure 1
Figure 1. Figure 1: A 4-bit unsigned array multiplier. The red crosses [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: TRAM framework overview. Experimental results show that, compared to state-of-the-art AxM designs, TRAM reduces AxM power by up to 25.05% on CNNs with CIFAR-10 at the same accuracy level, and by 27.09% on vision transformers with ImageNet. Since TRAM allows different structure parameters for different model layers, it naturally supports layer￾wise application of different AxMs. Compared to the state-of-the… view at source ↗
Figure 3
Figure 3. Figure 3: Dataflow for computing the objective function in Eq. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of final accuracy and AxM power con [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Impact of 𝜆 on DenseNet161 accuracy and AxM power under w4a4. Power is normalized to the 4-bit AccMul [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

Reducing power consumption in AI accelerators is increasingly important. Approximate computing can reduce power consumption while keeping the accuracy loss small. Since multipliers are power-hungry components in AI models, this paper focuses on synthesizing low-power approximate multipliers (AxMs). Unlike prior works that design AxMs separately from AI model training, we present TRAM, which jointly optimizes the AxM structure and AI model parameters to lower power with small accuracy loss. Experiments show that compared to state-of-the-art AxMs, TRAM achieves up to 25.05% AxM power reduction on CNNs with CIFAR-10, and reduces power by up to 27.09% on vision transformers with ImageNet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces TRAM, a method that jointly optimizes approximate multiplier (AxM) structures together with AI model parameters during training to reduce power consumption in AI accelerators while incurring only small accuracy loss. Experiments on CNNs with CIFAR-10 and vision transformers with ImageNet report up to 25.05% and 27.09% AxM power reductions, respectively, relative to state-of-the-art fixed AxMs.

Significance. If the reported power savings are confirmed by hardware measurements that match the in-training estimator, TRAM would advance approximate computing by moving AxM design from a separate post-training step into the model training loop itself. This integrated approach could yield more effective power-accuracy trade-offs for multiplier-heavy workloads on edge devices.

major comments (3)
  1. [§3.2 (Differentiable Power Model)] §3.2 (Differentiable Power Model): The central empirical claims rest on the assumption that the differentiable power estimator used to guide AxM structure search during joint optimization accurately predicts post-synthesis or FPGA power. No correlation coefficient, scatter plot, or error metric between the estimator and actual gate-level power on the final TRAM structures is reported; because the optimization directly trades accuracy against this estimated objective, any systematic mismatch directly undermines the 25%+ reduction figures.
  2. [§5 (Experimental Evaluation)] §5 (Experimental Evaluation): The reported power reductions (25.05% on CIFAR-10 CNNs, 27.09% on ImageNet ViTs) are presented without details on power measurement methodology (synthesis tool, operating conditions), exact baseline AxM implementations and their configurations, number of independent runs, error bars, or statistical significance tests. These omissions are load-bearing for assessing whether the gains exceed those of fixed SOTA AxMs under comparable conditions.
  3. [§4.1 (Joint Optimization)] §4.1 (Joint Optimization): The joint training procedure introduces additional structural parameters for the AxM and an extra loss term for estimated power. No ablation or analysis is provided on training stability, sensitivity to the power-accuracy trade-off hyperparameter, or the increase in wall-clock training time relative to standard fine-tuning, which is necessary to evaluate the practicality of the claimed co-optimization.
minor comments (3)
  1. [Abstract] Abstract: The phrase 'state-of-the-art AxMs' is used without naming the specific prior designs; adding a short parenthetical or reference to the compared methods would improve immediate clarity.
  2. [§2 (Related Work)] §2 (Related Work): Several recent hardware-aware approximate multiplier papers that include FPGA power measurements are not cited; including them would better situate the contribution.
  3. [Notation] Notation: The symbols for AxM bit-width parameters, error metrics, and power terms are introduced piecemeal; a consolidated notation table would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [§3.2 (Differentiable Power Model)] §3.2 (Differentiable Power Model): The central empirical claims rest on the assumption that the differentiable power estimator used to guide AxM structure search during joint optimization accurately predicts post-synthesis or FPGA power. No correlation coefficient, scatter plot, or error metric between the estimator and actual gate-level power on the final TRAM structures is reported; because the optimization directly trades accuracy against this estimated objective, any systematic mismatch directly undermines the 25%+ reduction figures.

    Authors: We agree that empirical validation of the differentiable power model is essential to support our claims. In the revised manuscript, we will add a new subsection or appendix providing a correlation analysis between the estimator and post-synthesis power measurements for the TRAM-optimized multipliers. This will include scatter plots, Pearson correlation coefficients, and mean absolute error metrics to quantify the estimator's accuracy. revision: yes

  2. Referee: [§5 (Experimental Evaluation)] §5 (Experimental Evaluation): The reported power reductions (25.05% on CIFAR-10 CNNs, 27.09% on ImageNet ViTs) are presented without details on power measurement methodology (synthesis tool, operating conditions), exact baseline AxM implementations and their configurations, number of independent runs, error bars, or statistical significance tests. These omissions are load-bearing for assessing whether the gains exceed those of fixed SOTA AxMs under comparable conditions.

    Authors: We acknowledge the need for greater transparency in the experimental setup. We will revise Section 5 to include detailed descriptions of the power measurement methodology, including the synthesis tool and operating conditions. We will specify the exact configurations of the baseline approximate multipliers from prior works. Additionally, we will report results from multiple independent runs with error bars and include statistical significance tests to confirm that the improvements are significant. revision: yes

  3. Referee: [§4.1 (Joint Optimization)] §4.1 (Joint Optimization): The joint training procedure introduces additional structural parameters for the AxM and an extra loss term for estimated power. No ablation or analysis is provided on training stability, sensitivity to the power-accuracy trade-off hyperparameter, or the increase in wall-clock training time relative to standard fine-tuning, which is necessary to evaluate the practicality of the claimed co-optimization.

    Authors: We will enhance the discussion in Section 4.1 by adding an ablation study on the sensitivity of results to the trade-off hyperparameter λ, showing performance across a range of values. We will also provide analysis on training stability and report the observed increase in wall-clock training time relative to standard fine-tuning. These additions will better demonstrate the practicality of TRAM. revision: yes

Circularity Check

0 steps flagged

No significant circularity in TRAM's joint optimization and empirical results

full rationale

The paper presents TRAM as a joint optimization framework for approximate multiplier (AxM) structures and AI model parameters, with power savings reported as experimental outcomes on public datasets (CIFAR-10 for CNNs, ImageNet for vision transformers). No derivation chain reduces a claimed prediction or first-principles result to its own inputs by construction. Power estimation occurs during training as part of the optimization objective, but the final claims are empirical comparisons to SOTA AxMs rather than self-referential identities or fitted quantities renamed as predictions. Self-citations, if present, are not load-bearing for uniqueness or ansatz adoption in a way that creates circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no specific free parameters, axioms, or invented entities can be identified; the method appears to rely on standard optimization and power modeling techniques from prior literature.

pith-pipeline@v0.9.0 · 5429 in / 997 out tokens · 36967 ms · 2026-05-12T00:52:15.469923+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. [1]

    Green AI.Commu- nications of the ACM, 63(12):54–63, 2020

    Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI.Commu- nications of the ACM, 63(12):54–63, 2020

  2. [2]

    Approximate computing survey, part II: Application-specific & architectural approximation techniques and applications.ACM Computing Surveys, 57(7):1–36, 2025

    Vasileios Leon, Muhammad Abdullah Hanif, Giorgos Armeniakos, Xun Jiao, Muhammad Shafique, Kiamal Pekmestzi, and Dimitrios Soudris. Approximate computing survey, part II: Application-specific & architectural approximation techniques and applications.ACM Computing Surveys, 57(7):1–36, 2025

  3. [3]

    Hardware approximate techniques for deep neural network accelerators: A survey

    Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, and Jörg Henkel. Hardware approximate techniques for deep neural network accelerators: A survey. ACM Computing Surveys (CSUR), 55(4):1–36, 2022

  4. [4]

    A survey on approximate multiplier designs for energy efficiency: From algorithms to circuits.ACM Transactions on Design Automation of Electronic Systems (TODAES), 29(1):1–37, 2024

    Ying Wu, Chuangtao Chen, Weihua Xiao, Xuan Wang, Chenyi Wen, Jie Han, Xun- zhao Yin, Weikang Qian, and Cheng Zhuo. A survey on approximate multiplier designs for energy efficiency: From algorithms to circuits.ACM Transactions on Design Automation of Electronic Systems (TODAES), 29(1):1–37, 2024

  5. [5]

    EvoAp- prox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods

    Vojtech Mrazek, Radek Hrbacek, Zdenek Vasicek, and Lukas Sekanina. EvoAp- prox8b: Library of approximate adders and multipliers for circuit design and benchmarking of approximation methods. InDesign, Automation & Test in Europe Conference & Exhibition (DATE), pages 258–261, 2017

  6. [6]

    Libraries of approximate circuits: Automated design and application in CNN accelerators.IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 10(4):406–418, 2020

    Vojtech Mrazek, Lukas Sekanina, and Zdenek Vasicek. Libraries of approximate circuits: Automated design and application in CNN accelerators.IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 10(4):406–418, 2020

  7. [7]

    OPACT: Optimization of approxi- mate compressor tree for approximate multiplier

    Weihua Xiao, Cheng Zhuo, and Weikang Qian. OPACT: Optimization of approxi- mate compressor tree for approximate multiplier. InDesign, Automation & Test in Europe Conference & Exhibition (DATE), pages 178–183, 2022

  8. [8]

    A configurable approximate multiplier for CNNs using partial product speculation

    Xiaolu Hu, Ao Liu, Xinkuang Geng, Zizhong Wei, Kai Jiang, and Honglan Jiang. A configurable approximate multiplier for CNNs using partial product speculation. InDesign, Automation & Test in Europe Conference & Exhibition (DATE), pages 1–6, 2024

  9. [9]

    DASALS: Differentiable architecture search-driven approximate logic synthesis

    Xuan Wang, Zheyu Yan, Chang Meng, Yiyu Shi, and Weikang Qian. DASALS: Differentiable architecture search-driven approximate logic synthesis. InInter- national Conference on Computer Aided Design (ICCAD), pages 1–9, 2023

  10. [10]

    Approximate logic synthesis using Boolean matrix factorization.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 41(1):15–28, 2021

    Jingxiao Ma, Soheil Hashemi, and Sherief Reda. Approximate logic synthesis using Boolean matrix factorization.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 41(1):15–28, 2021

  11. [11]

    Simulation-guided ap- proximate logic synthesis under the maximum error constraint.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2026

    Chang Meng, Weikang Qian, and Giovanni De Micheli. Simulation-guided ap- proximate logic synthesis under the maximum error constraint.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2026

  12. [12]

    Exact neural networks from inexact multi- pliers via Fibonacci weight encoding

    William Andrew Simon, Valérian Ray, Alexandre Levisse, Giovanni Ansaloni, Marina Zapater, and David Atienza. Exact neural networks from inexact multi- pliers via Fibonacci weight encoding. InDesign Automation Conference (DAC), pages 805–810, 2021

  13. [13]

    Learning to design accurate deep learning accelerators with inaccu- rate multipliers

    Paras Jain, Safeen Huda, Martin Maas, Joseph E Gonzalez, Ion Stoical, and Azalia Mirhoseini. Learning to design accurate deep learning accelerators with inaccu- rate multipliers. InDesign, Automation & Test in Europe Conference & Exhibition (DATE), pages 184–189, 2022

  14. [14]

    Gradient approximation of approximate multipliers for high-accuracy deep neural network retraining

    Chang Meng, Wayne Burleson, Weikang Qian, and Giovanni De Micheli. Gradient approximation of approximate multipliers for high-accuracy deep neural network retraining. InDesign, Automation & Test in Europe Conference & Exhibition (DATE), pages 1–7, 2025

  15. [15]

    HEAM: High-efficiency approximate multiplier optimization for deep neural networks

    Su Zheng, Zhen Li, Yao Lu, Jingbo Gao, Jide Zhang, and Lingli Wang. HEAM: High-efficiency approximate multiplier optimization for deep neural networks. In IEEE International Symposium on Circuits and Systems (ISCAS), pages 3359–3363, 2022

  16. [16]

    Approximate arithmetic circuits: A survey, characterization, and recent applications.Proceedings of the IEEE, 108(12):2108–2135, 2020

    Honglan Jiang, Francisco Javier Hernandez Santiago, Hai Mo, Leibo Liu, and Jie Han. Approximate arithmetic circuits: A survey, characterization, and recent applications.Proceedings of the IEEE, 108(12):2108–2135, 2020

  17. [17]

    Quantization and training of neural networks for efficient integer-arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, An- drew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704– 2713, 2018

  18. [18]

    OmniQuant: Omnidirectionally calibrated quantization for large language models.International Conference on Learning Representations (ICLR), pages 1–25, 2024

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantization for large language models.International Conference on Learning Representations (ICLR), pages 1–25, 2024

  19. [19]

    ALSRAC: Approximate logic synthesis by resubstitution with approximate care set

    Chang Meng, Weikang Qian, and Alan Mishchenko. ALSRAC: Approximate logic synthesis by resubstitution with approximate care set. InDesign Automation Conference (DAC), pages 1–6, 2020

  20. [20]

    HEDALS: Highly efficient delay-driven approximate logic synthesis.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 42(11):3491–3504, 2023

    Chang Meng, Zhuangzhuang Zhou, Yue Yao, Shuyang Huang, Yuhang Chen, and Weikang Qian. HEDALS: Highly efficient delay-driven approximate logic synthesis.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 42(11):3491–3504, 2023

  21. [21]

    BLASYS: Approximate logic synthesis using boolean matrix factorization

    Soheil Hashemi, Hokchhay Tann, and Sherief Reda. BLASYS: Approximate logic synthesis using boolean matrix factorization. InDesign Automation Conference (DAC), pages 1–6, 2018

  22. [22]

    Ef- ficient resubstitution-based approximate logic synthesis.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 44(6):2040–2053, 2025

    Chang Meng, Alan Mishchenko, Weikang Qian, and Giovanni De Micheli. Ef- ficient resubstitution-based approximate logic synthesis.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 44(6):2040–2053, 2025

  23. [23]

    PyTorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, et al. PyTorch: An imperative style, high-performance deep learning library. InInternational Conference on Neural Information Processing Systems (NeurIPS), pages 8026–8037, 2019

  24. [24]

    Learning multiple layers of features from tiny images.Technical Report, University of Toronto, 2009

    Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.Technical Report, University of Toronto, 2009

  25. [25]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009

  26. [26]

    ASAP7: A 7-nm FinFET predictive process design kit.Microelectronics Journal, 53:105–115, 2016

    Lawrence T Clark, Vinay Vashishtha, Lucian Shifren, Aditya Gujja, Saurabh Sinha, Brian Cline, Chandarasekaran Ramamurthy, and Greg Yeric. ASAP7: A 7-nm FinFET predictive process design kit.Microelectronics Journal, 53:105–115, 2016

  27. [27]

    VECSEM: Verifying average errors in approximate circuits using simulation- enhanced model counting

    Chang Meng, Hanyu Wang, Yuqi Mai, Weikang Qian, and Giovanni De Micheli. VECSEM: Verifying average errors in approximate circuits using simulation- enhanced model counting. InDesign, Automation & Test in Europe Conference & Exhibition (DATE), pages 1–6, 2024

  28. [28]

    AdaPT: Fast emulation of approximate DNN accelerators in PyTorch

    Dimitrios Danopoulos, Georgios Zervakis, Kostas Siozios, Dimitrios Soudris, and Jörg Henkel. AdaPT: Fast emulation of approximate DNN accelerators in PyTorch. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 42(6):2074–2078, 2022

  29. [29]

    TransAxx: Efficient transformers with approximate computing.IEEE Transactions on Circuits and Systems for Artificial Intelligence, 2(4):288–301, 2025

    Dimitrios Danopoulos, Georgios Zervakis, Dimitrios Soudris, and Jörg Henkel. TransAxx: Efficient transformers with approximate computing.IEEE Transactions on Circuits and Systems for Artificial Intelligence, 2(4):288–301, 2025

  30. [30]

    Flavia Guella, Emanuele Valpreda, Michele Caon, Guido Masera, and Maurizio Martina. MARLIN: A co-design methodology for approximate reconfigurable inference of neural networks at the edge.IEEE Transactions on Circuits and Systems I: Regular Papers (TCAS-I), 71(5):2105–2118, 2024

  31. [31]

    ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining

    Vojtech Mrazek et al. ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining. InInternational Conference on Computer Aided Design (ICCAD), pages 1–8, 2019