pith. machine review for the scientific record.

arxiv: 2605.04754 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AR

Recognition: 3 theorem links

AxMoE: Characterizing the Impact of Approximate Multipliers on Mixture-of-Experts DNN Architectures

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 17:59 UTC · model grok-4.3

classification 💻 cs.LG cs.AR
keywords approximate computing · mixture of experts · deep neural networks · approximate multipliers · model resilience · retraining · vision transformers · edge inference

The pith

Approximate multipliers impact MoE DNN architectures differently than dense baselines, with retraining enabling recovery in many cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first characterization of the effects of replacing exact multipliers with approximate ones in Mixture-of-Experts (MoE) deep neural networks. It evaluates three MoE variants against dense baselines on convolutional and transformer architectures, using multiple 8-bit approximate multipliers from a library. Key results: without retraining, dense models are more resilient on CNNs; after retraining, ResNet models recover completely while VGG models show limited recovery at high approximation levels, and Hard MoE gains an edge on vision transformers at equivalent inference cost. This interaction matters for designing energy-efficient edge inference systems that combine approximation with conditional computation.
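An approximate multiplier replaces each exact product with a cheaper, error-prone circuit; characterization studies typically emulate this in software with a precomputed product table. The sketch below is a minimal illustration of that emulation, not the paper's pipeline: the bit-truncation multiplier is a hypothetical stand-in for an EvoApproxLib design, and the lookup-table matrix product only mirrors the kind of substitution that emulation frameworks such as TFApprox or AdaPT perform for quantized layers.

    import numpy as np

    def truncated_mul(a, b, drop_bits=3):
        # Hypothetical approximate multiplier: zero the low-order bits of each
        # operand before multiplying (a stand-in for an EvoApproxLib design).
        return ((a >> drop_bits) << drop_bits) * ((b >> drop_bits) << drop_bits)

    # Precompute the full 256 x 256 product table for signed 8-bit operands.
    ops = np.arange(-128, 128, dtype=np.int32)
    A, B = np.meshgrid(ops, ops, indexing="ij")
    LUT = truncated_mul(A, B)                      # LUT[a + 128, b + 128] = approx(a * b)

    def approx_matmul(x_q, w_q, lut):
        # Integer matrix product where every scalar multiply is a table lookup.
        xi = x_q.astype(np.int64) + 128            # shift operands into table indices
        wi = w_q.astype(np.int64) + 128
        return lut[xi[:, :, None], wi[None, :, :]].sum(axis=1)   # (n,k) x (k,m) -> (n,m)

    # Compare against exact int8 arithmetic on random quantized tensors.
    rng = np.random.default_rng(0)
    x_q = rng.integers(-128, 128, size=(4, 64), dtype=np.int8)
    w_q = rng.integers(-128, 128, size=(64, 10), dtype=np.int8)
    err = np.abs(approx_matmul(x_q, w_q, LUT) - x_q.astype(np.int32) @ w_q.astype(np.int32))
    print("mean absolute error introduced by approximation:", err.mean())

A real study would load the chosen EvoApproxLib multiplier's truth table in place of truncated_mul and apply the substitution inside convolution and attention layers, both at inference and during approximate-aware retraining.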

Core claim

Using eight 8-bit signed multipliers (including an exact baseline), the study finds that without retraining the dense baseline is the most resilient topology across the CNN architectures, while all topologies degrade at similar rates on ViT-Small. After approximate-aware retraining, ResNet-20 fully recovers accuracy across the entire multiplier range; VGG architectures recover at moderate approximation levels but suffer irreversible failure at aggressive ones, except for Cluster MoE on VGG11_bn; and Hard MoE outperforms the dense baseline on ViT-Small under aggressive approximation at equal normalized inference cost.

What carries the argument

Mixture-of-Experts routing mechanisms (Hard, Soft, and Cluster variants) applied to DNN layers using approximate multipliers in place of exact arithmetic, tested through accuracy and cost measurements on image classification benchmarks.
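The three routing variants are only named above, so here is a minimal sketch of the two standard gating patterns as a reading aid: hard (top-1) routing processes each token with a single expert, which is the conditional-computation case, while a soft mix runs every expert on every token and weights the outputs by gate probability. This is not the authors' implementation; the published Soft MoE of Puigcerver et al. mixes weighted token slots rather than outputs, and the Cluster variant (cluster-conditional expert assignment) is omitted entirely.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def moe_layer(x, gate_w, expert_ws, mode="hard"):
        # x: (n_tokens, d) activations; gate_w: (d, n_experts) gating weights;
        # expert_ws: list of (d, d) expert weight matrices.
        probs = softmax(x @ gate_w)                        # (n_tokens, n_experts)
        if mode == "hard":
            # Top-1 routing: each token is processed by exactly one expert.
            top1 = probs.argmax(axis=-1)
            out = np.zeros_like(x)
            for e, w in enumerate(expert_ws):
                sel = top1 == e
                if sel.any():
                    out[sel] = x[sel] @ w                  # only routed tokens pay this cost
            return out
        # Soft mixing: all experts run on all tokens, outputs are probability-weighted.
        stacked = np.stack([x @ w for w in expert_ws], axis=-1)   # (n_tokens, d, n_experts)
        return (stacked * probs[:, None, :]).sum(axis=-1)

    rng = np.random.default_rng(1)
    x = rng.standard_normal((8, 16))
    gate_w = rng.standard_normal((16, 4))
    experts = [rng.standard_normal((16, 16)) for _ in range(4)]
    hard_out = moe_layer(x, gate_w, experts, mode="hard")
    soft_out = moe_layer(x, gate_w, experts, mode="soft")

The cost asymmetry is the point: under hard routing each token exercises only one expert's multipliers, so swapping in an aggressive approximate multiplier is amortized differently than in the dense or soft cases.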

If this is right

  • ResNet-20 MoE models can maintain original accuracy with any tested approximate multiplier following retraining.
  • VGG-based MoE requires selection of moderate multipliers or specific topologies like Cluster to avoid unrecoverable accuracy loss.
  • Hard MoE on vision transformers provides superior performance compared to dense models when using aggressive approximate multipliers at matched inference cost.
  • MoE architectures offer topology-specific resilience to approximation that can inform hardware-software co-design for efficient inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hardware designs could integrate approximate multipliers selectively with MoE routing to optimize power without full retraining in some cases.
  • The observed recovery patterns suggest that MoE's conditional computation may inherently buffer against certain approximation errors in expert paths.
  • These results could extend to language processing tasks where MoE models are increasingly used, potentially enabling more efficient large language model inference.

Load-bearing premise

That the observed behaviors in simulation with the chosen multipliers and datasets will match real-world hardware execution and performance on unseen tasks or data distributions.

What would settle it

Implementing the approximate multipliers in actual FPGA or ASIC hardware, deploying the retrained models, and measuring end-to-end accuracy and energy on a new dataset to verify if the recovery and outperformance hold.

Figures

Figures reproduced from arXiv: 2605.04754 by Gayathri Ananthanarayanan, Marcello Traiola, Omkar B Shende.

Figure 1. Accuracy Degradation vs Approximation Aggressiveness. Part (a) of the figure provides the absolute Top-1 accuracy without retraining. view at source ↗
Figure 2. Top-1 Accuracy (after approximate-aware retraining) vs Normalised Power Consumption w.r.t. Dense Exact Multiplier across different … view at source ↗
Original abstract

Deep neural network (DNN) inference at the edge demands simultaneous improvements in accuracy, computational efficiency, and energy consumption. Approximate computing and Mixture-of-Experts (MoE) architectures have each been studied as independent routes towards efficient inference, the former by replacing exact arithmetic with low-power approximate multipliers, the latter by routing inputs through specialized expert sub-networks to enable conditional computation. However, their interaction remains entirely unexplored. This paper presents AxMoE, the first study of the impact of approximate multiplication on MoE DNN architectures. We evaluate three MoE variants: Hard MoE, Soft MoE, and Cluster MoE against dense baselines across three CNN architectures (ResNet-20, VGG11_bn, VGG19_bn) on CIFAR-100 and a Vision Transformer (ViT-Small) on Tiny ImageNet-200 dataset, using eight 8-bit signed multipliers (including one exact baseline) from the EvoApproxLib library. Results show that, without retraining, the Dense baseline is the most resilient topology across all CNN architectures, whereas on ViT-Small, all topologies degrade at comparable rates regardless of routing strategy. After approximate-aware retraining, recovery varies substantially across architectures, topologies, and multipliers. ResNet-20 achieves full recovery across the entire multiplier range, whereas VGG architectures recover at moderate multipliers but fail irreversibly at aggressive ones for all topologies except Cluster MoE on VGG11_bn; on ViT-Small, Hard MoE outperforms Dense under aggressive approximation at equal normalized inference cost. These results pave the way for future approximate MoE hardware-software co-design strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 4 minor

Summary. The manuscript presents AxMoE, the first empirical characterization of the impact of eight 8-bit signed approximate multipliers (from EvoApproxLib, including one exact baseline) on three MoE variants (Hard MoE, Soft MoE, Cluster MoE) versus dense baselines. Experiments cover ResNet-20, VGG11_bn, and VGG19_bn on CIFAR-100 plus ViT-Small on Tiny ImageNet-200. Without retraining, Dense is most resilient for CNNs while all topologies degrade similarly on ViT; after approximate-aware retraining, ResNet-20 shows full recovery across the multiplier range, VGGs recover at moderate but not aggressive multipliers (except Cluster MoE on VGG11_bn), and Hard MoE outperforms Dense on ViT-Small under aggressive approximation at equal normalized inference cost. The work concludes by motivating future approximate MoE hardware-software co-design.

Significance. If the reported observations hold, the study is significant for efficient edge inference research by providing the first direct evidence on how approximate multipliers interact with conditional computation in MoE architectures. The broad coverage across architectures, datasets, and multipliers, together with the explicit before/after retraining comparisons, constitutes a useful empirical baseline. Credit is due for the controlled simulation setup that isolates topology effects and identifies architecture-specific recovery patterns (e.g., ResNet-20 robustness versus VGG fragility).

major comments (2)
  1. [Experimental results (recovery and ViT-Small subsections)] Results sections on recovery (ResNet-20, VGG, ViT-Small): the claims of 'full recovery,' 'fail irreversibly,' and 'outperforms ... at equal normalized inference cost' are presented without reference to error bars, standard deviations across multiple training runs, or statistical tests. Given known sensitivity of DNN training to initialization and hyperparameters, single-run results undermine the definitiveness of these observational claims; please add variance measures or clarify the experimental protocol.
  2. [Methodology and ViT-Small experiments] Methodology and ViT-Small results: the key claim that Hard MoE outperforms Dense 'at equal normalized inference cost' under aggressive approximation requires an explicit definition of the normalization procedure (e.g., how MoE routing overhead, expert activation sparsity, and multiplier energy are folded into the cost metric). Without this, the comparison is difficult to interpret or reproduce.
minor comments (4)
  1. [Abstract] Abstract: the long sentence describing VGG recovery behavior is difficult to parse; consider splitting it or using a table reference for clarity.
  2. [Related work / Methodology] Related work or methodology: the EvoApproxLib library should be cited with its original reference (typically the 2017 or 2018 paper introducing the library) rather than only by name.
  3. [Figures and captions] Figures: ensure legends and captions explicitly label all eight multipliers and distinguish Hard/Soft/Cluster MoE curves; current descriptions risk ambiguity when comparing topologies.
  4. [Conclusion] Discussion or conclusion: add a short limitations paragraph addressing the simulation-only nature of the results and the restriction to 8-bit signed multipliers and the chosen datasets.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the recommendation of minor revision. The two major comments highlight important aspects of experimental rigor and reproducibility; we address each below with concrete plans for revision.

point-by-point responses
  1. Referee: [Experimental results (recovery and ViT-Small subsections)] Results sections on recovery (ResNet-20, VGG, ViT-Small): the claims of 'full recovery,' 'fail irreversibly,' and 'outperforms ... at equal normalized inference cost' are presented without reference to error bars, standard deviations across multiple training runs, or statistical tests. Given known sensitivity of DNN training to initialization and hyperparameters, single-run results undermine the definitiveness of these observational claims; please add variance measures or clarify the experimental protocol.

    Authors: We acknowledge that single-run results limit the statistical strength of the claims. All experiments were performed with fixed random seeds (detailed in the supplementary material) to ensure reproducibility across the large configuration space (8 multipliers, 4 topologies, multiple architectures and training modes). In the revised manuscript we will (1) explicitly state the fixed-seed protocol in Section 4, (2) add a limitations paragraph noting that results represent single-run observations and that training variability could affect absolute numbers, and (3) emphasize that the reported qualitative trends (full recovery for ResNet-20, irreversible failure for aggressive VGG cases except Cluster MoE, and Hard MoE advantage on ViT) remain consistent across the entire experimental sweep. We cannot add error bars or statistical tests without repeating the full set of trainings, which exceeds the scope of a minor revision. revision: partial

  2. Referee: [Methodology and ViT-Small experiments] Methodology and ViT-Small results: the key claim that Hard MoE outperforms Dense 'at equal normalized inference cost' under aggressive approximation requires an explicit definition of the normalization procedure (e.g., how MoE routing overhead, expert activation sparsity, and multiplier energy are folded into the cost metric). Without this, the comparison is difficult to interpret or reproduce.

    Authors: We agree that the cost-normalization procedure must be stated more explicitly. The current manuscript defines normalized inference cost in Section 3.3 as total energy relative to the dense exact baseline, but we will expand this paragraph to detail: (a) multiplier energy is taken from EvoApproxLib characterizations and scaled by the fraction of multiplications performed by each expert; (b) MoE routing overhead is included as a fixed per-token cost for the gating network; (c) expert activation sparsity is accounted for by only summing energy over activated experts (top-1 for Hard MoE, soft weights for Soft MoE, cluster assignment for Cluster MoE). This expanded definition will be cross-referenced in the ViT-Small results subsection. revision: yes
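For readers following the promised definition, here is a hypothetical numeric sketch of that bookkeeping. Every function name and number below is invented for illustration; the expanded Section 3.3 described in the response is the authoritative definition.

    def moe_token_energy(mult_energy, expert_mults, activation, gate_mults, gate_mult_energy):
        # Per-token energy of one MoE layer, following the rebuttal's recipe:
        # (a) multiplier energy scaled by each expert's share of multiplications,
        # (b) a fixed gating-network overhead,
        # (c) only activated experts contribute (activation is 1/0 for Hard top-1,
        #     the gate probabilities for Soft, the cluster assignment for Cluster).
        expert_energy = sum(a * m * mult_energy for a, m in zip(activation, expert_mults))
        return expert_energy + gate_mults * gate_mult_energy

    def normalized_cost(layer_energy, dense_mults, exact_mult_energy):
        # Energy relative to the dense model on the exact multiplier (the 1.0 baseline).
        return layer_energy / (dense_mults * exact_mult_energy)

    # Illustrative numbers only, not taken from the paper.
    exact_e, approx_e = 1.00, 0.45        # relative energy per 8-bit multiply
    dense_mults = 4_000_000               # multiplications per token in the dense layer
    expert_mults = [1_000_000] * 4        # four experts, each a quarter of the dense width
    hard_activation = [1, 0, 0, 0]        # top-1 routing fires one expert per token
    energy = moe_token_energy(approx_e, expert_mults, hard_activation,
                              gate_mults=2_000, gate_mult_energy=exact_e)
    print(normalized_cost(energy, dense_mults, exact_e))   # < 1.0 means cheaper than dense exact

On this reading, comparing Hard MoE and Dense "at equal normalized inference cost" means matching this ratio between the two configurations before comparing their accuracies.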

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a purely empirical characterization study. It evaluates Hard/Soft/Cluster MoE variants against dense baselines using fixed multipliers from the external EvoApproxLib library, performs standard approximate-aware retraining on CIFAR-100 and Tiny ImageNet-200, and reports direct observational results on accuracy recovery and normalized inference cost. There are no derivations, equations, fitted parameters presented as predictions, self-citation chains, or ansatzes that reduce the central claims to inputs by construction. All load-bearing statements are experimental observations within the tested setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No new mathematical axioms or invented entities; the study rests on standard supervised image-classification assumptions and the representativeness of the chosen library and datasets.

axioms (1)
  • domain assumption: Standard DNN training and evaluation protocols on image datasets produce representative accuracy numbers.
    All reported accuracy figures assume typical supervised learning setups hold for the chosen models and datasets.

pith-pipeline@v0.9.0 · 5610 in / 1329 out tokens · 60306 ms · 2026-05-08T17:59:39.638599+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

     Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications,

     H. Jiang, F. J. H. Santiago, H. Mo, L. Liu, and J. Han, “Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications,” Proceedings of the IEEE, vol. 108, no. 12, pp. 2108–2135, Dec. 2020

  2. [2]

     [Online]. Available: https://ieeexplore.ieee.org/document/9165786

  3. [3]

     Hierarchical mixtures of experts and the EM algorithm,

     M. Jordan and R. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” in Proceedings of 1993 International Conference on Neural Networks (IJCNN-93-Nagoya, Japan), vol. 2, 1993, pp. 1339–1344

  4. [4]

     Toward Approximate Computing for Deep Learning in Embedded Systems: A Systematic Literature Review,

     Y. Rasheed, M. W. Anwar, G. A. Gillani, and M. Ottavi, “Toward Approximate Computing for Deep Learning in Embedded Systems: A Systematic Literature Review,” IEEE Access, vol. 13, pp. 210863–210891, 2025. [Online]. Available: https://ieeexplore.ieee.org/document/11284884/

  5. [5]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,

     N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. V. Le, G. E. Hinton, and J. Dean, “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in 5th International Conference on Learning Representations, ICLR 2017, 2017

  6. [6]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

     D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, “GShard: Scaling giant models with conditional computation and automatic sharding,” 2020. [Online]. Available: https://arxiv.org/abs/2006.16668

  7. [7]

    Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,

     W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res., vol. 23, no. 1, Jan. 2022

  8. [8]

    From sparse to soft mixtures of experts,

     J. Puigcerver, C. Riquelme, B. Mustafa, and N. Houlsby, “From sparse to soft mixtures of experts,” in The Twelfth International Conference on Learning Representations (ICLR), 2024. [Online]. Available: https://openreview.net/forum?id=jxpsAj7ltE

  9. [9]

     Network of experts for large-scale image categorization,

     K. Ahmed, M. H. Baig, and L. Torresani, “Network of experts for large-scale image categorization,” in Computer Vision – ECCV 2016, 2016, pp. 516–532

  10. [10]

     Mixture of cluster-conditional LoRA experts for vision-language instruction tuning,

     Y. Gou, Z. Liu, K. Chen, L. Hong, H. Xu, A. Li, D.-Y. Yeung, J. T. Kwok, and Y. Zhang, “Mixture of cluster-conditional LoRA experts for vision-language instruction tuning,” arXiv preprint arXiv:2312.12379, 2023

  11. [11]

     EvoApprox8b: library of approximate adders and multipliers for circuit design and benchmarking of approximation methods,

     V. Mrazek, R. Hrbacek, Z. Vasicek, and L. Sekanina, “EvoApprox8b: library of approximate adders and multipliers for circuit design and benchmarking of approximation methods,” in Proceedings of the Conference on Design, Automation & Test in Europe, ser. DATE ’17, 2017, pp. 258–261

  12. [12]

    Axnn: energy-efficient neuromorphic systems using approximate computing,

     S. Venkataramani, A. Ranjan, K. Roy, and A. Raghunathan, “AxNN: energy-efficient neuromorphic systems using approximate computing,” in Proceedings of the 2014 International Symposium on Low Power Electronics and Design, ser. ISLPED ’14, 2014, pp. 27–32. [Online]. Available: https://doi.org/10.1145/2627369.2627613

  13. [13]

    Alwann: Automatic layer-wise approximation of deep neural network accelerators without retraining,

     V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, “ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019, pp. 1–8

  14. [14]

    Tfapprox: towards a fast emulation of dnn approximate hardware accelerators on gpu,

     F. Vaverka, V. Mrazek, Z. Vasicek, and L. Sekanina, “TFApprox: towards a fast emulation of DNN approximate hardware accelerators on GPU,” in Proceedings of the 23rd Conference on Design, Automation and Test in Europe, ser. DATE ’20, 2020, pp. 294–297

  15. [15]

     ApproxTrain: Fast simulation of approximate multipliers for DNN training and inference,

     J. Gong, H. Saadat, H. Gamaarachchi, H. Javaid, X. S. Hu, and S. Parameswaran, “ApproxTrain: Fast simulation of approximate multipliers for DNN training and inference,” Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 42, no. 11, pp. 3505–3518, Nov. 2023. [Online]. Available: https://doi.org/10.1109/TCAD.2023.3253045

  16. [16]

    Adapt: Fast emulation of approximate dnn accelerators in pytorch,

     D. Danopoulos, G. Zervakis, K. Siozios, D. Soudris, and J. Henkel, “AdaPT: Fast emulation of approximate DNN accelerators in PyTorch,” Trans. Comp.-Aided Des. Integ. Cir. Sys., vol. 42, no. 6, pp. 2074–2078, 2023

  17. [17]

     [Online]. Available: https://doi.org/10.1109/TCAD.2022.3212645

  18. [18]

    Transaxx: Efficient transformers with approximate computing,

     D. Danopoulos, G. Zervakis, D. Soudris, and J. Henkel, “TransAxx: Efficient transformers with approximate computing,” IEEE Transactions on Circuits and Systems for Artificial Intelligence, vol. 2, no. 4, pp. 288–301, 2025

     PyTorch CIFAR Models: https://github.com/chenyaofo/pytorch-cifar-models · PyTorch Image Models: https://timm.fast.ai/

  19. [19]

    Learning multiple layers of features from tiny images,

     A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf

  20. [20]

    Tiny imagenet visual recognition challenge,

     Y. Le and X. S. Yang, “Tiny imagenet visual recognition challenge,” 2015