An Energy-Efficient Mixed-Signal Parallel Multiply-Accumulate (MAC) Engine Based on Stochastic Computing
Pith reviewed 2026-05-25 10:18 UTC · model grok-4.3
The pith
A mixed-signal parallel MAC engine based on stochastic computing is reported to reach 5.03 pJ per 26-input MAC operation in 28 nm CMOS simulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper proposes a mixed-signal MAC engine based on stochastic computing (SC) whose parallel architecture addresses the latency problem of SC while keeping the arithmetic in simple, low-cost logic gates. Simulation results give an overall energy consumption of 5.03 pJ per 26-input MAC operation in 28 nm CMOS technology.
What carries the argument
The parallel architecture of the mixed-signal stochastic computing MAC engine: arithmetic is performed with simple logic gates to keep hardware cost low, and the parallel organization is what offsets the latency of stochastic computing.
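To make "arithmetic with simple logic gates" concrete, here is a minimal Python sketch of stochastic multiplication. It assumes unipolar encoding (a value in [0, 1] is the probability that a stream bit is 1), under which one AND gate per bit multiplies two independent streams. The encoding, the stream length, and the names `to_stream` and `sc_multiply` are illustrative assumptions; the summary above does not say which encoding or stream length the paper uses.

```python
import random

def to_stream(p, length, rng):
    """Encode a probability p in [0, 1] as a unipolar stochastic bit-stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(a, b, length=4096, seed=0):
    """Estimate a*b by ANDing two independent bit-streams and averaging."""
    rng = random.Random(seed)          # software stand-in for a hardware random source
    stream_a = to_stream(a, length, rng)
    stream_b = to_stream(b, length, rng)
    product = [x & y for x, y in zip(stream_a, stream_b)]  # one AND gate per bit
    return sum(product) / length       # fraction of ones approximates a*b

if __name__ == "__main__":
    print(f"exact 0.300, stochastic estimate {sc_multiply(0.5, 0.6):.3f}")
```

The estimate only converges as the stream gets longer, which is the latency problem that the parallel architecture is meant to address.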
Load-bearing premise
The simulation results accurately reflect performance in a fabricated chip without extra overheads from real implementation.
What would settle it
Fabricating the circuit in 28nm CMOS and measuring energy above 5.03 pJ per 26-input MAC operation would disprove the efficiency result.
Original abstract
Convolutional neural networks (CNN) have achieved excellent performance on various tasks, but deploying CNN to edge is constrained by the high energy consumption of convolution operation. Stochastic computing (SC) is an attractive paradigm which performs arithmetic operations with simple logic gates and low hardware cost. This paper presents an energy-efficient mixed-signal multiply-accumulate (MAC) engine based on SC. A parallel architecture is adopted in this work to solve the latency problem of SC. The simulation results show that the overall energy consumption of our design is 5.03pJ per 26-input MAC operation under 28nm CMOS technology.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a mixed-signal parallel MAC engine based on stochastic computing for energy-efficient CNN convolution. A parallel architecture is used to mitigate the latency of stochastic bit-stream processing, and simulation results in 28 nm CMOS are reported to give 5.03 pJ per 26-input MAC operation.
Significance. If the energy figure is confirmed beyond simulation, the work could provide a useful data point for low-power edge inference by exploiting SC's simple logic together with analog summation. The parallel replication approach directly targets a known drawback of SC. No machine-checked proofs, open code, or parameter-free derivations are present.
major comments (3)
- [Abstract / Results] Abstract and results section: the headline claim of 5.03 pJ per 26-input MAC operation is stated without any quantitative comparison to a conventional digital MAC, a prior SC design, or a synthesized RTL baseline in the same 28 nm node, so the asserted energy-efficiency advantage cannot be evaluated from the given data.
- [Simulation / Results] Simulation methodology: the energy number is obtained from simulation (behavioral or post-layout unspecified) with no reported inclusion of interconnect parasitics, device mismatch, clock skew, or supply noise; in 28 nm these first-order effects directly affect both stochastic stream generation and analog summation and could alter the reported figure by 2-3x.
- [Architecture] Architecture section: the parallel replication is presented as solving the latency problem, yet no breakdown quantifies the added energy or area cost of the extra stochastic units, routing, and synchronization logic relative to a serial SC implementation.
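The trade-off raised in the architecture comment can be illustrated with a toy cost model. The sketch below assumes that parallelism means processing several stream bits per clock cycle in replicated lanes, and that each lane adds a fixed energy overhead. Both the model and every number in it are hypothetical placeholders, not values from the paper.

```python
import math

def cycles_per_mac(stream_length, lanes):
    """Clock cycles needed to consume one bit-stream with `lanes` bits per cycle."""
    return math.ceil(stream_length / lanes)

def energy_per_mac_pj(stream_length, lanes, e_bit_pj, e_lane_overhead_pj):
    """Toy model: per-bit energy is unchanged by parallelism; each lane adds fixed overhead."""
    return stream_length * e_bit_pj + lanes * e_lane_overhead_pj

if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the shape of the trade-off.
    STREAM_LENGTH, E_BIT_PJ, E_LANE_PJ = 256, 0.01, 0.05
    for lanes in (1, 4, 16, 64):
        cycles = cycles_per_mac(STREAM_LENGTH, lanes)
        energy = energy_per_mac_pj(STREAM_LENGTH, lanes, E_BIT_PJ, E_LANE_PJ)
        print(f"lanes={lanes:3d}  cycles={cycles:4d}  energy={energy:.2f} pJ")
```

In this model latency falls roughly as 1/lanes while energy rises linearly with the lane count, which is why a measured per-block breakdown is needed to judge the real design.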
minor comments (2)
- [Abstract] Abstract: the 26-input MAC is mentioned without stating the stochastic bit-stream length, the representation range, or any accuracy metric (e.g., mean absolute error) for the MAC result.
- [Results] No error bars, Monte-Carlo runs, or sensitivity analysis to process corners are supplied for the energy or accuracy figures.
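The accuracy metric and Monte-Carlo runs requested in the minor comments could look like the following sketch. It estimates the mean absolute error of a 26-input stochastic MAC for several stream lengths. Unipolar encoding, uniformly random inputs, an ideal (noise-free) summation, and the names `sc_mac` and `mean_abs_error` are all assumptions made for illustration; the analog summation in the actual design would add error sources that are not modeled here.

```python
import random
import statistics

def sc_mac(weights, activations, length, rng):
    """Stochastic MAC: AND each weight/activation bit pair, then count the ones."""
    total_ones = 0
    for w, x in zip(weights, activations):
        for _ in range(length):
            w_bit = rng.random() < w      # unipolar stream bit for the weight
            x_bit = rng.random() < x      # unipolar stream bit for the activation
            total_ones += int(w_bit and x_bit)
    return total_ones / length            # estimates sum(w_i * x_i)

def mean_abs_error(length, trials=50, n_inputs=26, seed=1):
    """Monte-Carlo estimate of the MAC's mean absolute error for a given stream length."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        w = [rng.random() for _ in range(n_inputs)]
        x = [rng.random() for _ in range(n_inputs)]
        exact = sum(wi * xi for wi, xi in zip(w, x))
        errors.append(abs(sc_mac(w, x, length, rng) - exact))
    return statistics.mean(errors)

if __name__ == "__main__":
    for length in (16, 64, 256):
        print(f"stream length {length:4d}: MAE {mean_abs_error(length):.3f}")
```

Reporting a curve of this kind next to the energy figure would show which stream length, and therefore which accuracy, the 5.03 pJ number corresponds to.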
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract / Results] Abstract and results section: the headline claim of 5.03 pJ per 26-input MAC operation is stated without any quantitative comparison to a conventional digital MAC, a prior SC design, or a synthesized RTL baseline in the same 28 nm node, so the asserted energy-efficiency advantage cannot be evaluated from the given data.
Authors: We agree that direct comparisons would aid evaluation of the claimed efficiency. The manuscript focuses on the mixed-signal parallel SC architecture and its simulated energy. In revision we will add a table placing the 5.03 pJ figure in context using published energy numbers for digital MACs and prior SC designs in comparable nodes. revision: partial
- Referee: [Simulation / Results] Simulation methodology: the energy number is obtained from simulation (behavioral or post-layout unspecified) with no reported inclusion of interconnect parasitics, device mismatch, clock skew, or supply noise; in 28 nm these first-order effects directly affect both stochastic stream generation and analog summation and could alter the reported figure by 2-3x.
Authors: The reported energy derives from gate-level simulation with a 28 nm standard-cell library. We will revise the text to explicitly describe the simulation flow and note that parasitics, mismatch, skew, and supply noise are not modeled, which may affect absolute accuracy. This limitation is typical for architectural studies at this stage. revision: yes
- Referee: [Architecture] Architecture section: the parallel replication is presented as solving the latency problem, yet no breakdown quantifies the added energy or area cost of the extra stochastic units, routing, and synchronization logic relative to a serial SC implementation.
Authors: The parallel design replicates stochastic units to shorten bit-stream latency; the quoted 5.03 pJ already accounts for all replicated hardware. We will add a breakdown of energy and area contributions from the stochastic generators, summation network, and synchronization logic, together with a qualitative comparison to the serial case. revision: partial
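Any comparison with published digital MACs or prior SC designs, as discussed in the first response, needs the headline figure normalized first. The sketch below does that arithmetic under two labeled assumptions that the summary does not confirm: that a "26-input MAC operation" comprises 26 multiply-accumulates, and that each multiply-accumulate counts as 2 operations (a common convention). The function names are illustrative.

```python
ENERGY_PJ = 5.03   # reported energy per 26-input MAC operation (from the abstract)
N_INPUTS = 26      # inputs per MAC operation (from the abstract)

def per_input_energy_pj(total_pj=ENERGY_PJ, n_inputs=N_INPUTS):
    """Energy per single multiply-accumulate, assuming n_inputs of them per operation."""
    return total_pj / n_inputs

def tops_per_watt(total_pj=ENERGY_PJ, n_inputs=N_INPUTS, ops_per_mac=2):
    """Throughput efficiency; 1 operation per pJ equals 1 TOPS/W."""
    return n_inputs * ops_per_mac / total_pj

if __name__ == "__main__":
    print(f"{per_input_energy_pj():.3f} pJ per input")   # about 0.193 pJ
    print(f"{tops_per_watt():.1f} TOPS/W")               # about 10.3 TOPS/W
```

These normalized values are only as meaningful as the accuracy and stream length behind them, so a fair table would list those alongside the energy.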
Circularity Check
No derivation chain; simulation result only
Full rationale
The paper proposes a mixed-signal parallel MAC architecture based on stochastic computing and reports a single headline figure (5.03 pJ per 26-input MAC) obtained from 28 nm CMOS simulation. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive this number; the result is produced by direct circuit-level simulation rather than any mathematical reduction that could be circular. The architecture description and latency claim are engineering choices evaluated by simulation, not self-referential definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard 28 nm CMOS technology parameters and device models are accurate for energy estimation.