An Energy-Efficient Mixed-Signal Parallel Multiply-Accumulate (MAC) Engine Based on Stochastic Computing

Jiahao Song; Ru Huang; Runsheng Wang; Xinyue Zhang; Yawen Zhang; Yuan Wang; Zuodong Zhang

arxiv: 1907.01807 · v1 · pith:DPJF64MInew · submitted 2019-07-03 · 📡 eess.SP

Xinyue Zhang, Jiahao Song, Yuan Wang, Yawen Zhang, Zuodong Zhang, Runsheng Wang, Ru Huang

Pith reviewed 2026-05-25 10:18 UTC · model grok-4.3

classification 📡 eess.SP

keywords stochastic computing · MAC engine · mixed-signal · energy efficiency · convolutional neural networks · edge computing · 28nm CMOS · parallel architecture

The pith

A mixed-signal parallel MAC engine based on stochastic computing achieves 5.03 pJ per 26-input operation in 28nm CMOS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces an energy-efficient mixed-signal multiply-accumulate engine that relies on stochastic computing for arithmetic operations using simple logic gates. A parallel architecture is used specifically to overcome the typical latency issues associated with stochastic computing. Simulations demonstrate an energy use of 5.03 pJ for each 26-input MAC operation under 28nm CMOS technology. Such a design targets the high energy demands of convolutional operations in neural networks deployed at the edge.

Core claim

The paper proposes a mixed-signal MAC engine based on stochastic computing. Its parallel architecture addresses the latency problem of stochastic computing while keeping the arithmetic in simple, low-cost logic gates, and simulation gives an overall energy consumption of 5.03 pJ per 26-input MAC operation under 28nm CMOS technology.
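
To put the headline figure on a per-input footing, the arithmetic below normalizes 5.03 pJ over the 26 inputs. It is a rough sketch: the even split across inputs and the convention of counting each input as two operations (one multiply, one add) are assumptions made here, not figures from the paper, and the function names are illustrative.

```python
# Figures quoted in the abstract.
ENERGY_PER_MAC_PJ = 5.03
INPUTS_PER_MAC = 26


def energy_per_input_pj(total_pj=ENERGY_PER_MAC_PJ, inputs=INPUTS_PER_MAC):
    """Average energy per input in pJ, assuming an even split (an assumption)."""
    return total_pj / inputs


def tera_ops_per_joule(total_pj=ENERGY_PER_MAC_PJ, inputs=INPUTS_PER_MAC,
                       ops_per_input=2):
    """Operations per pJ, which is numerically equal to TOPS/W.

    Counting one multiply plus one add per input is a common convention
    and is assumed here; the paper does not state it.
    """
    return inputs * ops_per_input / total_pj


if __name__ == "__main__":
    print(f"{energy_per_input_pj():.3f} pJ per input")
    print(f"{tera_ops_per_joule():.2f} TOPS/W equivalent")
```

Under these assumptions the figure works out to roughly 0.19 pJ per input. Any comparison against other designs would still need the baselines and accuracy data that the abstract does not provide.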

What carries the argument

The parallel architecture of the mixed-signal stochastic computing MAC engine: the arithmetic is done with simple logic gates, and the parallelism is what reduces latency.
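
For readers unfamiliar with stochastic computing, the sketch below shows how a multiply-accumulate can reduce to simple gates. It assumes the common unipolar encoding, in which a value in [0, 1] is the probability of a 1 in a bit stream and multiplication is a bitwise AND of two independent streams. Counting the ones across all products is a software stand-in for the summation stage. The stream length of 1024, the encoding, and all function names are assumptions for illustration; the abstract does not describe the paper's actual circuit.

```python
import random


def to_stream(p, length, rng):
    """Encode a value p in [0, 1] as a unipolar stochastic bit stream."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def sc_multiply(a_stream, b_stream):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_stream, b_stream)]


def sc_mac(xs, ws, length=1024, seed=0):
    """Estimate sum(x * w) by ANDing each stream pair and counting ones."""
    rng = random.Random(seed)
    ones = 0
    for x, w in zip(xs, ws):
        product = sc_multiply(to_stream(x, length, rng),
                              to_stream(w, length, rng))
        ones += sum(product)  # stands in for the accumulation stage
    return ones / length


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.random() for _ in range(26)]  # 26 inputs, as in the paper
    ws = [rng.random() for _ in range(26)]
    exact = sum(x * w for x, w in zip(xs, ws))
    print("exact:", exact, "stochastic estimate:", sc_mac(xs, ws))
```

In a serial design each stream bit would occupy one clock cycle, which is where the latency of stochastic computing comes from; evaluating many bit positions at once is the general idea behind a parallel arrangement.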

Load-bearing premise

The simulation results accurately reflect performance in a fabricated chip without extra overheads from real implementation.

What would settle it

Fabricating the circuit in 28nm CMOS and measuring energy above 5.03 pJ per 26-input MAC operation would disprove the efficiency result.

Original abstract

Convolutional neural networks (CNN) have achieved excellent performance on various tasks, but deploying CNN to edge is constrained by the high energy consumption of convolution operation. Stochastic computing (SC) is an attractive paradigm which performs arithmetic operations with simple logic gates and low hardware cost. This paper presents an energy-efficient mixed-signal multiply-accumulate (MAC) engine based on SC. A parallel architecture is adopted in this work to solve the latency problem of SC. The simulation results show that the overall energy consumption of our design is 5.03pJ per 26-input MAC operation under 28nm CMOS technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Simulation of a parallel mixed-signal stochastic MAC gives 5.03 pJ per operation in 28 nm but stays at the abstract level with no silicon data or accuracy baselines.

The paper builds a mixed-signal MAC that uses stochastic computing with a parallel architecture to cut the usual SC latency while targeting low energy for edge CNN convolutions. The headline result is the simulated 5.03 pJ per 26-input MAC in 28 nm CMOS. That number comes from their specific combination of stochastic bit streams, analog summation, and replicated units, which is the concrete design contribution here and the part that is actually new relative to generic SC descriptions. The approach is straightforward: SC keeps the logic simple, and the parallel copies address throughput without changing the core arithmetic style. The simulation is described only as post-layout or behavioral; a post-layout run would at least capture some hardware effects. The energy claim is the main output the authors want readers to take away.

The soft spots are exactly where the stress-test note flags them. Everything rests on simulation: there are no measured results from silicon, no error bars on the energy figure, and no reported accuracy numbers for the MAC outputs. In 28 nm, mismatch, interconnect capacitance, and clock skew are first-order for both the stochastic generators and the analog summer, and the abstract gives no indication that those were quantified or shown to be negligible. The parallel replication adds routing and synchronization, yet the net energy after those costs is not broken out against a digital baseline or prior SC engines. Without those comparisons it is difficult to know whether the 5.03 pJ figure represents a real advantage or just a simulation artifact.

This is the sort of incremental hardware design paper that a low-power circuits group might want to see. It has a specific implementation and a number, so it is worth sending to referees who can ask for accuracy metrics, baseline tables, and any additional post-layout analysis on parasitics. I would not cite it yet on the strength of the abstract alone, but it clears the bar for peer review.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a mixed-signal parallel MAC engine based on stochastic computing for energy-efficient CNN convolution. A parallel architecture is used to mitigate the latency of stochastic bit-stream processing, and simulation results in 28 nm CMOS report 5.03 pJ per 26-input MAC operation.

Significance. If the energy figure is confirmed beyond simulation, the work could provide a useful data point for low-power edge inference by exploiting SC's simple logic together with analog summation. The parallel replication approach directly targets a known drawback of SC. No machine-checked proofs, open code, or parameter-free derivations are present.

major comments (3)

[Abstract / Results] Abstract and results section: the headline claim of 5.03 pJ/MAC is stated without any quantitative comparison to a conventional digital MAC, a prior SC design, or a synthesized RTL baseline in the same 28 nm node, so the asserted energy-efficiency advantage cannot be evaluated from the given data.
[Simulation / Results] Simulation methodology: the energy number is obtained from simulation (behavioral or post-layout unspecified) with no reported inclusion of interconnect parasitics, device mismatch, clock skew, or supply noise; in 28 nm these first-order effects directly affect both stochastic stream generation and analog summation and could alter the reported figure by 2-3x.
[Architecture] Architecture section: the parallel replication is presented as solving the latency problem, yet no breakdown quantifies the added energy or area cost of the extra stochastic units, routing, and synchronization logic relative to a serial SC implementation.

minor comments (2)

[Abstract] Abstract: the 26-input MAC is mentioned without stating the stochastic bit-stream length, the representation range, or any accuracy metric (e.g., mean absolute error) for the MAC result.
[Results] No error bars, Monte-Carlo runs, or sensitivity analysis to process corners are supplied for the energy or accuracy figures.
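
The accuracy questions raised in the comments above can be made concrete with a small Monte Carlo experiment. The sketch below estimates the mean absolute error of a single stochastic multiplication as a function of bit-stream length. It assumes unipolar encoding and AND-gate multiplication, which are standard in stochastic computing but not confirmed for this paper; the stream lengths, trial count, and function names are illustrative choices.

```python
import random


def sc_product_estimate(a, b, length, rng):
    """Estimate a * b by ANDing two independent unipolar streams, bit by bit."""
    ones = 0
    for _ in range(length):
        bit_a = rng.random() < a
        bit_b = rng.random() < b
        if bit_a and bit_b:
            ones += 1
    return ones / length


def mean_absolute_error(length, trials=200, seed=0):
    """Average |estimate - exact| over random operand pairs in [0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a, b = rng.random(), rng.random()
        total += abs(sc_product_estimate(a, b, length, rng) - a * b)
    return total / trials


if __name__ == "__main__":
    for length in (16, 64, 256, 1024):
        print(length, mean_absolute_error(length))
```

Because each stream bit is an independent Bernoulli sample, the error is expected to shrink roughly with the square root of the stream length, which is why the stream length matters for judging both the accuracy and the energy figure.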

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and indicate planned revisions to strengthen the manuscript.

Referee: [Abstract / Results] Abstract and results section: the headline claim of 5.03 pJ/MAC is stated without any quantitative comparison to a conventional digital MAC, a prior SC design, or a synthesized RTL baseline in the same 28 nm node, so the asserted energy-efficiency advantage cannot be evaluated from the given data.

Authors: We agree that direct comparisons would aid evaluation of the claimed efficiency. The manuscript focuses on the mixed-signal parallel SC architecture and its simulated energy. In revision we will add a table placing the 5.03 pJ figure in context using published energy numbers for digital MACs and prior SC designs in comparable nodes.

revision: partial

Referee: [Simulation / Results] Simulation methodology: the energy number is obtained from simulation (behavioral or post-layout unspecified) with no reported inclusion of interconnect parasitics, device mismatch, clock skew, or supply noise; in 28 nm these first-order effects directly affect both stochastic stream generation and analog summation and could alter the reported figure by 2-3x.

Authors: The reported energy derives from gate-level simulation with a 28 nm standard-cell library. We will revise the text to explicitly describe the simulation flow and note that parasitics, mismatch, skew, and supply noise are not modeled, which may affect absolute accuracy. This limitation is typical for architectural studies at this stage.

revision: yes

Referee: [Architecture] Architecture section: the parallel replication is presented as solving the latency problem, yet no breakdown quantifies the added energy or area cost of the extra stochastic units, routing, and synchronization logic relative to a serial SC implementation.

Authors: The parallel design replicates stochastic units to shorten bit-stream latency; the quoted 5.03 pJ already accounts for all replicated hardware. We will add a breakdown of energy and area contributions from the stochastic generators, summation network, and synchronization logic, together with a qualitative comparison to the serial case.

revision: partial

Circularity Check

0 steps flagged

No derivation chain; simulation result only

The paper proposes a mixed-signal parallel MAC architecture based on stochastic computing and reports a single headline figure (5.03 pJ per 26-input MAC) obtained from 28 nm CMOS simulation. No equations, fitted parameters, uniqueness theorems, or self-citations are used to derive this number; the result is produced by direct circuit-level simulation rather than any mathematical reduction that could be circular. The architecture description and latency claim are engineering choices evaluated by simulation, not self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the energy result rests on standard 28nm CMOS process assumptions and the unstated premise that simulation models match silicon behavior.

axioms (1)

domain assumption: Standard 28nm CMOS technology parameters and device models are accurate for energy estimation.
The reported energy figure is given under this process node.

pith-pipeline@v0.9.0 · 5648 in / 1061 out tokens · 31601 ms · 2026-05-25T10:18:33.089284+00:00 · methodology
