
arxiv: 2511.17265 · v2 · submitted 2025-11-21 · 💻 cs.AR · cs.AI · cs.ET · cs.PF

DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

Pith reviewed 2026-05-17 20:20 UTC · model grok-4.3

classification 💻 cs.AR · cs.AI · cs.ET · cs.PF
keywords digital in-memory computing · stochastic computing · Bent-Pyramid format · energy efficiency · matrix multiplication · AI hardware · CMOS technology · edge computing

The pith

DISCA achieves 3.59 TOPS/W per bit in digital in-memory stochastic computing for matrix multiplications using a compressed Bent-Pyramid format.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DISCA, a digital in-memory stochastic computing architecture built around a compressed version of the Bent-Pyramid data format. It seeks to deliver the simple arithmetic of analog in-memory designs while retaining the scalability, reliability, and ease of use that digital circuits provide. This matters for edge AI applications such as robotics and unmanned vehicles, where large matrix multiplications must fit inside tight power and area limits. The work reports post-layout results showing 3.59 TOPS/W per bit at 500 MHz in 180 nm CMOS and claims this yields orders-of-magnitude energy-efficiency gains over existing architectures when scaled to the same workloads.

Core claim

DISCA is a digital in-memory stochastic computing architecture that utilizes a compressed version of the quasi-stochastic Bent-Pyramid data format. This approach inherits the computational simplicity of analog computing while preserving the scalability, productivity, and reliability of digital systems. Post-layout modeling of DISCA shows an energy efficiency of 3.59 TOPS/W per bit at 500 MHz in a commercial 180 nm CMOS technology. The authors project that, once scaled and compared with counterpart architectures, this amounts to an orders-of-magnitude improvement in energy efficiency for matrix multiplication workloads.
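
As a back-of-envelope check on what the headline figure implies, the sketch below converts a TOPS/W rating into energy per operation. It assumes that 1 TOPS/W corresponds to 1e12 operations per joule and that "per bit" means the figure is normalized to single-bit operations, which the abstract does not spell out. The function names are illustrative, not taken from the paper.

```python
# Back-of-envelope conversion of a TOPS/W rating into energy per operation.
# Assumption: 1 TOPS/W means 1e12 operations per joule.

TOPS_PER_WATT = 3.59   # figure reported in the abstract (per bit)
CLOCK_HZ = 500e6       # clock frequency reported in the abstract


def energy_per_op_joules(tops_per_watt: float) -> float:
    """Return the energy in joules spent on one operation."""
    if tops_per_watt <= 0:
        raise ValueError("tops_per_watt must be positive")
    return 1.0 / (tops_per_watt * 1e12)


def ops_per_cycle_per_watt(tops_per_watt: float, clock_hz: float) -> float:
    """Return how many operations complete per clock cycle for each watt drawn."""
    return tops_per_watt * 1e12 / clock_hz


energy_pj = energy_per_op_joules(TOPS_PER_WATT) * 1e12
print(f"energy per operation: {energy_pj:.3f} pJ")  # about 0.279 pJ
print(f"ops per cycle per watt: {ops_per_cycle_per_watt(TOPS_PER_WATT, CLOCK_HZ):.0f}")  # about 7180
```

Under these assumptions the reported efficiency corresponds to roughly 0.28 pJ per single-bit operation. Any comparison with other designs would still need the same normalization and technology scaling applied to both sides.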

What carries the argument

The compressed Bent-Pyramid format, which supplies a quasi-stochastic data representation that simplifies arithmetic operations inside a fully digital in-memory array.
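
To make "simplifies arithmetic operations" concrete, here is a minimal sketch of conventional unipolar stochastic computing, in which a value in [0, 1] is encoded as the fraction of 1s in a bit-stream and multiplication reduces to a bitwise AND of two independent streams. This is the textbook scheme, not the compressed Bent-Pyramid encoding, whose exact mapping is not given on this page. The names `encode`, `decode`, and `sc_multiply` are illustrative.

```python
import random


def encode(value: float, length: int, rng: random.Random) -> list:
    """Encode a value in [0, 1] as a random bit-stream whose share of 1s approximates it."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    return [1 if rng.random() < value else 0 for _ in range(length)]


def decode(stream: list) -> float:
    """Recover the encoded value as the fraction of 1s in the stream."""
    return sum(stream) / len(stream)


def sc_multiply(a: list, b: list) -> list:
    """Multiply two unipolar streams with one AND gate per bit position."""
    return [x & y for x, y in zip(a, b)]


rng = random.Random(0)  # fixed seed so the run is repeatable
stream_a = encode(0.5, 4096, rng)
stream_b = encode(0.25, 4096, rng)
product = decode(sc_multiply(stream_a, stream_b))
print(f"stochastic estimate of 0.5 * 0.25: {product:.4f} (exact value 0.125)")
```

The estimate is only approximate, and its error shrinks as the streams get longer. That trade-off between stream length and accuracy is what a compressed, quasi-stochastic format would have to manage.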

If this is right

  • Matrix multiplication for AI models can be executed at far lower energy cost than in conventional digital or analog in-memory designs.
  • Edge devices such as robots and surveillance UAVs can support larger models within existing power budgets.
  • Digital implementations can capture analog-like computational simplicity without the usual reliability penalties.
  • Standard commercial CMOS processes can be used to build scalable versions of the architecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the modeled accuracy holds on silicon, the architecture could be combined with existing digital accelerators to reduce overall system power in autonomous systems.
  • Real chip measurements would also reveal whether the compressed format needs extra error-correction circuitry for safety-critical applications.
  • The same format might extend to other linear-algebra kernels beyond matrix multiplication if the stochastic representation remains stable.

Load-bearing premise

Post-layout modeling in 180 nm CMOS accurately predicts the energy efficiency and numerical accuracy that a fabricated chip would achieve on real AI inference tasks.

What would settle it

Fabricate a DISCA test chip in 180 nm CMOS, run matrix-multiplication workloads from actual AI models, and measure the realized energy efficiency together with end-to-end inference accuracy.

Figures

Figures reproduced from arXiv: 2511.17265 by Shady Agwa, Shiwei Wang, Themis Prodromakis, Yikang Shen.

Figure 1: A compressed 8-bit version of Bent-Pyramid datasets, achieving the ...
Figure 2: In-memory computing for matrix-matrix multiplication: (a) an example of matrix-matrix multiplication workload including the micro-algorithm, where ...
Figure 3: An 8CX128R SRAM core slice, including the layout of 8x8 6T bitcells ...
Figure 4: Simulations of eight read operations for different 8-bit values stored ...
Figure 5: Simulation results of SC multiplication using the bitline computing ...
Original abstract

Nowadays, we are witnessing an Artificial Intelligence revolution that dominates the technology landscape in various application domains, such as healthcare, robotics, automotive, security, and defense. Massive-scale AI models, which mimic the human brain's functionality, typically feature millions and even billions of parameters through data-intensive matrix multiplication tasks. While conventional Von-Neumann architectures struggle with the memory wall and the end of Moore's Law, these AI applications are migrating rapidly towards the edge, such as in robotics and unmanned aerial vehicles for surveillance, thereby adding more constraints to the hardware budget of AI architectures at the edge. Although in-memory computing has been proposed as a promising solution for the memory wall, both analog and digital in-memory computing architectures suffer from substantial degradation of the proposed benefits due to various design limitations. We propose a new digital in-memory stochastic computing architecture, DISCA, utilizing a compressed version of the quasi-stochastic Bent-Pyramid data format. DISCA inherits the same computational simplicity of analog computing, while preserving the same scalability, productivity, and reliability of digital systems. Post-layout modeling results of DISCA show an energy efficiency of 3.59 TOPS/W per bit at 500 MHz using a commercial 180 nm CMOS technology. Therefore, DISCA significantly improves the energy efficiency for matrix multiplication workloads by orders of magnitude if scaled and compared to its counterpart architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DISCA, a digital in-memory stochastic computing architecture that uses a compressed Bent-Pyramid format for quasi-stochastic data representation. It targets matrix-multiplication workloads in edge AI applications and reports an energy efficiency of 3.59 TOPS/W per bit at 500 MHz in a commercial 180 nm CMOS process, obtained via post-layout modeling. The authors claim that this yields orders-of-magnitude efficiency gains relative to counterpart in-memory architectures when scaled.

Significance. If the post-layout energy figures prove predictive of silicon behavior and the compressed Bent-Pyramid representation maintains acceptable numerical accuracy for AI inference without prohibitive increases in bit-stream length, the architecture would offer a digitally reliable alternative to analog in-memory computing while retaining computational simplicity. The work addresses the memory wall in edge AI but currently lacks direct empirical support for either the performance prediction or the accuracy claim.

major comments (2)
  1. [Abstract] The central efficiency claim of 3.59 TOPS/W per bit rests exclusively on post-layout modeling results; the manuscript provides no fabricated-silicon measurements, no measured power/accuracy data on real AI workloads, and no error bars, leaving the 'orders of magnitude' improvement claim without direct empirical grounding.
  2. [Abstract] The manuscript does not quantify how compression in the Bent-Pyramid format affects stochastic correlation or required bit-stream length, nor does it compare matrix-multiplication accuracy against fixed-point or other in-memory baselines in the same technology node; without this analysis the claim that accuracy remains sufficient for AI tasks cannot be evaluated.
minor comments (2)
  1. Clarify the exact definition and compression algorithm for the Bent-Pyramid format, including any pseudocode or equations that define the mapping from binary values to stochastic streams.
  2. Provide a table comparing area, power, and latency against at least two published digital and analog in-memory designs in comparable process nodes.
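
Major comment 2 asks how stochastic correlation and bit-stream length affect accuracy. The sketch below shows one way such an analysis could be prototyped for conventional unipolar stochastic streams. It computes the stochastic cross-correlation (SCC) metric commonly used in the stochastic computing literature and the mean multiplication error at several stream lengths. It does not model the compressed Bent-Pyramid format, and all function names are illustrative.

```python
import random


def scc(x: list, y: list) -> float:
    """Stochastic cross-correlation of two equal-length bit-streams, in [-1, 1]."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a & b for a, b in zip(x, y)) / n
    delta = pxy - px * py
    if delta > 0:
        denom = min(px, py) - px * py
    else:
        denom = px * py - max(px + py - 1.0, 0.0)
    return 0.0 if denom == 0 else delta / denom


def mean_abs_error(length: int, trials: int, rng: random.Random) -> float:
    """Average |AND-based product estimate - exact product| over random operand pairs."""
    total = 0.0
    for _ in range(trials):
        a, b = rng.random(), rng.random()
        stream_a = [1 if rng.random() < a else 0 for _ in range(length)]
        stream_b = [1 if rng.random() < b else 0 for _ in range(length)]
        estimate = sum(p & q for p, q in zip(stream_a, stream_b)) / length
        total += abs(estimate - a * b)
    return total / trials


rng = random.Random(1)  # fixed seed so the run is repeatable
for length in (16, 64, 256, 1024):
    print(length, round(mean_abs_error(length, 200, rng), 4))
```

An SCC near 0 indicates the independence that AND-based multiplication relies on, while values near +1 or -1 indicate streams that overlap maximally or minimally. Reporting both SCC and error against stream length for the compressed format would address the comment directly.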

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below and clarify the scope of our post-layout results while strengthening the manuscript with additional analysis where feasible.

point-by-point responses
  1. Referee: [Abstract] The central efficiency claim of 3.59 TOPS/W per bit rests exclusively on post-layout modeling results; the manuscript provides no fabricated-silicon measurements, no measured power/accuracy data on real AI workloads, and no error bars, leaving the 'orders of magnitude' improvement claim without direct empirical grounding.

    Authors: We acknowledge that all reported efficiency numbers derive from post-layout simulations rather than silicon measurements. This approach is standard for architectural proposals prior to tape-out. In the revision we have added error bars obtained from Monte Carlo process-variation simulations and expanded the power-breakdown discussion. We have also revised the abstract to explicitly state that the efficiency and scaling claims rest on post-layout modeling and published comparisons to other 180 nm designs. Direct measured silicon data cannot be supplied at present because the circuit has not been fabricated.

    revision: partial

  2. Referee: [Abstract] The manuscript does not quantify how compression in the Bent-Pyramid format affects stochastic correlation or required bit-stream length, nor does it compare matrix-multiplication accuracy against fixed-point or other in-memory baselines in the same technology node; without this analysis the claim that accuracy remains sufficient for AI tasks cannot be evaluated.

    Authors: We agree that a quantitative treatment of compression effects was missing. The revised manuscript now includes a dedicated subsection that measures the increase in bit-stream length and the change in stochastic correlation caused by the compressed Bent-Pyramid encoding. It also reports matrix-multiplication accuracy for representative edge-AI workloads and compares these results against fixed-point implementations synthesized in the identical 180 nm node, confirming that accuracy remains within acceptable limits for inference.

    revision: yes
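
The accuracy comparison described in this response can be prototyped numerically before any synthesis. The sketch below is a hypothetical setup, not the authors' method: it multiplies two small random matrices with entries in [0, 1] using conventional unipolar stochastic streams and using 8-bit unsigned fixed-point rounding, then compares both with a floating-point reference. It covers numerical error only and says nothing about area, power, or the compressed Bent-Pyramid encoding.

```python
import random


def matmul(a: list, b: list) -> list:
    """Reference floating-point product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def quantize(m: list, bits: int) -> list:
    """Round entries in [0, 1] to an unsigned fixed-point grid of the given width."""
    scale = (1 << bits) - 1
    return [[round(v * scale) / scale for v in row] for row in m]


def sc_matmul(a: list, b: list, length: int, rng: random.Random) -> list:
    """Estimate a @ b with unipolar streams: AND for products, counting for sums."""
    out = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i in range(len(a)):
        for j in range(len(b[0])):
            ones = 0
            for k in range(len(b)):
                for _ in range(length):
                    # One AND of two freshly drawn, independent stream bits.
                    ones += (rng.random() < a[i][k]) & (rng.random() < b[k][j])
            out[i][j] = ones / length
    return out


def max_abs_diff(x: list, y: list) -> float:
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


rng = random.Random(2)  # fixed seed so the run is repeatable
a = [[rng.random() for _ in range(4)] for _ in range(3)]
b = [[rng.random() for _ in range(2)] for _ in range(4)]
exact = matmul(a, b)
fixed_err = max_abs_diff(matmul(quantize(a, 8), quantize(b, 8)), exact)
sc_err = max_abs_diff(sc_matmul(a, b, 1024, rng), exact)
print(f"8-bit fixed-point max error: {fixed_err:.4f}")
print(f"1024-bit stochastic max error: {sc_err:.4f}")
```

A full evaluation would swap the random matrices for layers from real edge-AI models and report end-to-end inference accuracy rather than element-wise error.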

standing simulated objections not resolved
  • Fabricated silicon measurements and measured power/accuracy data on real AI workloads, because no physical prototype has been taped out.

Circularity Check

0 steps flagged

No circularity in derivation chain; efficiency from independent post-layout modeling

full rationale

The paper derives its central energy-efficiency figure directly from post-layout simulation results in a commercial 180 nm CMOS process at 500 MHz. This modeling outcome is a simulation result, not a fitted parameter or self-referential definition. The subsequent claim of orders-of-magnitude improvement upon scaling is an extrapolation based on comparison to external counterpart architectures, not a quantity forced by the paper's own inputs or equations. No self-citations are invoked to justify load-bearing premises, no uniqueness theorems are imported from prior author work, and the compressed Bent-Pyramid format is introduced as a proposed representation without reducing to tautological redefinition. The derivation chain therefore remains self-contained against external benchmarks and does not collapse by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The proposal rests on standard CMOS modeling assumptions and the validity of the new compressed stochastic format; no explicit free parameters or invented physical entities are stated in the abstract.

invented entities (1)
  • Compressed Bent-Pyramid format · no independent evidence
    purpose: Enable efficient stochastic representation for digital in-memory matrix multiplication
    Introduced as the core data format innovation in DISCA.

