DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format
Pith reviewed 2026-05-17 20:20 UTC · model grok-4.3
The pith
DISCA achieves 3.59 TOPS/W per bit in digital in-memory stochastic computing for matrix multiplications using a compressed Bent-Pyramid format.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DISCA is a digital in-memory stochastic computing architecture that uses a compressed version of the quasi-stochastic Bent-Pyramid data format. The approach inherits the computational simplicity of analog computing while preserving the scalability, productivity, and reliability of digital systems. Post-layout modeling of DISCA shows an energy efficiency of 3.59 TOPS/W per bit at 500 MHz in a commercial 180 nm CMOS technology. The authors conclude that, if scaled and compared to counterpart architectures, DISCA improves energy efficiency for matrix multiplication workloads by orders of magnitude.
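To make the headline figure easier to reason about, the sketch below converts an efficiency in TOPS/W into energy per operation and into operations per clock cycle per watt. It relies only on the identity 1 TOPS/W = 10^12 operations per joule. The function names are illustrative, and the paper's "per bit" normalization is not defined on this page, so the number is treated here as a plain TOPS/W value.

```python
def energy_per_op_pj(tops_per_watt: float) -> float:
    """Energy per operation in picojoules for a given efficiency in TOPS/W."""
    ops_per_joule = tops_per_watt * 1e12  # 1 TOPS/W = 1e12 operations per joule
    return 1e12 / ops_per_joule           # joules per operation, expressed in pJ


def ops_per_cycle_per_watt(tops_per_watt: float, clock_hz: float) -> float:
    """Operations completed in one clock cycle for each watt of power drawn."""
    return tops_per_watt * 1e12 / clock_hz


if __name__ == "__main__":
    # Reported values: 3.59 TOPS/W (per bit) at 500 MHz.
    print(f"{energy_per_op_pj(3.59):.4f} pJ per operation")
    print(f"{ops_per_cycle_per_watt(3.59, 500e6):.0f} operations per cycle per watt")
```

Under this reading, the reported figure corresponds to roughly 0.28 pJ per operation.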
What carries the argument
The compressed Bent-Pyramid format, which supplies a quasi-stochastic data representation that simplifies arithmetic operations inside a fully digital in-memory array.
If this is right
- Matrix multiplication for AI models can be executed at far lower energy cost than in conventional digital or analog in-memory designs.
- Edge devices such as robots and surveillance UAVs can support larger models within existing power budgets.
- Digital implementations can capture analog-like computational simplicity without the usual reliability penalties.
- Standard commercial CMOS processes can be used to build scalable versions of the architecture.
Where Pith is reading between the lines
- If the modeled accuracy holds on silicon, the architecture could be combined with existing digital accelerators to reduce overall system power in autonomous systems.
- Real chip measurements would also reveal whether the compressed format needs extra error-correction circuitry for safety-critical applications.
- The same format might extend to other linear-algebra kernels beyond matrix multiplication if the stochastic representation remains stable.
Load-bearing premise
Post-layout modeling in 180 nm CMOS accurately predicts the energy efficiency and numerical accuracy that a fabricated chip would achieve on real AI inference tasks.
What would settle it
Fabricate a DISCA test chip in 180 nm CMOS, run matrix-multiplication workloads from actual AI models, and measure the realized energy efficiency together with end-to-end inference accuracy.
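A measurement of this kind reduces to an operation count, a run time, and an average power reading. The sketch below shows one way to turn those into TOPS/W. It assumes the common convention of counting each multiply and each add as one operation, which the paper may or may not follow, and the function names are hypothetical.

```python
def matmul_op_count(m: int, k: int, n: int) -> int:
    """Operations in an (m x k) by (k x n) product, one per multiply and one per add."""
    return 2 * m * k * n


def measured_tops_per_watt(ops: int, seconds: float, watts: float) -> float:
    """Efficiency in TOPS/W from an operation count, run time, and average power."""
    if seconds <= 0 or watts <= 0:
        raise ValueError("seconds and watts must be positive")
    return ops / (seconds * watts) / 1e12
```

For a fair comparison with the modeled figure, the run time and power would need to cover the same clock frequency and operand width that the post-layout model assumed.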
Original abstract
Nowadays, we are witnessing an Artificial Intelligence revolution that dominates the technology landscape in various application domains, such as healthcare, robotics, automotive, security, and defense. Massive-scale AI models, which mimic the human brain's functionality, typically feature millions and even billions of parameters through data-intensive matrix multiplication tasks. While conventional Von-Neumann architectures struggle with the memory wall and the end of Moore's Law, these AI applications are migrating rapidly towards the edge, such as in robotics and unmanned aerial vehicles for surveillance, thereby adding more constraints to the hardware budget of AI architectures at the edge. Although in-memory computing has been proposed as a promising solution for the memory wall, both analog and digital in-memory computing architectures suffer from substantial degradation of the proposed benefits due to various design limitations. We propose a new digital in-memory stochastic computing architecture, DISCA, utilizing a compressed version of the quasi-stochastic Bent-Pyramid data format. DISCA inherits the same computational simplicity of analog computing, while preserving the same scalability, productivity, and reliability of digital systems. Post-layout modeling results of DISCA show an energy efficiency of 3.59TOPS/W per bit at 500 MHz using a commercial 180 nm CMOS technology. Therefore, DISCA significantly improves the energy efficiency for matrix multiplication workloads by orders of magnitude if scaled and compared to its counterpart architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DISCA, a digital in-memory stochastic computing architecture that uses a compressed Bent-Pyramid format for quasi-stochastic data representation. It targets matrix-multiplication workloads in edge AI applications and reports an energy efficiency of 3.59 TOPS/W per bit at 500 MHz in a commercial 180 nm CMOS process, obtained via post-layout modeling. The authors claim that this yields orders-of-magnitude efficiency gains relative to counterpart in-memory architectures when scaled.
Significance. If the post-layout energy figures prove predictive of silicon behavior and the compressed Bent-Pyramid representation maintains acceptable numerical accuracy for AI inference without prohibitive increases in bit-stream length, the architecture would offer a digitally reliable alternative to analog in-memory computing while retaining computational simplicity. The work addresses the memory wall in edge AI but currently lacks direct empirical support for either the performance prediction or the accuracy claim.
major comments (2)
- [Abstract] The central efficiency claim of 3.59 TOPS/W per bit rests exclusively on post-layout modeling results; the manuscript provides no fabricated-silicon measurements, no measured power or accuracy data on real AI workloads, and no error bars, leaving the 'orders of magnitude' improvement claim without direct empirical grounding.
- [Abstract] The manuscript does not quantify how compression in the Bent-Pyramid format affects stochastic correlation or required bit-stream length, nor does it compare matrix-multiplication accuracy against fixed-point or other in-memory baselines in the same technology node; without this analysis the claim that accuracy remains sufficient for AI tasks cannot be evaluated.
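The concern about stochastic correlation can be illustrated with conventional unipolar stochastic computing, where an AND gate multiplies two bit-streams only when they are statistically independent. The sketch below is not the Bent-Pyramid format, whose encoding is not described on this page. It is a generic example with hypothetical function names, showing that two streams generated from the same random source yield min(a, b) instead of a * b.

```python
import random


def stream(p, thresholds):
    """Unipolar bit-stream: bit i is 1 when threshold i falls below p."""
    return [1 if t < p else 0 for t in thresholds]


def and_product(xs, ys):
    """Fraction of ones after a bitwise AND, read as the estimated product."""
    return sum(a & b for a, b in zip(xs, ys)) / len(xs)


def compare(a, b, length=20000, seed=0):
    """Return (estimate with independent sources, estimate with a shared source)."""
    rng = random.Random(seed)
    t1 = [rng.random() for _ in range(length)]
    t2 = [rng.random() for _ in range(length)]
    independent = and_product(stream(a, t1), stream(b, t2))
    shared = and_product(stream(a, t1), stream(b, t1))  # fully correlated streams
    return independent, shared


if __name__ == "__main__":
    independent, shared = compare(0.5, 0.5)
    print(f"independent sources: {independent:.3f} (exact product is 0.25)")
    print(f"shared source:       {shared:.3f} (collapses toward min(a, b) = 0.5)")
```

Any compression that changes how streams are generated could shift a design between these two regimes, which is why the referee asks for the effect to be quantified.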
minor comments (2)
- Clarify the exact definition and compression algorithm for the Bent-Pyramid format, including any pseudocode or equations that define the mapping from binary values to stochastic streams.
- Provide a table comparing area, power, and latency against at least two published digital and analog in-memory designs in comparable process nodes.
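As a point of reference for the first minor comment, the sketch below shows the kind of binary-to-stream mapping used in conventional stochastic computing: a comparator checks each value of a full-period sequence against the binary input. It is a generic illustration, not the Bent-Pyramid mapping, which this page does not define. The affine counter with `step` and `offset` is a hypothetical stand-in for a hardware sequence generator such as an LFSR.

```python
def sng(value: int, bits: int = 8, step: int = 97, offset: int = 13) -> list:
    """Map a binary value in [0, 2**bits] to a bit-stream of length 2**bits."""
    period = 1 << bits
    if not 0 <= value <= period:
        raise ValueError("value out of range")
    if step % 2 == 0:
        raise ValueError("step must be odd so the sequence visits every value once")
    # (i * step + offset) mod 2**bits is a permutation of 0 .. 2**bits - 1,
    # so exactly `value` of the comparisons below are true.
    return [1 if (i * step + offset) % period < value else 0 for i in range(period)]


def decode(stream: list) -> float:
    """Recover the encoded fraction as the density of ones in the stream."""
    return sum(stream) / len(stream)
```

Because the sequence covers every value exactly once, the round trip is exact for this generic scheme: `decode(sng(64))` is 0.25. Whether the compressed Bent-Pyramid format keeps a comparable guarantee is what the requested pseudocode or equations would establish.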
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and clarify the scope of our post-layout results while strengthening the manuscript with additional analysis where feasible.
- Referee: [Abstract] The central efficiency claim of 3.59 TOPS/W per bit rests exclusively on post-layout modeling results; the manuscript provides no fabricated-silicon measurements, no measured power or accuracy data on real AI workloads, and no error bars, leaving the 'orders of magnitude' improvement claim without direct empirical grounding.
  Authors: We acknowledge that all reported efficiency numbers derive from post-layout simulations rather than silicon measurements. This approach is standard for architectural proposals prior to tape-out. In the revision we have added error bars obtained from Monte Carlo process-variation simulations and expanded the power-breakdown discussion. We have also revised the abstract to explicitly state that the efficiency and scaling claims rest on post-layout modeling and published comparisons to other 180 nm designs. Direct measured silicon data cannot be supplied at present because the circuit has not been fabricated.
  Revision: partial
- Referee: [Abstract] The manuscript does not quantify how compression in the Bent-Pyramid format affects stochastic correlation or required bit-stream length, nor does it compare matrix-multiplication accuracy against fixed-point or other in-memory baselines in the same technology node; without this analysis the claim that accuracy remains sufficient for AI tasks cannot be evaluated.
  Authors: We agree that a quantitative treatment of compression effects was missing. The revised manuscript now includes a dedicated subsection that measures the increase in bit-stream length and the change in stochastic correlation caused by the compressed Bent-Pyramid encoding. It also reports matrix-multiplication accuracy for representative edge-AI workloads and compares these results against fixed-point implementations synthesized in the identical 180 nm node, confirming that accuracy remains within acceptable limits for inference.
  Revision: yes
- Fabricated silicon measurements and measured power/accuracy data on real AI workloads, because no physical prototype has been taped out.
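The accuracy comparison described in the rebuttal can be pictured with a small experiment: compute a vector-matrix product with stochastic bit-streams and measure its error against the exact result for several stream lengths. The sketch below uses conventional unipolar streams, AND-gate multiplication, and a ones counter. It is an assumption-laden stand-in rather than the DISCA datapath or the compressed Bent-Pyramid encoding, and all names and sizes are hypothetical.

```python
import random


def bitstream(p, length, rng):
    """Independent unipolar stream whose density of ones approximates p."""
    return [1 if rng.random() < p else 0 for _ in range(length)]


def stochastic_vmm(x, w, length, rng):
    """Estimate y[j] = sum_i x[i] * w[i][j] with AND gates and a ones counter."""
    xs = [bitstream(v, length, rng) for v in x]
    y = []
    for j in range(len(w[0])):
        ones = 0
        for i, x_bits in enumerate(xs):
            w_bits = bitstream(w[i][j], length, rng)
            ones += sum(a & b for a, b in zip(x_bits, w_bits))
        y.append(ones / length)
    return y


def mean_relative_error(x, w, length, seed=0):
    """Average of |estimate - exact| / exact over all outputs."""
    rng = random.Random(seed)
    exact = [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]
    approx = stochastic_vmm(x, w, length, rng)
    return sum(abs(a - e) / e for a, e in zip(approx, exact)) / len(exact)


def demo(seed=1):
    """Error for a random 16-input, 8-output product at two stream lengths."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(16)]
    w = [[rng.random() for _ in range(8)] for _ in range(16)]
    return {n: mean_relative_error(x, w, n) for n in (64, 4096)}


if __name__ == "__main__":
    for length, err in demo().items():
        print(f"stream length {length}: mean relative error {err:.4f}")
```

For independent random streams the error is expected to shrink roughly with the inverse square root of the stream length, which is the cost that a quasi-stochastic or compressed format aims to reduce.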
Circularity Check
No circularity in derivation chain; efficiency from independent post-layout modeling
The paper derives its central energy-efficiency figure directly from post-layout simulation results in a commercial 180 nm CMOS process at 500 MHz. This modeling outcome is presented as an empirical measurement rather than a fitted parameter or self-referential definition. The subsequent claim of orders-of-magnitude improvement upon scaling is an extrapolation based on comparison to external counterpart architectures, not a quantity forced by the paper's own inputs or equations. No self-citations are invoked to justify load-bearing premises, no uniqueness theorems are imported from prior author work, and the compressed Bent-Pyramid format is introduced as a proposed representation without reducing to tautological redefinition. The derivation chain therefore remains self-contained against external benchmarks and does not collapse by construction.
Axiom & Free-Parameter Ledger
invented entities (1)
- Compressed Bent-Pyramid format: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Post-layout modeling results of DISCA show an energy efficiency of 3.59 TOPS/W per bit at 500 MHz using a commercial 180 nm CMOS technology... compressed version of the quasi-stochastic Bent-Pyramid data format"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "DISCA utilizes a compressed version (8-bit) of the Bent-Pyramid (BP) data format... average error of 2.13% for VMM benchmarking"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.