pith. sign in

arxiv: 2508.08822 · v2 · submitted 2025-08-12 · 💻 cs.AR · cs.AI· cs.ET· cs.PF

OISMA: On-the-fly In-memory Stochastic Multiplication Architecture for Matrix-Multiplication Workloads

Pith reviewed 2026-05-18 23:30 UTC · model grok-4.3

classification 💻 cs.AR cs.AIcs.ETcs.PF
keywords in-memory computingstochastic computingmatrix multiplicationRRAMenergy efficiencyAI acceleratorsquasi-stochastic computing
0
0 comments X

The pith

OISMA turns ordinary memory read operations into in-situ stochastic multiplications that complete matrix multiplications after bitstream accumulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that matrix multiplication workloads for AI can be handled directly inside a memory array by converting standard read operations into stochastic multiplications at almost zero extra cost. It does this through a bent-pyramid quasi-stochastic scheme that keeps the accuracy, scalability, and fabrication simplicity of ordinary digital memories while adding only an accumulation circuit on the periphery. A prototype 4-kB 1T1R array built in 180-nm technology demonstrates the approach, and projections indicate large efficiency gains when moved to 22-nm nodes. If the central claim holds, data movement energy that now dominates AI accelerators would drop sharply because the multiplications occur where the numbers already sit.

Core claim

OISMA converts normal memory read operations into in situ stochastic multiplication operations with a negligible cost. An accumulation periphery then accumulates the output multiplication bitstreams, achieving the matrix multiplication (MatMul) functionality. A 4-kB 1T1R OISMA array was implemented using a commercial 180-nm technology node and in-house resistive random-access memory (RRAM) technology. At 50 MHz, it achieves 0.789 TOPS/W and 3.98 GOPS/mm2 for energy and area efficiency, respectively, occupying an effective computing area of 0.804241 mm2. Scaling OISMA to 22-nm technology shows a significant improvement of two orders of magnitude in energy efficiency and one order of magnitude

What carries the argument

The bent-pyramid quasi-stochastic computing domain that converts standard memory reads into stochastic multiplications inside a 1T1R RRAM array.

If this is right

  • Matrix multiplications complete inside the memory array after ordinary reads produce stochastic bitstreams that an accumulation periphery sums.
  • The 180-nm prototype reaches 0.789 TOPS/W and 3.98 GOPS/mm2 while occupying 0.804 mm2 of effective computing area.
  • Moving the same design to 22-nm technology yields roughly 100 times better energy efficiency and 10 times better area efficiency than dense in-memory matrix-multiplication alternatives.
  • The architecture keeps the fabrication and scaling advantages of conventional digital memories while adding matrix-multiplication capability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Existing memory production lines could add this capability without new process steps or large redesigns.
  • Accuracy on concrete neural-network models can be checked by running the same layer with and without the stochastic conversion.
  • Power savings would be largest in edge devices where matrix multiplications dominate total energy use.

Load-bearing premise

The bent-pyramid quasi-stochastic computing domain preserves sufficient numerical accuracy for AI matrix-multiplication workloads while retaining the scalability and productivity of standard digital memories.

What would settle it

A side-by-side run of a representative neural-network layer in which the bent-pyramid stochastic results deviate from exact floating-point matrix multiplication by more than the error budget the target AI application can tolerate.

Figures

Figures reproduced from arXiv: 2508.08822 by Georgios Papandroulidakis, Shady Agwa, Themis Prodromakis, Yihan Pan.

Figure 1
Figure 1. Figure 1: The AI complexity scaling, primarily driven by LLMs [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: AI performance bottlenecks and OISMA solutions. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Bent-Pyramid quasi-stochastic bitstreams [2], and a [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Data quantization of 8-bit Floating Point (FP8) versus [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a) Data mapping from the ideal 64-bit floating point normalization of FP8 values (0.0 to 1.0) to the valid equivalent [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (a) All possible multiplication results in FP8 and BP10 versus the ideal 64-bit floating point. (b) The multiplication [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Matrix-Multiplication M atMul benchmarking accu￾racy results, comparing FP8 and BP10 to the ideal 64-bit floating point format. The benchmarking covers various matrix dimensions from 4x4 to 512x512. matrix dimensions. Two square input matrices (NxN) are randomly generated, with the ideal M atMul results generated in FP64 format. The two input matrices are mapped to FP8 and BP10 formats. Then, the aforement… view at source ↗
Figure 8
Figure 8. Figure 8: Block diagram of 4KB OISMA array featuring [PITH_FULL_IMAGE:figures/full_fig_p006_8.png] view at source ↗
Figure 10
Figure 10. Figure 10: Block diagram of: (a) parallel counter with 16-bit SC input and 5-bit binary output. (b) 64-bit SC to 7-bit binary [PITH_FULL_IMAGE:figures/full_fig_p007_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: The proposed system-level OISMA architecture, fea [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Dynamic simulation of the OISMA array for the read [PITH_FULL_IMAGE:figures/full_fig_p008_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Chip layout of OISMA memory array fJ/bit, and using a compressed 8-bit BP format results in 2.245 pJ/MAC at 180nm technology. While the area of the accumulation peripherals is 20485.606 µm2 , the OISMA array chip depicted in [PITH_FULL_IMAGE:figures/full_fig_p009_13.png] view at source ↗
read the original abstract

Artificial intelligence (AI) models are currently driven by a significant upscaling of their complexity, with massive matrix-multiplication workloads representing the major computational bottleneck. In-memory computing (IMC) architectures are proposed to avoid the von Neumann bottleneck. However, both digital/binary-based and analog IMC architectures suffer from various limitations, which significantly degrade the performance and energy efficiency gains. This work proposes OISMA, an energy-efficient IMC architecture that utilizes the computational simplicity of a quasi-stochastic computing (SC) domain (bent-pyramid (BP) system) while keeping the same efficiency, scalability, and productivity of digital memories. OISMA converts normal memory read operations into in situ stochastic multiplication operations with a negligible cost. An accumulation periphery then accumulates the output multiplication bitstreams, achieving the matrix multiplication (MatMul) functionality. A 4-kB 1T1R OISMA array was implemented using a commercial 180-nm technology node and in-house resistive random-access memory (RRAM) technology. At 50 MHz, it achieves 0.789 TOPS/W and 3.98 GOPS/mm2 for energy and area efficiency, respectively, occupying an effective computing area of 0.804241 mm2. Scaling OISMA to 22-nm technology shows a significant improvement of two orders of magnitude in energy efficiency and one order of magnitude in area efficiency, compared to dense MatMul IMC architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes OISMA, an in-memory computing architecture that performs stochastic multiplications in situ during normal memory read operations using a bent-pyramid quasi-stochastic computing domain, followed by peripheral accumulation of bitstreams to realize matrix multiplication. A fabricated 4-kB 1T1R array in 180-nm technology is measured at 50 MHz, reporting 0.789 TOPS/W and 3.98 GOPS/mm², with projections for 22-nm scaling showing gains over dense MatMul IMC designs.

Significance. If the bent-pyramid domain maintains adequate numerical fidelity for AI-scale workloads, OISMA could offer a practical bridge between digital memory productivity and in-memory computation, with the physical 1T1R implementation and measured results providing concrete evidence of feasibility. The approach avoids analog non-idealities while retaining scalability, but its significance for AI applications remains conditional on unshown accuracy properties.

major comments (1)
  1. [Abstract] Abstract: The central claim that OISMA achieves MatMul functionality for AI workloads via bent-pyramid quasi-SC relies on the assumption of sufficient numerical accuracy, yet the abstract (and reported results) supply no error metrics, no bitstream length vs. precision trade-offs, no comparison against FP32 reference MatMul on representative matrices, and no end-to-end network accuracy. This omission is load-bearing for the scalability assertion.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that OISMA achieves MatMul functionality for AI workloads via bent-pyramid quasi-SC relies on the assumption of sufficient numerical accuracy, yet the abstract (and reported results) supply no error metrics, no bitstream length vs. precision trade-offs, no comparison against FP32 reference MatMul on representative matrices, and no end-to-end network accuracy. This omission is load-bearing for the scalability assertion.

    Authors: We agree that the abstract, due to its length constraints, does not explicitly include numerical accuracy metrics or direct FP32 comparisons. The manuscript centers on the hardware implementation of in-memory stochastic multiplication via the bent-pyramid quasi-stochastic domain and reports measured efficiency results from the fabricated 4-kB 1T1R array. The full text explains how the bent-pyramid encoding and peripheral bitstream accumulation realize MatMul while preserving digital memory characteristics. To address the concern, we will revise the abstract to include a concise statement on the numerical fidelity of the approach and add a dedicated results subsection with error metrics, bitstream-length versus precision trade-offs, and FP32 reference comparisons on representative matrices. End-to-end accuracy on full AI networks lies outside the scope of this work, which demonstrates the core multiplication architecture and its physical realization rather than complete model evaluation. revision: partial

standing simulated objections not resolved
  • End-to-end network accuracy evaluation on representative AI workloads

Circularity Check

0 steps flagged

No circularity; claims rest on physical implementation and measurement

full rationale

The paper describes a hardware architecture (OISMA) implemented as a 4 kB 1T1R array in 180 nm technology with RRAM, reporting directly measured metrics (0.789 TOPS/W, 3.98 GOPS/mm² at 50 MHz). No equations, fitted parameters, or predictions appear that reduce by construction to inputs; the bent-pyramid quasi-SC domain is presented as an adopted computational domain whose accuracy is asserted for AI workloads but not derived from self-referential steps within the paper. Central claims are externally falsifiable via fabrication and testing rather than any self-citation chain or definitional loop, making the derivation self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that the bent-pyramid system delivers usable accuracy for matrix multiplication and on the specific technology parameters chosen for the prototype.

free parameters (2)
  • Operating frequency
    Set to 50 MHz for the reported measurements and efficiency numbers.
  • Technology node
    Commercial 180-nm process used for the physical implementation.
axioms (1)
  • domain assumption The bent-pyramid quasi-stochastic domain converts memory reads into multiplications with negligible overhead while preserving digital memory scalability.
    Invoked to justify the core conversion step and the claim of maintained efficiency and productivity.
invented entities (1)
  • OISMA architecture no independent evidence
    purpose: To enable on-the-fly stochastic multiplication inside resistive memory arrays.
    Newly proposed design whose performance is demonstrated only within this work.

pith-pipeline@v0.9.0 · 5809 in / 1430 out tokens · 40302 ms · 2026-05-18T23:30:13.038412+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DISCA: A Digital In-memory Stochastic Computing Architecture Using A Compressed Bent-Pyramid Format

    cs.AR 2025-11 unverdicted novelty 6.0

    DISCA achieves 3.59 TOPS/W per bit energy efficiency for matrix multiplication at 500 MHz in 180 nm CMOS using a compressed Bent-Pyramid stochastic format.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Digital in-memory stochastic computing architecture for vector-matrix multiplication,

    S. Agwa and T. Prodromakis, “Digital in-memory stochastic computing architecture for vector-matrix multiplication,” Frontiers in Nanotechnol- ogy, vol. 5, 2023

  2. [2]

    Bent-Pyramid: Towards a quasi-stochastic data representation for ai hardware,

    S. Agwa and T. Prodromakis, “Bent-Pyramid: Towards a quasi-stochastic data representation for ai hardware,” in 2023 21st IEEE Interregional NEWCAS Conference (NEWCAS) , 2023, pp. 1–5

  3. [3]

    Fully hardware-implemented memristor convolutional neural network,

    P. Yao et al. , “Fully hardware-implemented memristor convolutional neural network,” Nature 577, p. 641–646, 2020

  4. [4]

    A compute-in-memory chip based on resistive random- access memory,

    W. Wan et al. , “A compute-in-memory chip based on resistive random- access memory,” Nature 608, p. 504–512, 2022

  5. [5]

    An overview of processing-in-memory circuits for artificial intelligence and machine learning,

    D. Kim et al. , “An overview of processing-in-memory circuits for artificial intelligence and machine learning,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems , vol. 12, pp. 338–353, 2022

  6. [6]

    Challenges hindering memristive neuromorphic hardware from going mainstream,

    G. C. Adam, A. Khiat, and T. Prodromakis, “Challenges hindering memristive neuromorphic hardware from going mainstream,” Nature Communications 9 , 2018

  7. [7]

    A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing,

    Q. Liu et al. , “A fully integrated analog ReRAM based 78.4TOPS/W compute-in-memory chip with fully parallel MAC computing,” in 2020 IEEE International Solid- State Circuits Conference - (ISSCC) , 2020, pp. 500–502

  8. [8]

    Rowclone: Fast and energy-efficient in-dram bulk data copy and initialization,

    V . Seshadri et al. , “Rowclone: Fast and energy-efficient in-dram bulk data copy and initialization,” in International Symposium on Microar- chitecture (MICRO), 2013

  9. [9]

    Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules,

    A. Farmahini-Farahani, K. Ahn, J. H.and Morrow, and N. S. Kim, “Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules,” in International Symposium on High- Performance Computer Architecture (HPCA) , 2015

  10. [10]

    High-density digital rram-based memory with bit-line compute capability,

    S. Agwa, Y . Pan, T. Abbey, A. Serb, and T. Prodromakis, “High-density digital rram-based memory with bit-line compute capability,” in 2022 IEEE International Symposium on Circuits and Systems (ISCAS) , 2022, pp. 1199–1200

  11. [11]

    A configurable tcam/bcam/sram using 28nm push-rule 6t bit cell,

    S. Jeloka, N. B. Akesh, D. Sylvester, and D. Blaauw, “A configurable tcam/bcam/sram using 28nm push-rule 6t bit cell,” in Symposium on V ery Large-Scale Integration Circuits (VLSIC) , 2015

  12. [12]

    Neural cache: Bit-serial in-cache acceleration of deep neural networks,

    C. Eckert et al. , “Neural cache: Bit-serial in-cache acceleration of deep neural networks,” in International Symposium on Computer Architecture (ISCA), 2018

  13. [13]

    Duality cache for data parallel acceleration,

    D. Fujiki, S. Mahlke, and R. Das, “Duality cache for data parallel acceleration,” in International Symposium on Computer Architecture (ISCA), 2019

  14. [14]

    Towards a reconfigurable bit-serial/bit-parallel vector accelerator using in-situ processing-in-sram,

    K. Al-Hawaj, O. Afuye, S. Agwa, A. Apsel, and C. Batten, “Towards a reconfigurable bit-serial/bit-parallel vector accelerator using in-situ processing-in-sram,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS) , 2020

  15. [15]

    Eve: Ephemeral vector engines,

    K. Al-Hawaj et al. , “Eve: Ephemeral vector engines,” in 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023, pp. 691–704

  16. [16]

    Survey of stochastic computing,

    A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Trans. Embed. Comput. Syst. , vol. 12, no. 2s, May 2013

  17. [17]

    Stochastic circuits for real-time image-processing applications,

    A. Alaghi, C. Li, and J. P. Hayes, “Stochastic circuits for real-time image-processing applications,” in Proceedings of the 50th Annual Design Automation Conference , ser. DAC ’13. New York, NY , USA: Association for Computing Machinery, 2013

  18. [18]

    Fast and accurate computation using stochastic circuits,

    A. Alaghi and J. P. Hayes, “Fast and accurate computation using stochastic circuits,” in 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE) , 2014, pp. 1–4

  19. [19]

    Winstead, Tutorial on Stochastic Computing

    C. Winstead, Tutorial on Stochastic Computing . Cham: Springer International Publishing, 2019, pp. 39–76

  20. [20]

    Deterministic stochastic computation using parallel datapaths,

    A. J. Groszewski and T. Lenz, “Deterministic stochastic computation using parallel datapaths,” in 20th International Symposium on Quality Electronic Design (ISQED) , 2019, pp. 138–144

  21. [21]

    The logic of random pulses: Stochastic computing,

    A. Alaghi, “The logic of random pulses: Stochastic computing,” Ph.D. dissertation, University of Michigan, Ann Arbor, MI, USA, 2015

  22. [22]

    A parallel bitstream generator for stochastic comput- ing,

    Y . Zhang et al. , “A parallel bitstream generator for stochastic comput- ing,” in 2019 Silicon Nanoelectronics Workshop (SNW) , 2019, pp. 1–2

  23. [23]

    Low-cost stochastic number generators for stochastic computing,

    S. A. Salehi, “Low-cost stochastic number generators for stochastic computing,” IEEE Transactions on V ery Large Scale Integration (VLSI) Systems, vol. 28, no. 4, pp. 992–1001, April 2020

  24. [24]

    The promise and challenge of stochastic computing,

    A. Alaghi, W. Qian, and J. P. Hayes, “The promise and challenge of stochastic computing,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems , vol. 37, no. 8, pp. 1515–1531, August 2018

  25. [25]

    Parallel hybrid stochastic-binary-based neural network accelerators,

    Y . Zhang, R. Wang, X. Zhang, Y . Wang, and R. Huang, “Parallel hybrid stochastic-binary-based neural network accelerators,” IEEE Transactions on Circuits and Systems II: Express Briefs , vol. 67, no. 12, pp. 3387– 3391, December 2020

  26. [26]

    FP8 Formats for Deep Learning

    P. Micikevicius et al. , “Fp8 formats for deep learning,” arXiv preprint arXiv:2209.05433, 2022

  27. [27]

    Dip: A scalable, energy-efficient systolic array for matrix multiplication acceleration,

    A. J. Abdelmaksoud, S. Agwa, and T. Prodromakis, “Dip: A scalable, energy-efficient systolic array for matrix multiplication acceleration,” IEEE Transactions on Circuits and Systems I, TCAS-I , 2025

  28. [28]

    Design flow for hybrid CMOS/memristor sys- tems—part i: Modeling and verification steps,

    S. Maheshwari et al. , “Design flow for hybrid CMOS/memristor sys- tems—part i: Modeling and verification steps,” IEEE Transactions on Circuits and Systems I: Regular Papers , vol. 68, no. 12, pp. 4862–4875, 2021

  29. [29]

    A data-driven verilog-a ReRAM model,

    I. Messaris, A. Serb, S. Stathopoulos, A. Khiat, S. Nikolaidis, and T. Prodromakis, “A data-driven verilog-a ReRAM model,”Trans. Comp.- Aided Des. Integ. Cir . Sys. , vol. 37, no. 12, p. 3151–3162, Dec. 2018

  30. [30]

    Eidetic: An in-memory matrix multiplication accelerator for neural networks,

    C. Eckert, A. Subramaniyan, X. Wang, C. Augustine, R. Iyer, and R. Das, “Eidetic: An in-memory matrix multiplication accelerator for neural networks,” IEEE Transactions on Computers , vol. 72, no. 6, pp. 1539–1553, 2023

  31. [31]

    A 51.6TFLOPs/W full-datapath CIM macro approaching sparsity bound and <2-30 loss for compound AI,

    Z. Yue et al., “A 51.6TFLOPs/W full-datapath CIM macro approaching sparsity bound and <2-30 loss for compound AI,” in 2025 IEEE International Solid-State Circuits Conference (ISSCC) , vol. 68, 2025, pp. 1–3

  32. [32]

    A 22nm 16Mb floating-point ReRAM compute-in- memory macro with 31.2TFLOPS/W for AI edge devices,

    T.-H. Wen et al. , “A 22nm 16Mb floating-point ReRAM compute-in- memory macro with 31.2TFLOPS/W for AI edge devices,” in2024 IEEE International Solid-State Circuits Conference (ISSCC), vol. 67, 2024, pp. 580–582

  33. [33]

    A 22nm 104.5TOPS/W µ-NMC- ∆-IMC hetero- geneous STT-MRAM CIM macro for noise-tolerant bayesian neural networks,

    D.-Q. You et al. , “A 22nm 104.5TOPS/W µ-NMC- ∆-IMC hetero- geneous STT-MRAM CIM macro for noise-tolerant bayesian neural networks,” in 2025 IEEE International Solid-State Circuits Conference (ISSCC), vol. 68, 2025, pp. 1–3

  34. [34]

    Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm,

    A. Stillmaker and B. Baas, “Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm,” Integration, vol. 58, pp. 74–81, 2017

  35. [35]

    Deepscaletool: A tool for the accurate estimation of technology scaling in the deep-submicron era,

    S. Sarangi and B. Baas, “Deepscaletool: A tool for the accurate estimation of technology scaling in the deep-submicron era,” in 2021 IEEE International Symposium on Circuits and Systems (ISCAS) , 2021, pp. 1–5. Shady Agwa (Member, IEEE) is a Research Fellow at the Centre for Electronics Frontiers CEF, The Uni- versity of Edinburgh (UK). He received his BS...