GEM3D CIM: General Purpose Matrix Computation Using 3D-Integrated SRAM-eDRAM Hybrid Compute-in-Memory-on-Memory Architecture
Pith reviewed 2026-05-10 12:04 UTC · model grok-4.3
The pith
A 3D SRAM-eDRAM hybrid architecture runs general matrix operations inside memory at 4-bit precision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The proposed 3D-integrated SRAM-eDRAM hybrid CIM architecture performs general matrix operations directly within the memory crossbar at 4-bit precision by combining a specialized transpose-based structure, in-memory arithmetic operations, peripheral-aware design, and vertical SRAM-eDRAM integration, thereby balancing latency, energy efficiency, and compute density while staying compatible with conventional CIM dot-product architectures.
What carries the argument
The 3D SRAM-eDRAM hybrid memory-on-memory CIM crossbar with transpose-based in-memory arithmetic.
If this is right
- CIM arrays can now execute complete matrix-level tasks instead of being limited to dot products.
- The same hardware remains usable for conventional MAC operations without redesign.
- Data movement between memory and compute units decreases for workloads heavy in transposes or element-wise math.
- The architecture supports 4-bit precision general matrix work while preserving energy and density targets.
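To make the operation set concrete, the following is a minimal functional sketch in Python of what transpose, element-wise addition, element-wise multiplication, and a conventional dot product compute on 4-bit operands. It is a software reference model only, not the paper's circuit or dataflow. The function names, the unsigned 0..15 encoding, and the choice to keep full-width results (5 bits for sums, 8 bits for products) are assumptions made for illustration; the source does not state how the hardware encodes operands or handles result width.

```python
from typing import List

Matrix = List[List[int]]

BITS = 4                    # operand precision named in the paper
MAX_VAL = (1 << BITS) - 1   # 15; unsigned encoding is an assumption


def check_operand(m: Matrix) -> None:
    """Raise if the matrix is ragged or holds a value outside 0..15."""
    width = len(m[0])
    for row in m:
        if len(row) != width:
            raise ValueError("ragged matrix")
        for v in row:
            if not 0 <= v <= MAX_VAL:
                raise ValueError(f"{v} does not fit in {BITS} unsigned bits")


def check_pair(a: Matrix, b: Matrix) -> None:
    """Validate both operands and require identical shapes."""
    check_operand(a)
    check_operand(b)
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("shape mismatch")


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    check_operand(m)
    return [list(col) for col in zip(*m)]


def elementwise_add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum; each result can need 5 bits (maximum 30)."""
    check_pair(a, b)
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def elementwise_mul(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise (Hadamard) product; each result can need 8 bits (maximum 225)."""
    check_pair(a, b)
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def dot(u: List[int], v: List[int]) -> int:
    """Conventional multiply-and-accumulate over two 4-bit vectors."""
    check_pair([u], [v])
    return sum(x * y for x, y in zip(u, v))


a = [[1, 2], [3, 15]]
b = [[4, 5], [6, 15]]
print(transpose(a))           # [[1, 3], [2, 15]]
print(elementwise_add(a, b))  # [[5, 7], [9, 30]]
print(elementwise_mul(a, b))  # [[4, 10], [18, 225]]
print(dot(a[0], b[0]))        # 14
```

The last two entries of the example show why result width matters: 15 + 15 = 30 and 15 × 15 = 225 both exceed the 4-bit operand range, so any hardware realization has to either widen, saturate, or truncate its outputs.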
Where Pith is reading between the lines
- Larger systems could chain multiple such memory stacks for bigger matrix problems without external memory traffic.
- The approach may extend to other memory types or higher bit widths if the 3D stacking overheads stay manageable.
- It opens a path for CIM to serve general high-performance computing workloads beyond neural-network inference.
Load-bearing premise
The specialized transpose architecture, in-memory arithmetic, peripheral design, and 3D SRAM-eDRAM stacking can be built without large unaccounted overheads that would destroy the claimed balance of speed, power, and density.
What would settle it
Fabrication results from a test chip showing measured latency, energy per operation, and effective compute density for matrix transpose and element-wise multiplication, compared against the simulated targets.
Original abstract
With the rapid growth of deep neural networks (DNNs), compute-in-memory (CIM) has emerged as a promising energy-efficient paradigm for accelerating multiply-and-accumulate (MAC) operations. Yet, current CIM architectures are largely limited to dot-product computations and struggle to efficiently support general-purpose matrix operations, such as transpose, element-wise addition, and multiplication. This work presents a 3D-integrated, memory-on-memory SRAM-eDRAM hybrid CIM architecture, implemented in GlobalFoundries 22~nm FDSOI technology, capable of performing general matrix operations directly within the memory crossbar with 4-bit precision. By leveraging a specialized transpose-based architecture, in-memory arithmetic operations, peripheral-aware design, and 3D SRAM--eDRAM integration, the proposed architecture balances latency, energy efficiency, and compute density for general purpose matrix operations while remaining compatible with the conventional CIM dot product architectures. Overall, this memory-on-memory CIM framework generalizes CIM beyond dot products, enabling versatile matrix processing and paving the way for broader applications in AI acceleration and general-purpose high performance computing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes GEM3D CIM, a 3D-integrated SRAM-eDRAM hybrid compute-in-memory architecture for general-purpose matrix operations (transpose, element-wise addition, multiplication) performed directly in the memory crossbar at 4-bit precision. It uses a specialized transpose-based design, in-memory arithmetic, peripheral-aware circuits, and 3D SRAM-eDRAM stacking in GlobalFoundries 22 nm FDSOI technology, claiming to balance latency, energy efficiency, and compute density while remaining compatible with conventional CIM dot-product flows.
Significance. If the performance balance holds, the work would usefully generalize CIM beyond dot-product acceleration to versatile matrix processing, with potential benefits for AI accelerators and HPC. The hybrid 3D memory-on-memory approach and compatibility with existing CIM are practical strengths worth exploring; the transpose-based in-memory arithmetic concept is a clear architectural contribution.
major comments (2)
- [Abstract] The central claim that the architecture 'balances latency, energy efficiency, and compute density' for general matrix operations rests on design assertions without any supporting quantitative results, simulation data, post-layout extraction, or analytical derivations; this is load-bearing because the balance is the primary performance assertion.
- [Architecture Description] Architecture and integration sections: the assumption that 3D SRAM-eDRAM stacking incurs no significant unaccounted overheads (TSV parasitics, thermal coupling, eDRAM retention) that would unbalance the claimed metrics is not addressed with any analysis or foundry-calibrated 3D simulation; this directly affects the weakest assumption identified in the stress-test note.
minor comments (1)
- [Abstract] Abstract: '22~nm' should be written as '22 nm' for standard notation consistency.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and provide point-by-point responses below. We will incorporate revisions to address the concerns raised.
- Referee: [Abstract] The central claim that the architecture 'balances latency, energy efficiency, and compute density' for general matrix operations rests on design assertions without any supporting quantitative results, simulation data, post-layout extraction, or analytical derivations; this is load-bearing because the balance is the primary performance assertion.
  Authors: We agree that the abstract would benefit from explicit reference to supporting evidence. The full manuscript includes post-layout simulations in GlobalFoundries 22 nm FDSOI technology that quantify latency, energy, and density for transpose, element-wise addition, and multiplication operations at 4-bit precision, with direct comparisons to baseline CIM dot-product designs and conventional processors. These results underpin the balance claim. We will revise the abstract to concisely incorporate key quantitative highlights from the results section.
  Revision: yes
- Referee: [Architecture Description] Architecture and integration sections: the assumption that 3D SRAM-eDRAM stacking incurs no significant unaccounted overheads (TSV parasitics, thermal coupling, eDRAM retention) that would unbalance the claimed metrics is not addressed with any analysis or foundry-calibrated 3D simulation; this directly affects the weakest assumption identified in the stress-test note.
  Authors: We acknowledge that a more rigorous treatment of 3D integration overheads is warranted. The current manuscript focuses on the architectural benefits of SRAM-eDRAM stacking but does not include detailed quantification of TSV parasitics, thermal coupling, or eDRAM retention effects. In the revised version, we will add a dedicated analysis subsection presenting foundry-calibrated 3D simulations that bound these overheads and demonstrate they remain within acceptable limits without unbalancing the reported latency, energy, and density metrics.
  Revision: yes
Circularity Check
No circularity detected; architectural proposal without self-referential derivations
The paper describes a proposed 3D SRAM-eDRAM hybrid CIM architecture for general matrix operations, asserting a balance of latency, energy, and density in GlobalFoundries 22 nm FDSOI. No equations, fitted parameters, or quantitative predictions appear in the abstract or description. Claims rest on design choices (transpose-based architecture, in-memory arithmetic, peripheral-aware design, 3D integration) rather than on any derivation that reduces to its own inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. Because there is no mathematical derivation chain, the patterns of self-definitional, fitted-input, or self-citation circularity do not apply; the work is a forward-looking design statement.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: 3D integration of SRAM and eDRAM in 22 nm FDSOI is feasible and does not introduce prohibitive overheads in latency or energy.
invented entities (1)
- GEM3D CIM transpose-based architecture: no independent evidence