A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
Pith reviewed 2026-05-20 02:19 UTC · model grok-4.3
The pith
NEM-GNN is a digital processing-in-memory design that accelerates graph neural networks through sparsity-aware near-memory aggregation and early compute termination.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NEM-GNN demonstrates a scalable digital near-memory accelerator that performs graph and sparsity-aware aggregation using a compute-as-soon-as-ready execution model together with broadcast communication, early termination, and reconfigurable pre-computation to eliminate analog conversion overheads and reduce data movement.
What carries the argument
The compute-as-soon-as-ready (CAR) and broadcast-based execution model for near-memory aggregation, which activates operations on graph nodes only when their inputs arrive and propagates results efficiently across the memory array.
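As a rough software analogy, the sketch below consumes each node's feature vector as soon as it is marked ready and broadcasts it to every destination, skipping zero values along the way. The function name `car_aggregate`, the dictionary-based graph, the `ready_order` argument, and the zero-skipping rule are all assumptions made for illustration; the paper's hardware mechanism may differ in every detail.

```python
def car_aggregate(neighbors, features, ready_order):
    """Sum-aggregate features, consuming each one when it becomes ready.

    neighbors:   dict node -> list of nodes that receive this node's broadcast
    features:    dict node -> feature vector (list of floats)
    ready_order: hypothetical order in which node features become available
    """
    dim = len(next(iter(features.values())))
    acc = {node: [0.0] * dim for node in features}
    ops = 0  # number of accumulate operations actually performed

    for src in ready_order:
        vec = features[src]
        # Sparsity check: an all-zero vector contributes nothing, so skip it.
        if not any(vec):
            continue
        # Broadcast: every destination accumulates immediately instead of
        # waiting for the rest of its inputs to arrive.
        for dst in neighbors.get(src, []):
            for i, x in enumerate(vec):
                if x != 0.0:  # skip individual zero elements as well
                    acc[dst][i] += x
                    ops += 1
    return acc, ops


# Toy graph with four directed edges and two feature dimensions.
neighbors = {0: [1, 2], 1: [2], 2: [0]}
features = {0: [1.0, 0.0], 1: [0.0, 0.0], 2: [2.0, 3.0]}
acc, ops = car_aggregate(neighbors, features, ready_order=[2, 0, 1])
print(acc, ops)
```

Because summation is commutative, the result does not depend on `ready_order`. In this toy case the sparsity checks leave 4 accumulations, where a dense pass over four edges and two feature dimensions would need 8.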
If this is right
- GNN training and inference for large citation or molecular graphs become feasible with substantially lower total energy.
- Hardware designs can achieve higher operations per square millimeter without relying on analog circuits.
- Reconfigurable components allow the same accelerator to adapt to different graph structures and sparsity levels at runtime.
- System-on-chip integration simplifies because the design avoids dedicated analog blocks and uses standard digital flows.
Where Pith is reading between the lines
- The broadcast model may prove especially effective for graphs with community structure, suggesting targeted benchmarks on social or biological networks.
- Similar sparsity-aware near-memory techniques could transfer to other irregular workloads such as sparse linear algebra or graph analytics outside neural networks.
- Scaling the design to multi-chip modules would require new mechanisms to handle inter-chip graph partitioning while preserving the early-termination benefits.
Load-bearing premise
The large reported gains in speed and efficiency rest on comparisons to prior accelerators that use matching technology nodes, identical workloads, and unbiased baseline implementations.
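The authors' rebuttal on this page says reported baseline numbers were scaled to a common 28 nm node with Dennard scaling factors. The sketch below shows what ideal constant-field scaling would do to a delay and an energy figure; the function name `dennard_normalize` and the input numbers are hypothetical, and the exact factors used in the paper are not given here.

```python
def dennard_normalize(delay_ns, energy_pj, from_nm, to_nm):
    """Rescale a reported delay and energy per operation to another node.

    Assumes ideal constant-field (Dennard) scaling with factor
    k = from_nm / to_nm: delay scales by 1/k and energy per operation
    by 1/k**3. This is a first-order assumption, not the paper's method.
    """
    k = from_nm / to_nm
    return delay_ns / k, energy_pj / k ** 3


# Hypothetical baseline reported at 56 nm, rescaled to 28 nm (k = 2).
delay, energy = dennard_normalize(100.0, 800.0, from_nm=56, to_nm=28)
print(delay, energy)
```

Under this idealized rule a 2x node gap alone shifts an energy comparison by 8x, so the scaling assumption can matter as much as the raw numbers. Modern nodes depart from ideal constant-field scaling, so this is a first-order illustration only.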
What would settle it
Fabricating NEM-GNN and the compared prior accelerators in the same semiconductor process and measuring their performance and energy on identical GNN benchmarks would directly test whether the claimed 80-230x speedups and 850-1134x energy gains hold.
Original abstract
Graph neural networks (GNNs) have gained significant interest for applications such as citation network analysis and drug discovery due to their ability to apply machine learning techniques on graph-structured data. GNNs typically employ a two-stage execution pipeline consisting of combination and aggregation kernels. The combination stage performs data-intensive convolution operations with relatively regular memory access patterns, whereas the aggregation stage operates on sparse graph data with highly irregular accesses. These heterogeneous memory behaviors make conventional CPU- and GPU-based execution energy inefficient due to substantial data movement overheads. Existing accelerators attempt to mitigate these challenges using specialized architectures and processing-in-memory (PIM) techniques. However, prior approaches often suffer from scalability limitations, area overheads, restricted parallelism, and energy inefficiencies associated with analog compute and dedicated accelerator structures. This paper presents NEM-GNN, a scalable DAC/ADC-less processing-in-memory architecture for graph neural network acceleration. The proposed design introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation using a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results demonstrate that NEM-GNN achieves approximately 80--230x higher performance, 80--300x higher throughput, 850--1134x better energy efficiency, and 7--8x higher compute density compared to prior state-of-the-art approaches.
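The two-stage pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration that assumes a plain dense matrix product for combination and an unnormalized sum over incoming edges for aggregation; the names `combine` and `aggregate` and the toy data are hypothetical and not taken from the paper.

```python
def combine(features, weight):
    """Combination: dense feature-by-weight product with regular accesses."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns]
            for row in features]


def aggregate(edges, features):
    """Aggregation: sum over incoming edges; accesses follow the graph."""
    out = [[0.0] * len(features[0]) for _ in features]
    for src, dst in edges:
        for i, x in enumerate(features[src]):
            out[dst][i] += x
    return out


# Toy inputs: 3 nodes with 2 features each, a 2x2 weight matrix, 3 edges.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
W = [[1.0, 0.0], [0.0, 2.0]]
edges = [(0, 1), (2, 1), (1, 0)]

hidden = combine(X, W)           # touches every row in order
out = aggregate(edges, hidden)   # touches rows in edge order
print(out)
```

The contrast to notice is in the access pattern: `combine` walks the feature matrix row by row, while `aggregate` jumps between rows in whatever order the edge list dictates.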
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents NEM-GNN, a scalable DAC/ADC-less processing-in-memory architecture for graph neural network acceleration. It introduces early compute termination mechanisms, pre-computation using reconfigurable system-on-chip components, and graph- and sparsity-aware near-memory aggregation via a compute-as-soon-as-ready (CAR) and broadcast-based execution model. Experimental results are claimed to show 80--230x higher performance, 80--300x higher throughput, 850--1134x better energy efficiency, and 7--8x higher compute density versus prior state-of-the-art accelerators.
Significance. If the reported gains are shown to rest on fair, node-matched, and fully re-implemented baselines, the work would constitute a meaningful advance in digital near-memory accelerators for irregular GNN workloads by reducing data-movement costs and avoiding analog compute overheads. The emphasis on reconfigurability and sparsity awareness is a positive differentiator from prior PIM designs.
major comments (2)
- [Experimental Evaluation] Experimental Evaluation section: The headline performance and energy-efficiency claims (80--230x and 850--1134x) are load-bearing for the central contribution. The manuscript does not state whether all cited prior accelerators were re-implemented at the identical process node, with identical workload graphs, memory models, and clock/voltage assumptions as NEM-GNN; any mismatch would directly undermine the reported ratios.
- [Results] Results section, Table or Figure reporting speedups: No error bars, workload selection criteria, or baseline re-implementation details are provided, making it impossible to assess whether the 80--300x throughput and 7--8x compute-density numbers are robust or sensitive to undisclosed simulation assumptions.
minor comments (1)
- [Abstract] Abstract: The phrase 'approximately 80--230x' is used without reference to the specific technology node or number of workloads; adding a short parenthetical note on these parameters would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the manuscript to improve transparency in the experimental methodology and results presentation.
- Referee: [Experimental Evaluation] Experimental Evaluation section: The headline performance and energy-efficiency claims (80--230x and 850--1134x) are load-bearing for the central contribution. The manuscript does not state whether all cited prior accelerators were re-implemented at the identical process node, with identical workload graphs, memory models, and clock/voltage assumptions as NEM-GNN; any mismatch would directly undermine the reported ratios.
  Authors: We agree that explicit documentation of baseline comparison methodology is necessary to support the headline claims. The original manuscript used performance numbers as reported in the cited prior works, scaled to a common 28 nm process node via standard Dennard scaling factors from the literature, while employing the same public graph datasets (Cora, CiteSeer, PubMed, and synthetic graphs matching the sparsity distributions in the original papers). Full gate-level re-implementation of every baseline was not performed because several prior designs lack open-source RTL or detailed microarchitectural descriptions. In the revised Experimental Evaluation section we now state this methodology explicitly, list the exact scaling assumptions, and add a short discussion of the resulting limitations on the reported ratios.
  Revision: yes
- Referee: [Results] Results section, Table or Figure reporting speedups: No error bars, workload selection criteria, or baseline re-implementation details are provided, making it impossible to assess whether the 80--300x throughput and 7--8x compute-density numbers are robust or sensitive to undisclosed simulation assumptions.
  Authors: We accept the referee’s observation that additional statistical and methodological detail is required. The revised Results section now includes error bars on all speedup, throughput, and energy-efficiency plots; these bars represent one standard deviation across five independent simulation runs that vary graph partitioning seeds and memory access latency within the modeled range. We have also inserted a new paragraph describing workload selection criteria (graphs chosen to span two orders of magnitude in vertex count and edge sparsity while remaining representative of real-world GNN applications) and have cross-referenced the baseline re-implementation details added to the Experimental Evaluation section.
  Revision: yes
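The error bars described in the second response (one standard deviation across five runs) could be computed along the following lines. This is a sketch with made-up runtimes; the function name `speedup_with_error` is hypothetical, and the choice of the sample standard deviation is an assumption, since the rebuttal does not say which estimator was used.

```python
from statistics import mean, stdev


def speedup_with_error(baseline_times, design_times):
    """Return mean speedup and its sample standard deviation across runs.

    Each pair (baseline, design) is one simulation run, for example with a
    different graph-partitioning seed. All inputs here are hypothetical.
    """
    speedups = [b / d for b, d in zip(baseline_times, design_times)]
    return mean(speedups), stdev(speedups)


# Five made-up runs: baseline runtimes vary, design runtime is fixed.
baseline = [400.0, 320.0, 400.0, 500.0, 400.0]
design = [4.0, 4.0, 4.0, 4.0, 4.0]
avg, err = speedup_with_error(baseline, design)
print(f"speedup = {avg:.1f} +/- {err:.1f}")
```

With only five runs the sample standard deviation is itself a noisy estimate, so reporting the individual per-run speedups alongside the bars would make the spread easier to judge.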
Circularity Check
No significant circularity: claims are experimental results from hardware design
The paper presents NEM-GNN as a hardware architecture with specific features like early compute termination, CAR execution, and broadcast-based aggregation, then reports measured speedups and efficiency gains from simulations. No mathematical derivation chain, equations, or first-principles predictions appear in the provided abstract or description; performance numbers are framed as outcomes of the proposed design evaluated against external baselines rather than quantities defined or fitted from within the paper's own inputs. Self-citations, if present for prior PIM work, are not load-bearing for the central claims, because the evaluation relies on re-simulation and comparison to independent prior accelerators.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  NEM-GNN ... scalable DAC/ADC-less processing-in-memory architecture ... early compute termination ... compute-as-soon-as-ready (CAR) and broadcast-based execution model
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  bit-serial PIM ... Jcost not referenced; no mention of recognition cost or phi-ladder
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.