DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures
Pith reviewed 2026-05-25 07:31 UTC · model grok-4.3
The pith
DCC co-optimizes data rearrangements and compute code in one tuning process for ML kernels on PIM hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present DCC as the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. It integrates a multi-layer PIM abstraction to support multiple backends, enables co-optimization of data partitioning strategies with compute loop partitioning schemes, applies PIM-specific code optimizations, and uses a fast performance prediction model to select the best schedule. This produces measured speedups of up to 7.68x on HBM-PIM and 13.17x on AttAcc over GPU-only execution for individual kernels, and 4.52x average for GPT-3 and LLaMA-2 inference on AttAcc.
What carries the argument
The unified tuning process, which explores data partitioning strategies together with compute loop partitioning schemes in a single search, supported by a multi-layer PIM abstraction and a performance prediction model.
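A minimal sketch of what a unified search of this kind could look like, assuming a tiny hypothetical search space and a toy cost function. The names `DATA_PARTITIONS`, `LOOP_TILES`, `predicted_cost`, and `joint_tune`, and all numbers, are illustrative assumptions and are not taken from DCC.

```python
from itertools import product

# Hypothetical search space (not from the paper): how a tensor is split
# across PIM banks, and how the compute loop is tiled.
DATA_PARTITIONS = ["row_split", "col_split", "block_split"]
LOOP_TILES = [16, 32, 64]


def predicted_cost(data_partition: str, tile: int) -> float:
    """Toy stand-in for a performance prediction model (arbitrary units)."""
    # Cost of rearranging data into the chosen partitioning (made-up values).
    rearrange = {"row_split": 4.0, "col_split": 9.0, "block_split": 6.0}[data_partition]
    # The compute cost depends on BOTH choices: each partitioning has a tile
    # size it works best with, which is what couples the two decisions.
    preferred_tile = {"row_split": 16, "col_split": 64, "block_split": 32}[data_partition]
    compute = 10.0 + abs(tile - preferred_tile) / 8.0
    return rearrange + compute


def joint_tune():
    """Score every (data partition, loop tile) pair and return the cheapest."""
    return min(product(DATA_PARTITIONS, LOOP_TILES), key=lambda s: predicted_cost(*s))


best = joint_tune()
print(best, predicted_cost(*best))
```

The point of the sketch is only that both decisions are ranked by a single predicted total cost. A real tuner would search a far larger space and would not enumerate it exhaustively.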
If this is right
- PIM acceleration of ML kernels becomes more effective when data layout costs are considered during the same tuning pass as compute scheduling.
- A single compiler can target multiple PIM device types through the shared abstraction layer without per-device rewrites.
- End-to-end inference of large language models such as GPT-3 and LLaMA-2 runs substantially faster on PIM hardware under the joint schedule.
- The prediction model reduces the cost of finding good schedules, making the approach practical for new kernels.
Where Pith is reading between the lines
- The co-optimization principle could apply to other cases where host and accelerator memory layouts conflict, such as in near-memory or disaggregated systems.
- Widespread use of such a compiler might reduce reliance on manual, device-specific data layout tuning when deploying ML models on emerging memory hardware.
- Testing the same joint approach on kernels from domains outside ML, such as graph analytics, would reveal how broadly the interdependence holds.
Load-bearing premise
That data rearrangements and compute code optimizations are interdependent enough that optimizing them separately misses large performance gains, and that the prediction model can reliably rank schedules across kernels and devices without runtime measurement.
What would settle it
Measurements on the same ML kernels and PIM devices showing that separately optimizing data rearrangements followed by compute code produces equal or higher performance than the jointly tuned schedules from DCC.
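The comparison described above can be sketched on a toy cost table. Everything here is an assumption made for illustration: the layouts, schedules, and costs are invented, and "independent tuning" is modelled as picking the cheapest rearrangement first and then the best schedule for that layout, which is only one possible reading of a separate-optimization baseline.

```python
# Made-up costs in arbitrary units; none of these numbers come from the paper.
REARRANGE_COST = {"layout_A": 2.0, "layout_B": 5.0}
COMPUTE_COST = {
    ("layout_A", "sched_1"): 20.0,
    ("layout_A", "sched_2"): 18.0,
    ("layout_B", "sched_1"): 8.0,
    ("layout_B", "sched_2"): 12.0,
}
SCHEDULES = ["sched_1", "sched_2"]


def total(layout: str, sched: str) -> float:
    """End-to-end cost: data rearrangement plus compute."""
    return REARRANGE_COST[layout] + COMPUTE_COST[(layout, sched)]


def independent_tuning():
    """Pick the layout by rearrangement cost alone, then the best schedule for it."""
    layout = min(REARRANGE_COST, key=REARRANGE_COST.get)
    sched = min(SCHEDULES, key=lambda s: COMPUTE_COST[(layout, s)])
    return layout, sched


def joint_tuning():
    """Pick the (layout, schedule) pair with the lowest end-to-end cost."""
    return min(COMPUTE_COST, key=lambda pair: total(*pair))


ind = independent_tuning()
jnt = joint_tuning()
gap = total(*ind) / total(*jnt)
print(ind, total(*ind))
print(jnt, total(*jnt))
print(f"independent / joint cost ratio: {gap:.2f}")
```

In this contrived table the independent baseline is slower because the cheapest layout forces expensive compute. On real kernels, a ratio at or below 1.0 across the board would be the outcome that counts against the paper's premise.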
Original abstract
High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations, and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x average (up to 7.71x in LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DCC, the first data-centric ML compiler for PIM architectures. It claims that data rearrangements and compute optimizations are interdependent and must be jointly tuned; DCC uses a multi-layer PIM abstraction to support multiple backends, co-optimizes data partitioning with compute loop schemes, applies PIM-specific optimizations, and employs a fast performance prediction model to select schedules. Evaluations report up to 7.68× (2.21× avg) speedup on HBM-PIM and 13.17× (3.92× avg) on AttAcc over GPU for individual kernels, plus 4.52× average (up to 7.71×) acceleration for GPT-3/LLaMA-2 inference on AttAcc. The code is open-sourced.
Significance. If the joint-optimization necessity and the accuracy of the prediction model are substantiated, the work would be a meaningful contribution to compilation for PIM hardware, addressing a real programmability gap for memory-intensive ML kernels. The open-source release and support for multiple PIM backends are concrete strengths that would aid reproducibility and further research.
major comments (2)
- [Abstract and Evaluation (performance prediction model)] The central claim that a fast performance prediction model can reliably identify optimal joint schedules (without exhaustive runtime search) is load-bearing for all reported speedups, yet the abstract and evaluation description provide no quantitative validation such as prediction error, correlation coefficient, or held-out accuracy across kernels and devices. This must be added with concrete metrics and methodology.
- [Introduction and § on co-optimization] The assertion that data rearrangements and compute code are interdependent (necessitating unified tuning) requires explicit evidence that separate optimization yields inferior results; without a controlled comparison (e.g., independent vs. joint tuning on the same kernels), the necessity of the unified process remains unproven.
minor comments (2)
- [Abstract and Evaluation] The abstract mentions 'various individual ML kernels' but does not list them or the input sizes; the evaluation section should include a table of kernels, datasets, and exact configurations for reproducibility.
- [Evaluation] No mention of error bars, number of runs, or statistical significance for the speedup numbers; these details belong in the experimental methodology subsection.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate the requested evidence and metrics.
- Referee: [Abstract and Evaluation (performance prediction model)] The central claim that a fast performance prediction model can reliably identify optimal joint schedules (without exhaustive runtime search) is load-bearing for all reported speedups, yet the abstract and evaluation description provide no quantitative validation such as prediction error, correlation coefficient, or held-out accuracy across kernels and devices. This must be added with concrete metrics and methodology.
  Authors: We agree that quantitative validation of the performance prediction model is necessary to substantiate the central claim. In the revised manuscript we will add a dedicated subsection in the evaluation that reports prediction error (e.g., MAPE), Pearson/Spearman correlation with measured runtimes, and held-out accuracy across kernels and both PIM devices, together with the cross-validation methodology used to train and assess the model.
  Revision: yes
- Referee: [Introduction and § on co-optimization] The assertion that data rearrangements and compute code are interdependent (necessitating unified tuning) requires explicit evidence that separate optimization yields inferior results; without a controlled comparison (e.g., independent vs. joint tuning on the same kernels), the necessity of the unified process remains unproven.
  Authors: We acknowledge that an explicit controlled comparison is required to demonstrate the benefit of joint optimization. We will add an ablation study (new table or figure) that compares (i) independent tuning of data partitioning followed by compute-loop optimization against (ii) the unified co-optimization performed by DCC, using the same kernels and devices, to quantify the performance gap.
  Revision: yes
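Two of the validation metrics named in the first response, MAPE and Spearman rank correlation, can be computed as in the following sketch. The runtimes are invented for illustration, the helper names are hypothetical, and the rank helper assumes there are no tied values. This is not the authors' evaluation code.

```python
def mape(predicted, measured):
    """Mean absolute percentage error of predictions against measurements."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(predicted, measured):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(measured)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(predicted), ranks(measured)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Made-up runtimes in milliseconds for four candidate schedules.
measured = [10.0, 20.0, 40.0, 80.0]
predicted = [11.0, 18.0, 44.0, 72.0]   # 10% off each, same ordering
misranked = [11.0, 44.0, 18.0, 72.0]   # two schedules swapped in rank

print(f"MAPE: {mape(predicted, measured):.1f}%")
print(f"Spearman (same order): {spearman(predicted, measured):.2f}")
print(f"Spearman (two swapped): {spearman(misranked, measured):.2f}")
```

For schedule selection the rank correlation arguably matters more than the absolute error, since a model that is uniformly 10% off still picks the right schedule as long as the ordering is preserved.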
Circularity Check
No significant circularity detected
The abstract and context present DCC's speedups as outcomes of empirical hardware evaluation on HBM-PIM and AttAcc, with the performance prediction model described as a selection tool rather than a fitted input renamed as a prediction. No equations, self-definitions, or self-citation chains are shown that reduce claimed results to inputs by construction. The interdependence claim and joint optimization are presented as design rationale supported by evaluation, not tautological. This is the common case of a self-contained empirical systems paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Host processors require consecutive elements distributed across DRAM banks while PIM cores require consecutive elements within their local banks, necessitating data rearrangements.
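The layout mismatch behind this assumption can be illustrated with a deliberately simplified model. The sketch assumes a flat array whose length is divisible by the number of banks, a pure round-robin interleaving for the Host, and one contiguous chunk per bank for PIM. Real DRAM address mappings (channels, bank groups, rows, columns) are more involved, and the function names are hypothetical.

```python
def host_layout(i: int, num_banks: int):
    """Interleaved: consecutive elements land in consecutive banks."""
    return i % num_banks, i // num_banks          # (bank, offset within bank)


def pim_layout(i: int, num_banks: int, n: int):
    """Bank-local: each bank holds one contiguous chunk of the array."""
    chunk = n // num_banks                        # assumes n % num_banks == 0
    return i // chunk, i % chunk                  # (bank, offset within bank)


def elements_to_move(n: int, num_banks: int) -> int:
    """Count elements whose bank differs between the two layouts."""
    return sum(
        host_layout(i, num_banks)[0] != pim_layout(i, num_banks, n)[0]
        for i in range(n)
    )


n, banks = 16, 4
for i in range(n):
    print(i, "host:", host_layout(i, banks), "pim:", pim_layout(i, banks, n))
print("elements that change bank:", elements_to_move(n, banks))
```

Under these assumptions, 12 of the 16 elements sit in a different bank in the two layouts, which gives a rough sense of why the rearrangement is not free.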