DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

Christina Giannoula; Gennady Pekhimenko; Ivan Fernandez; Mohammad Sadrosadati; Onur Mutlu; Peiming Yang; Sankeerth Durvasula

arxiv: 2511.15503 · v2 · pith:DZESHEGRnew · submitted 2025-11-19 · 💻 cs.AR · cs.DC · cs.LG · cs.PF

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

Pith reviewed 2026-05-25 07:31 UTC · model grok-4.3

classification 💻 cs.AR · cs.DC · cs.LG · cs.PF

keywords: data-centric compilation · processing-in-memory · machine learning kernels · data rearrangement · compiler optimization · PIM architectures · LLM inference · performance prediction


The pith

DCC co-optimizes data rearrangements and compute code in one tuning process for ML kernels on PIM hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that data rearrangements required to align host and PIM memory layouts are interdependent with compute code optimizations, so they must be handled together rather than in separate steps. DCC implements this joint approach through a multi-layer abstraction that works across PIM devices, data and loop partitioning co-exploration, and a performance prediction model that selects schedules without exhaustive testing. If the interdependence holds, the result is faster execution of memory-intensive ML kernels and large language models on PIM hardware compared with GPU-only runs. A sympathetic reader would care because this addresses programmability and performance barriers that have limited practical use of high-bandwidth PIM for modern ML workloads. The compiler is presented as the first to make such co-optimization systematic and device-portable.

Core claim

The authors present DCC as the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. It integrates a multi-layer PIM abstraction to support multiple backends, enables co-optimization of data partitioning strategies with compute loop partitioning schemes, applies PIM-specific code optimizations, and uses a fast performance prediction model to select the best schedule. This produces measured speedups of up to 7.68x on HBM-PIM and 13.17x on AttAcc over GPU-only execution for individual kernels, and 4.52x average for GPT-3 and LLaMA-2 inference on AttAcc.

What carries the argument

The unified tuning process that simultaneously explores data partitioning strategies with compute loop partitioning schemes, supported by a multi-layer PIM abstraction and a performance prediction model.

If this is right

PIM acceleration of ML kernels becomes more effective when data layout costs are considered during the same tuning pass as compute scheduling.
A single compiler can target multiple PIM device types through the shared abstraction layer without per-device rewrites.
End-to-end inference of large language models such as GPT-3 and LLaMA-2 runs substantially faster on PIM hardware under the joint schedule.
The prediction model reduces the cost of finding good schedules, making the approach practical for new kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The co-optimization principle could apply to other cases where host and accelerator memory layouts conflict, such as in near-memory or disaggregated systems.
Widespread use of such a compiler might reduce reliance on manual, device-specific data layout tuning when deploying ML models on emerging memory hardware.
Testing the same joint approach on kernels from domains outside ML, such as graph analytics, would reveal how broadly the interdependence holds.

Load-bearing premise

That data rearrangements and compute code optimizations are interdependent enough that optimizing them separately misses large performance gains, and that the prediction model can reliably rank schedules across kernels and devices without runtime measurement.

What would settle it

Measurements on the same ML kernels and PIM devices showing that separately optimizing data rearrangements followed by compute code produces equal or higher performance than the jointly tuned schedules from DCC.
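
The comparison described above can be pictured with a toy search. This is a minimal sketch, not DCC's tuner: the partition names (`row`, `col`, `tile_k`, `tile_n`) are hypothetical and the cost numbers are made up purely to show how a separate, two-step search can miss the schedule that a joint search finds.

```python
from itertools import product

# Made-up illustrative costs (arbitrary time units), NOT measurements from the paper.
REARRANGE_COST = {"row": 1.0, "col": 3.0}   # cost of each hypothetical data partitioning
COMPUTE_COST = {                            # compute cost per (data, loop) partitioning pair
    ("row", "tile_k"): 9.0,
    ("row", "tile_n"): 8.0,
    ("col", "tile_k"): 2.0,
    ("col", "tile_n"): 5.0,
}
DATA_PARTS = list(REARRANGE_COST)
LOOP_PARTS = ["tile_k", "tile_n"]


def total_cost(data_part, loop_part):
    """End-to-end cost: data rearrangement plus compute."""
    return REARRANGE_COST[data_part] + COMPUTE_COST[(data_part, loop_part)]


def tune_separately():
    """Pick the cheapest rearrangement first, then the best loop scheme for it."""
    data_part = min(DATA_PARTS, key=REARRANGE_COST.get)
    loop_part = min(LOOP_PARTS, key=lambda lp: COMPUTE_COST[(data_part, lp)])
    return (data_part, loop_part), total_cost(data_part, loop_part)


def tune_jointly():
    """Search every (data, loop) pair and minimize the end-to-end cost."""
    best = min(product(DATA_PARTS, LOOP_PARTS), key=lambda s: total_cost(*s))
    return best, total_cost(*best)


if __name__ == "__main__":
    print("separate:", tune_separately())
    print("joint:   ", tune_jointly())
```

With these invented numbers the two-step search settles on the cheap rearrangement and pays for it in compute, while the joint search accepts a costlier rearrangement to reach a lower total. Real measurements in which both searches land on equally good schedules would count against the premise.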

Figures

Figures reproduced from arXiv: 2511.15503 by Christina Giannoula, Gennady Pekhimenko, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Peiming Yang, Sankeerth Durvasula.

**Figure 3.** High-level overview of DCC components. Users develop ML kernels using the DCC API, which enables execution on target PIM backends. At compile time, DCC analyzes ML kernels in the model and trains its coupled predictor. At runtime, once input tensor dimensions are known (e.g., token counts in LLMs), DCC uses its pre-trained coupled predictor to generate optimized schedules for all PIM-running ker…

**Figure 2.** Normalized breakdown of compute and data rearrange…

**Figure 4.** Example schedule generation process for GEMV.

**Figure 5.** An example of replacing the QKV generation layer in the GPT-3 13B model with a PIM-running layer. Lines 1-8 define the PIM-running kernel with for loops. Line 10 creates class QKV Layer inheriting from the torch.nn.Module and DCC.Layer. In lines 11-14, DCC initializes the kernel and model parameters. Lines 16-18 pre-load weights and bias to PIM devices after having selected the best-performing draft. Lin…

**Figure 6.** Speedup of AttAcc and AttAcc+DCC over GPU for ATTN, GEMV and RED kernels, when varying the tensor sizes. 5.2 Gbps per pin and running at 333 MHz. Each HBM has 16 pseudo-channels and 64 banks per pCH. In AttAcc, each bank is equipped with a GEMV unit and each channel has a softmax unit. In HBM-PIM, every two banks share a 16-way FP16 FPU and two 16×256-bit GRF registers (one per bank). ML Kernels and Models. …

**Figure 7.** Speedup of HBM-PIM and HBM-PIM+DCC over GPU for GEMV, RED, VA and RELU, varying the tensor sizes.

**Figure 8.** Normalized time breakdown for GPT3-13B and LLaMA2-33B models for various input, output token and batch sizes.

**Figure 9.** Speedup of DCC over HBM-PIM (left) and AttAcc (right), when increasing the batch size in the GEMV kernel.

**Figure 10.** Normalized time breakdown of compute and data…

Original abstract

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent, and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that jointly co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations, and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations in various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x average (up to 7.71x in LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DCC claims the first joint data-compute compiler for PIM but its speedups depend on an untested prediction model.


The main thing to know is that DCC is presented as the first compiler to treat data rearrangements and compute scheduling as interdependent for PIM hardware, optimizing them together in one tuning pass with a performance model to pick the schedule. It targets the layout mismatch where host processors want consecutive elements across banks but PIM cores want them local, and it adds support for multiple PIM devices through a layered abstraction plus PIM-specific code tweaks.

The reported numbers are up to 7.68x on HBM-PIM and 13.17x on AttAcc for kernels, plus 4.52x average on GPT-3 and LLaMA-2 inference, all over GPU baselines, and the code is open-sourced. That is concrete and addresses a real programmability gap in emerging PIM setups for ML. The open-source part and the multi-backend design are the parts that stand out as useful.

The soft spot is the fast performance prediction model that is supposed to replace exhaustive search. The entire speedup story rests on this model correctly ranking joint schedules across kernels and devices. The abstract gives no correlation, error bounds, or validation numbers for it, so it is impossible to tell whether the claimed gains come from the joint approach or from something else. Experimental details on kernel choice, run counts, variance, and exact baselines are also absent from the summary, which leaves the strength of the evidence unclear.

This paper is for architecture and systems people who work on PIM compilers or ML acceleration on near-memory hardware. It shows honest engagement with a practical constraint and ships a working framework, so it deserves a serious referee even though the model validation will need close checking. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces DCC, the first data-centric ML compiler for PIM architectures. It claims that data rearrangements and compute optimizations are interdependent and must be jointly tuned; DCC uses a multi-layer PIM abstraction to support multiple backends, co-optimizes data partitioning with compute loop schemes, applies PIM-specific optimizations, and employs a fast performance prediction model to select schedules. Evaluations report up to 7.68× (2.21× avg) speedup on HBM-PIM and 13.17× (3.92× avg) on AttAcc over GPU for individual kernels, plus 4.52× average (up to 7.71×) acceleration for GPT-3/LLaMA-2 inference on AttAcc. The code is open-sourced.

Significance. If the joint-optimization necessity and the accuracy of the prediction model are substantiated, the work would be a meaningful contribution to compilation for PIM hardware, addressing a real programmability gap for memory-intensive ML kernels. The open-source release and support for multiple PIM backends are concrete strengths that would aid reproducibility and further research.

major comments (2)

[Abstract and Evaluation (performance prediction model)] The central claim that a fast performance prediction model can reliably identify optimal joint schedules (without exhaustive runtime search) is load-bearing for all reported speedups, yet the abstract and evaluation description provide no quantitative validation such as prediction error, correlation coefficient, or held-out accuracy across kernels and devices. This must be added with concrete metrics and methodology.
[Introduction and § on co-optimization] The assertion that data rearrangements and compute code are interdependent (necessitating unified tuning) requires explicit evidence that separate optimization yields inferior results; without a controlled comparison (e.g., independent vs. joint tuning on the same kernels), the necessity of the unified process remains unproven.

minor comments (2)

[Abstract and Evaluation] The abstract mentions 'various individual ML kernels' but does not list them or the input sizes; the evaluation section should include a table of kernels, datasets, and exact configurations for reproducibility.
[Evaluation] No mention of error bars, number of runs, or statistical significance for the speedup numbers; these details belong in the experimental methodology subsection.
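
The run-to-run statistics this comment asks for could be reported along the following lines. This is a minimal sketch under stated assumptions: `summarize` and `speedup` are hypothetical helper names, the runtimes in the usage lines are placeholder values rather than results from the paper, and the paper may use a different protocol.

```python
import statistics


def summarize(runtimes_ms):
    """Return (mean, sample standard deviation) over repeated runs of one configuration."""
    return statistics.mean(runtimes_ms), statistics.stdev(runtimes_ms)


def speedup(baseline_ms, candidate_ms):
    """Speedup of the candidate over the baseline, computed from mean runtimes."""
    return statistics.mean(baseline_ms) / statistics.mean(candidate_ms)


if __name__ == "__main__":
    # Placeholder runtimes in milliseconds, for illustration only.
    gpu_runs = [20.0, 22.0, 24.0]
    pim_runs = [10.0, 12.0, 14.0]
    print("GPU mean/stdev:", summarize(gpu_runs))
    print("PIM mean/stdev:", summarize(pim_runs))
    print("speedup:", speedup(gpu_runs, pim_runs))
```

Reporting the run count and the standard deviation next to each speedup would let a reader judge whether differences between configurations exceed the measurement noise.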

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate the requested evidence and metrics.


Referee: [Abstract and Evaluation (performance prediction model)] The central claim that a fast performance prediction model can reliably identify optimal joint schedules (without exhaustive runtime search) is load-bearing for all reported speedups, yet the abstract and evaluation description provide no quantitative validation such as prediction error, correlation coefficient, or held-out accuracy across kernels and devices. This must be added with concrete metrics and methodology.

Authors: We agree that quantitative validation of the performance prediction model is necessary to substantiate the central claim. In the revised manuscript we will add a dedicated subsection in the evaluation that reports prediction error (e.g., MAPE), Pearson/Spearman correlation with measured runtimes, and held-out accuracy across kernels and both PIM devices, together with the cross-validation methodology used to train and assess the model.

revision: yes

Referee: [Introduction and § on co-optimization] The assertion that data rearrangements and compute code are interdependent (necessitating unified tuning) requires explicit evidence that separate optimization yields inferior results; without a controlled comparison (e.g., independent vs. joint tuning on the same kernels), the necessity of the unified process remains unproven.

Authors: We acknowledge that an explicit controlled comparison is required to demonstrate the benefit of joint optimization. We will add an ablation study (new table or figure) that compares (i) independent tuning of data partitioning followed by compute-loop optimization against (ii) the unified co-optimization performed by DCC, using the same kernels and devices, to quantify the performance gap.

revision: yes
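
Two of the validation metrics named in the first response, MAPE and Spearman rank correlation, can be computed as sketched below. This is an illustrative, dependency-free version that assumes no tied values; the function names are hypothetical and nothing here reflects how the paper's predictor is actually evaluated.

```python
def mape(measured, predicted):
    """Mean absolute percentage error of predictions, in percent."""
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(measured, predicted):
    """Spearman rank correlation using the no-ties formula."""
    n = len(measured)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(measured), ranks(predicted)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


if __name__ == "__main__":
    # Placeholder runtimes for four candidate schedules, for illustration only.
    measured = [10.0, 20.0, 40.0, 50.0]
    predicted = [11.0, 18.0, 44.0, 45.0]
    print("MAPE (%):", mape(measured, predicted))
    print("Spearman:", spearman(measured, predicted))
```

The two metrics answer different questions. A predictor that doubles every runtime has a MAPE of 100% yet a Spearman correlation of 1, and it would still pick the fastest schedule, so rank agreement is the more direct check on schedule selection.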

Circularity Check

0 steps flagged

No significant circularity detected


The abstract and context present DCC's speedups as outcomes of empirical hardware evaluation on HBM-PIM and AttAcc, with the performance prediction model described as a selection tool rather than a fitted input renamed as a prediction. No equations, self-definitions, or self-citation chains are shown that reduce claimed results to inputs by construction. The interdependence claim and joint optimization are presented as design rationale supported by evaluation, not tautological. This is the common case of a self-contained empirical systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that host and PIM data access patterns fundamentally conflict and that this conflict must be resolved through joint tuning rather than separate passes. No free parameters or invented entities are identifiable from the abstract.

axioms (1)

Domain assumption: Host processors require consecutive elements distributed across DRAM banks, while PIM cores require consecutive elements within their local banks, necessitating data rearrangements.
Stated directly in the abstract as the root cause of performance and programmability challenges.
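
The layout conflict in this axiom can be made concrete with a small sketch. It assumes a deliberately simplified address mapping that is not taken from the paper: the host view interleaves consecutive elements round-robin across banks, and the PIM view keeps each contiguous chunk inside one bank. `NUM_BANKS`, `host_bank`, `pim_bank`, and `elements_to_move` are hypothetical names.

```python
NUM_BANKS = 4  # hypothetical bank count; real devices have many more


def host_bank(i, num_banks=NUM_BANKS):
    """Host-friendly layout: consecutive elements land in different banks."""
    return i % num_banks


def pim_bank(i, n, num_banks=NUM_BANKS):
    """PIM-friendly layout: consecutive elements stay within one bank."""
    chunk = (n + num_banks - 1) // num_banks  # ceil(n / num_banks) elements per bank
    return i // chunk


def elements_to_move(n, num_banks=NUM_BANKS):
    """Indices whose bank differs between the two layouts."""
    return [i for i in range(n) if host_bank(i, num_banks) != pim_bank(i, n, num_banks)]


if __name__ == "__main__":
    n = 8
    print("host banks:", [host_bank(i) for i in range(n)])
    print("PIM banks: ", [pim_bank(i, n) for i in range(n)])
    print("must move: ", elements_to_move(n))
```

In this toy case most elements sit in a different bank under the two layouts, which gives a sense of why the cost of rearranging data is hard to ignore when choosing a compute schedule.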

pith-pipeline@v0.9.0 · 5913 in / 1320 out tokens · 55175 ms · 2026-05-25T07:31:23.093983+00:00 · methodology


Reference graph

Works this paper leans on

100 extracted references · 100 canonical work pages

[1]

Deep Learning for Finance: Deep Portfolios,

J. B. Heaton, N. G. Polson, and J. H. Witte, “Deep Learning for Finance: Deep Portfolios,”Appl. Stoch. Models Bus. Ind., 2017

work page 2017
[2]

Deep Learning for Financial Applications: A Survey,

A. M. Ozbayoglu, M. U. Gudelek, and O. B. Sezer, “Deep Learning for Financial Applications: A Survey,”Appl. Soft Comput., 2020

work page 2020
[3]

Artificial Intelligence in Retail: The AI-Enabled Value Chain,

K. Oosthuizen, E. Botha, J. Robertson, and M. Montecchi, “Artificial Intelligence in Retail: The AI-Enabled Value Chain,”Australas. Mark. J., 2021

work page 2021
[4]

Deep Learning for Health Informatics,

D. Rav `ı, C. Wong, F. Deligianni, M. Berthelot, J. Andreu-Perez, B. Lo, and G.-Z. Yang, “Deep Learning for Health Informatics,”IEEE JBHI, 2016

work page 2016
[5]

Deep Learning for Healthcare: Review, Opportunities, and Challenges,

R. Miotto, F. Wang, S. Wang, X. Jiang, and J. T. Dudley, “Deep Learning for Healthcare: Review, Opportunities, and Challenges,”Brief. Bioinform., 2018

work page 2018
[6]

Perception and Navigation in Autonomous Systems in the Era of Learning: A Survey,

Y . Tang, C. Zhao, J. Wang, C. Zhang, Q. Sun, W. X. Zheng, W. Du, F. Qian, and J. Kurths, “Perception and Navigation in Autonomous Systems in the Era of Learning: A Survey,”IEEE TNNLS, 2022

work page 2022
[7]

The Cityscapes Dataset for Semantic Urban Scene Understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” inCVPR, 2016

work page 2016
[8]

{DistServe}: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving,

Y . Zhong, S. Liu, J. Chen, J. Hu, Y . Zhu, X. Liu, X. Jin, and H. Zhang, “{DistServe}: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving,” inOSDI, 2024

work page 2024
[9]

Seesaw: High-Throughput LLM Inference via Model Re-Sharding,

Q. Su, W. Zhao, X. Li, M. Andoorveedu, C. Jiang, Z. Zhu, K. Song, C. Giannoula, and G. Pekhimenko, “Seesaw: High-Throughput LLM Inference via Model Re-Sharding,”MLSys, 2025

work page 2025
[10]

Syncron: Efficient Synchronization Support for Near-Data- Processing Architectures,

C. Giannoula, N. Vijaykumar, N. Papadopoulou, V . Karakostas, I. Fer- nandez, J. G ´omez-Luna, L. Orosa, N. Koziris, G. Goumas, and O. Mutlu, “Syncron: Efficient Synchronization Support for Near-Data- Processing Architectures,” inHPCA, 2021

work page 2021
[11]

Evaluating Machine Learning Workloads on Memory-Centric Computing Systems,

J. G ´omez-Luna, Y . Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu, “Evaluating Machine Learning Workloads on Memory-Centric Computing Systems,” inISPASS, 2023

work page 2023
[12]

Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System,

J. G ´omez-Luna, I. E. Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System,”IEEE Access, 2022

work page 2022
[13]

PyGim: An Efficient Graph Neural Network Library for Real Processing-In- Memory Architectures,

C. Giannoula, P. Yang, I. Fernandez, J. Yang, S. Durvasula, Y . X. Li, M. Sadrosadati, J. G. Luna, O. Mutlu, and G. Pekhimenko, “PyGim: An Efficient Graph Neural Network Library for Real Processing-In- Memory Architectures,”POMACS, 2025

work page 2025
[14]

PIM-DL: Expanding the Applicability of Commodity DRAM- PIMs for Deep Learning via Algorithm–System Co-Optimization,

C. Li, Z. Zhou, Y . Wang, F. Yang, T. Cao, M. Yang, Y . Liang, and G. Sun, “PIM-DL: Expanding the Applicability of Commodity DRAM- PIMs for Deep Learning via Algorithm–System Co-Optimization,” in ASPLOS, 2024

work page 2024
[15]

A Modern Primer on Processing in Memory,

O. Mutlu, S. Ghose, J. G ´omez-Luna, and R. Ausavarungnirun, “A Modern Primer on Processing in Memory,” inEmerging Computing: From Devices to Systems - Looking Beyond Moore and Von Neumann, 2021

work page 2021
[16]

A 1ynm 1.25 V 8GB, 16GB/s/Pin GDDR6- Based Accelerator-in-Memory Supporting 1Tflops MAC Operation and Various Activation Functions for Deep-Learning Applications,

S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kimet al., “A 1ynm 1.25 V 8GB, 16GB/s/Pin GDDR6- Based Accelerator-in-Memory Supporting 1Tflops MAC Operation and Various Activation Functions for Deep-Learning Applications,” in ISSCC, 2022

work page 2022
[17]

Newton: A DRAM-Maker’s Accelerator-in- Memory (AiM) Architecture for Machine Learning,

M. He, C. Song, I. Kim, C. Jeong, S. Kim, I. Park, M. Thottethodi, and T. N. Vijaykumar, “Newton: A DRAM-Maker’s Accelerator-in- Memory (AiM) Architecture for Machine Learning,” inMICRO, 2020

work page 2020
[18]

Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product,

S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology: Industrial Product,” in ISCA, 2021

work page 2021
[19]

The True Processing-in-Memory Accelerator,

F. Devaux, “The True Processing-in-Memory Accelerator,” inHCS, 2019

work page 2019
[20]

AttAcc!: Unleashing the Power of PIM for Batched Transformer- Based Generative Model Inference,

J. Park, J. Choi, K. Kyung, M. J. Kim, Y . Kwon, N. S. Kim, and J. H. Ahn, “AttAcc!: Unleashing the Power of PIM for Batched Transformer- Based Generative Model Inference,” inASPLOS, 2024

work page 2024
[21]

Cost-effective Extension of DRAM-PIM for Group-wise LLM Quantization,

B. Kim, C. Lee, G. Kim, and E. Park, “Cost-effective Extension of DRAM-PIM for Group-wise LLM Quantization,”IEEE CAL, 2025

work page 2025
[22]

BlockPIM: Optimizing Memory Management for PIM-enabled Long-Context LLM Inference,

Z. Li, J. Zhou, X. Li, and N. Sun, “BlockPIM: Optimizing Memory Management for PIM-enabled Long-Context LLM Inference,” inDAC, 2025

work page 2025
[23]

McPAL: Scaling Un- structured Sparse Inference with Multi-Chiplet HBM-PIM Architecture for LLMs,

S. Liu, Z. Huang, J. Yu, Q. Liu, and C. Chen, “McPAL: Scaling Un- structured Sparse Inference with Multi-Chiplet HBM-PIM Architecture for LLMs,” inDAC, 2025

work page 2025
[24]

AttenPIM: Accelerating LLM Attention with Dual-mode GEMV in Processing-in-Memory,

L. Chen, D. Lyu, Z. Li, J. Jiang, Q. Wang, Z. Mao, and N. Jing, “AttenPIM: Accelerating LLM Attention with Dual-mode GEMV in Processing-in-Memory,” inDAC, 2025

work page 2025
[25]

PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding With a Processing-in-Memory- Enabled Computing System,

Y . He, H. Mao, C. Giannoula, M. Sadrosadati, J. G ´omez-Luna, H. Li, X. Li, Y . Wang, and O. Mutlu, “PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding With a Processing-in-Memory- Enabled Computing System,” inASPLOS, 2025

work page 2025
[26]

SpecPIM: Accelerating Speculative Inference on PIM-Enabled Systems via Ar- chitecture–Dataflow Co-Exploration,

C. Li, Z. Zhou, S. Zheng, J. Zhang, Y . Liang, and G. Sun, “SpecPIM: Accelerating Speculative Inference on PIM-Enabled Systems via Ar- chitecture–Dataflow Co-Exploration,” inASPLOS, 2024

work page 2024
[27]

PIM-GPT: A Hybrid Processing-in- Memory Accelerator for Autoregressive Transformers,

Y . Wu, Z. Wang, and W. D. Lu, “PIM-GPT: A Hybrid Processing-in- Memory Accelerator for Autoregressive Transformers,”npj Unconven- tional Computing, 2024

work page 2024
[28]

PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM,

H. Lee, D. Baek, J. Son, J. Choi, K. Moon, and M. Jang, “PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM,” inHPCA, 2025

work page 2025
[29]

Pimba: A Processing- in-Memory Acceleration for Post-Transformer Large Language Model Serving,

W. Kim, Y . Lee, Y . Kim, J. Hwang, S. Oh, J. Jung, A. Huseynov, W. G. Park, C. H. Park, D. Mahajan, and J. Park, “Pimba: A Processing- in-Memory Acceleration for Post-Transformer Large Language Model Serving,” inMICRO, 2025

work page 2025
[30]

LongSight: Compute-Enabled Memory to Accelerate Large-Context LLMs via Sparse Attention,

D. Quinn, E. E. Y ¨ucel, J. Kim, J. F. Mart ´ınez, and M. Alian, “LongSight: Compute-Enabled Memory to Accelerate Large-Context LLMs via Sparse Attention,” inMICRO, 2025

work page 2025
[31]

ATiM: Autotuning Tensor Programs for Processing-in-DRAM,

Y . Shin, D. Kang, and H. Sung, “ATiM: Autotuning Tensor Programs for Processing-in-DRAM,” inISCA, 2025

work page 2025
[32]

SimplePIM: A Software Framework for Productive and Efficient Programming of Real PIM Systems,

J. Chen, J. G ´omez-Luna, I. E. Hajj, Y . Guo, and O. Mutlu, “SimplePIM: A Software Framework for Productive and Efficient Programming of Real PIM Systems,” inPACT, 2023

work page 2023
[33]

CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms,

A. A. Khan, H. Farzaneh, K. F. A. Friebel, C. Fournier, L. Chelini, and J. Castrillon, “CINM (Cinnamon): A Compilation Infrastructure for Heterogeneous Compute In-Memory and Compute Near-Memory Paradigms,” inASPLOS, 2024. 12

work page 2024
[34]

PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM,

Y . Shin, J. Park, S. Cho, and H. Sung, “PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM,” inCGO, 2023

work page 2023
[35]

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,

T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y . Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning,” inOSDI, 2018

work page 2018
[36]

TensorIR: An Abstraction for Automatic Tensorized Program Optimization,

S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C. H. Yu, Y . Yuet al., “TensorIR: An Abstraction for Automatic Tensorized Program Optimization,” inASPLOS, 2023

work page 2023
[37]

Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs,

Y . Ding, C. H. Yu, B. Zheng, Y . Liu, Y . Wang, and G. Pekhimenko, “Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs,” inASPLOS, 2023

work page 2023
[38]

Bolt: Bridging the Gap Between Auto-Tuners and Hardware-Native Perfor- mance,

J. Xing, L. Wang, S. Zhang, J. Chen, A. Chen, and Y . Zhu, “Bolt: Bridging the Gap Between Auto-Tuners and Hardware-Native Perfor- mance,”MLSys, 2022

work page 2022
[39]

SPLAT: A Framework for Optimised GPU Code-Generation for SParse reguLar ATtention,

A. Gupta, Y . Yuan, D. Jain, Y . Ge, D. Aponte, Y . Zhou, and C. Mendis, “SPLAT: A Framework for Optimised GPU Code-Generation for SParse reguLar ATtention,”Proc. ACM Program. Lang., 2025

work page 2025
[40]

CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels,

F. Liu, S. Huang, N. Yang, Z. Wang, H. Li, and L. Jiang, “CROSS: Compiler-Driven Optimization of Sparse DNNs Using Sparse/Dense Computation Kernels,” inHPCA, 2025

work page 2025
[41]

Finch: Sparse and Structured Tensor Programming with Control Flow,

W. Ahrens, T. F. Collin, R. Patel, K. Deeds, C. Hong, and S. Amaras- inghe, “Finch: Sparse and Structured Tensor Programming with Control Flow,”PACMPL, 2025

work page 2025
[42]

SRSparse: Gen- erating Codes for High-Performance Sparse Matrix-Vector Semiring Computations,

Z. Du, Y . Liu, N. Sun, H. Cui, X. Feng, and J. Li, “SRSparse: Gen- erating Codes for High-Performance Sparse Matrix-Vector Semiring Computations,”TACO, 2025

work page 2025
[43]

Unified Convolution Framework: A Compiler-Based Approach to Support Sparse Convolutions,

J. Won, C. Hong, C. Mendis, J. Emer, and S. Amarasinghe, “Unified Convolution Framework: A Compiler-Based Approach to Support Sparse Convolutions,”MLSys, 2023

work page 2023
[44]

SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning,

Z. Ye, R. Lai, J. Shao, T. Chen, and L. Ceze, “SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning,” inASPLOS, 2023

work page 2023
[45]

The Tensor Algebra Compiler,

F. Kjolstad, S. Kamil, S. Chou, D. Lugato, and S. Amarasinghe, “The Tensor Algebra Compiler,”Proc. ACM Program. Lang., no. OOPSLA, 2017

work page 2017
[46]

Ansor: Generating{High- Performance}Tensor Programs for Deep Learning,

L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y . Wang, J. Yang, D. Zhuo, K. Senet al., “Ansor: Generating{High- Performance}Tensor Programs for Deep Learning,” inOSDI, 2020

work page 2020
[47]

Learning to Optimize Tensor Programs,

T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Learning to Optimize Tensor Programs,” NeurIPS, vol. 31, 2018

work page 2018
[48]

XGBoost: A Scalable Tree Boosting System,

T. Chen and C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” inKDD, 2016

work page 2016
[49]

Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator,

H. Luo, Y . C. Tu ˘grul, F. N. Bostancı, A. Olgun, A. G. Ya ˘glıkc ¸ı, , and O. Mutlu, “Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator,” 2023

work page 2023
[50]

Op- tiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming,

J. Liu, M. Zhou, Y . Pan, C.-Y . Yang, L. Josipovi´c, and T. Rosing, “Op- tiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming,” inISCA, 2025

work page 2025
[51]

Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing,

A. Khadem, D. Fujiki, H. Chen, Y . Gu, N. Talati, S. Mahlke, and R. Das, “Multi-Dimensional Vector ISA Extension for Mobile In-Cache Computing,” inHPCA, 2025

work page 2025
[52]

TC-CIM: Empowering Tensor Compre- hensions for Computing-In-Memory,

A. Drebes, L. Chelini, O. Zinenko, A. Cohen, H. Corporaal, T. Grosser, K. Vadivel, and N. Vasilache, “TC-CIM: Empowering Tensor Compre- hensions for Computing-In-Memory,” inIMPACT, 2020

work page 2020
[53]

HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing,

M. Rhee, J. Sim, T. Ahn, S. Lee, D. Yoon, E. Kim, K. Park, Y . Joo, and H. Kim, “HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing,”arXiv preprint arXiv:2504.16112, 2025

work page arXiv 2025
[54]

TensorDIMM: A Practical Near- Memory Processing Architecture for Embeddings and Tensor Oper- ations in Deep Learning,

Y . Kwon, Y . Lee, and M. Rhu, “TensorDIMM: A Practical Near- Memory Processing Architecture for Embeddings and Tensor Oper- ations in Deep Learning,” inMICRO, 2019

work page 2019
[55] L. Liu, S. Zhao, B. Li, H. Ren, Z. Xu, M. Wang, X. Li, Y. Han, and Y. Wang, “Make LLM Inference Affordable to Everyone: Augmenting GPU Memory With NDP-DIMM,” in HPCA, 2025

[56] Y. Gu, A. Khadem, S. Umesh, N. Liang, X. Servot, O. Mutlu, R. Iyer, and R. Das, “PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference,” in ASPLOS, 2025

[57] J.-W. Jang, J. Oh, Y. Kong, J.-Y. Hong, S.-H. Cho, J. Lee, H. Yang, and J.-S. Yang, “Accelerating Retrieval Augmented Language Model via PIM and PNM Integration,” in MICRO, 2025

[58] D. Quinn, E. E. Yücel, M. Prammer, Z. Fan, K. Skadron, J. M. Patel, J. F. Martínez, and M. Alian, “DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign,” in ISCA, 2025

[59] D. Kim, J.-Y. Kim, W. Han, J. Won, H. Choi, Y. Kwon, and J.-Y. Kim, “Darwin: A DRAM-Based Multi-Level Processing-in-Memory Architecture for Column-Oriented Database,” IEEE TETC, 2024

[60] S. He, Z. Zhu, Y. He, and T. Jia, “LP-Spec: Leveraging LPDDR PIM for Efficient LLM Mobile Speculative Inference with Architecture-Dataflow Co-Optimization,” arXiv preprint arXiv:2508.07227, 2025

[61] S. Han, B. Yoon, G. Park, C. Song, D. Kim, and J.-J. Kim, “Near-Memory LLM Inference Processor based on 3D DRAM-to-logic Hybrid Bonding,” in DAC, 2025

[62] L. Wu, H. Zhu, S. He, X. Lin, X. Zeng, and C. Chen, “PIMoE: Towards efficient MoE transformer deployment on NPU-PIM system through throttle-aware task offloading,” in DAC, 2025

[63] M. Seo, X. T. Nguyen, S. J. Hwang, Y. Kwon, G. Kim, C. Park, I. Kim, J. Park, J. Kim, W. Shin et al., “IANUS: Integrated Accelerator Based on an NPU-PIM Unified Memory System,” in ASPLOS, 2024

[64] G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, and J. Park, “NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing,” in ASPLOS, 2024

[65] R. Chen, Z. Song, Y. Zheng, Z. Zhu, G. Li, N. Jing, X. Liang, and H. Guan, “Heat: Npu-ndp heterogeneous architecture for transformer-empowered graph neural networks,” in MICRO, 2025

[66] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing,” in ISCA, 2015

[67] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and H. Yang, “GraphH: A Processing-in-Memory Architecture for Large-Scale Graph Processing,” IEEE TCAD, 2018

[68] T. Kang, G. Choi, T. Suh, and G. Koo, “SparsePIM: An Efficient HBM-Based PIM Architecture for Sparse Matrix-Vector Multiplications,” in ICS, 2025

[69] C. Giannoula, I. Fernandez, J. G. Luna, N. Koziris, G. Goumas, and O. Mutlu, “SparseP: Towards Efficient Sparse Matrix–Vector Multiplication on Real Processing-in-Memory Architectures,” POMACS, 2022

[71] Z. Zhou, C. Li, X. Wei, X. Wang, and G. Sun, “Gnnear: Accelerating Full-Batch Training of Graph Neural Networks with Near-Memory Processing,” in PACT, 2022

[72] T. Tian, X. Wang, L. Zhao, W. Wu, X. Zhang, F. Lu, T. Wang, and X. Jin, “G-NMP: Accelerating Graph Neural Networks with DIMM-Based Near-Memory Processing,” JSA, 2022

[73] S. Yun, H. Nam, J. Park, B. Kim, J. H. Ahn, and E. Lee, “GraNDe: Efficient Near-Data Processing Architecture for Graph Neural Networks,” IEEE TC, 2023

[74] D. Chen, H. He, H. Jin, L. Zheng, Y. Huang, X. Shen, and X. Liao, “MetaNMP: Leveraging Cartesian-Like Product to Accelerate HGNNs with Near-Memory Processing,” in ISCA, 2023

[75] M. Saed, P. J. Nair, and T. M. Aamodt, “RayN: Ray Tracing Acceleration with Near-memory Computing,” in MICRO, 2025

[76] L. Yan, M. Zhang, R. Wang, X. Chen, X. Zou, X. Lu, Y. Han, and X.-H. Sun, “CoPIM: A Concurrency-Aware PIM Workload Offloading Architecture for Graph Applications,” in ISLPED, 2021

[77] C. Giannoula, I. Fernandez, J. G. Luna, N. Koziris, G. Goumas, and O. Mutlu, “SparseP: Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-in-Memory Architectures,” POMACS, 2022

[78] S. Diab, A. Nassereldine, M. Alser, J. Gómez Luna, O. Mutlu, and I. El Hajj, “A Framework for High-Throughput Sequence Alignment Using Real Processing-in-Memory Systems,” Bioinformatics, 2023

[79] C. Lim, S. Lee, J. Choi, J. Lee, S. Park, H. Kim, J. Lee, and Y. Kim, “Design and Analysis of a Processing-in-DIMM Join Algorithm: A Case Study with UPMEM DIMMs,” PACMMOD, 2023

[80] M. Item, G. F. Oliveira, J. Gómez-Luna, M. Sadrosadati, Y. Guo, and O. Mutlu, “TransPimLib: Efficient Transcendental Functions for Processing-in-Memory Systems,” in ISPASS, 2023

[81] P. Das, P. R. Sutradhar, M. Indovina, S. M. P. Dinakarrao, and A. Ganguly, “Implementation and Evaluation of Deep Neural Networks in Commercially Available Processing in Memory Hardware,” in SOCC, 2022
