pith. sign in

arxiv: 2606.30553 · v1 · pith:7P4GGESRnew · submitted 2026-06-29 · 💻 cs.AR · cs.DC

COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices

Pith reviewed 2026-06-30 03:05 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords processing-in-memorymobile devicesscheduling frameworkconcurrent executionLLM inferenceDRAM accessbank conflictsthroughput optimization
0
0 comments X

The pith

COSM enables concurrent PIM and CPU execution on mobile devices by scheduling PIM tasks into CPU idle periods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents COSM to allow processing-in-memory and CPU tasks to share DRAM on phones without major interference during LLM runs. Bank conflicts and bus congestion normally reduce the gains from PIM when both access memory at once. COSM introduces a control interface that issues many PIM commands without stopping CPU accesses and a scheduler that inserts those commands into gaps in the CPU sequence. This approach hides PIM latency and overlaps its work with data movement. Readers would care because on-device AI requires both privacy and speed yet faces tight memory constraints on existing hardware.

Core claim

The central claim is that a low-interference PIM control interface together with an idleness-aware scheduling method can integrate PIM commands into CPU access sequences on mobile platforms, enabling concurrent execution that improves PIM throughput by up to 2.8x while limiting CPU performance loss to under 2% during tests with LLMs and mobile workloads.

What carries the argument

The idleness-aware scheduling method that places PIM commands into available idle time windows within the CPU access sequence, supported by a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses.

If this is right

  • PIM execution latency becomes hidden from the CPU view.
  • PIM work overlaps with ongoing data transfer operations.
  • The approach requires no hardware modifications to existing mobile DRAM.
  • The gains apply across LLMs paired with both mobile applications and compute kernels.
  • Throughput improves while CPU performance remains nearly unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same idle-window insertion idea could extend to other memory-bound tasks on edge devices beyond language models.
  • Software scheduling alone might reduce the need for dedicated PIM hardware in memory-constrained systems.
  • Validation on varied DRAM bank organizations would test how well idleness detection holds across device models.

Load-bearing premise

That the low-interference interface and idleness-aware scheduling can reliably avoid or mitigate bank conflicts and bus congestion in real mobile hardware without creating new bottlenecks.

What would settle it

A measurement on physical mobile hardware showing that concurrent LLM and app execution still produces measurable bus congestion or CPU slowdown above 2% even when the COSM scheduler is active.

Figures

Figures reproduced from arXiv: 2606.30553 by Fangxin Liu, Haibing Guan, Jian Liu, Li Jiang, Mingyu Gao, Onur Mutlu, Yilong Zhao.

Figure 1
Figure 1. Figure 1: Internal/external DRAM bandwidth utilization of CPU and PIM work [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a) CPU workload performance under injected read latency. (b) PIM [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (a) An example of FR-FCFS scheduling for a CPU-only workload and [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: COSM’s memory controller architecture and memory interface [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Timing diagram of PIM unit, memory bus, Command Arbiter, and [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The data, command, and address path of (a) [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: (a) Conventional software-controlled three-stage sequential scheduling. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Overall PIM & CPU performance of COSM and baselines for concurrent CPU and PIM execution. [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Normalized CPU and PIM workload performance under fixed-length and preemptable PIM execution command. Cases where CPU performance degrades [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Impact of CPU-mediated data transfers on CPU performance under different scheduling strategies. We test on the attention layers of the benchmarks. [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: (a) PIM performance and (b) Internal bandwidth usage during concur [PITH_FULL_IMAGE:figures/full_fig_p012_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: PIM workload energy consumption per token (including PIM unit computation, PIM bank access, and CPU-mediated data transfer) of COSM and baselines [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Sensitivity analysis on (a) nP T L (b) rank count per channel (c) KV cache size (d) scaled tRP (n×, relative to [PITH_FULL_IMAGE:figures/full_fig_p013_13.png] view at source ↗
read the original abstract

The development of on-device large language models (LLMs) is driven by the need for privacy and fast response times. Energy-intensive data transfer on mobile devices makes Processing-in-Memory (PIM) an effective solution. Due to stringent DRAM cost constraints, limited physical footprint on circuit boards, and the interaction between applications and LLMs, it is imperative for the CPU and PIM to operate concurrently within a shared memory space. However, challenges such as bank conflicts and bus congestion can arise, potentially diminishing the performance and energy benefits of PIM. To address this challenge, we introduce COSM, a cooperative scheduling framework designed to facilitate the concurrent operation of PIM and CPU tasks on mobile platforms. Our key innovations include: 1) a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses; 2) an idleness-aware scheduling method that integrates PIM commands into available idle time windows within the CPU's access sequence. COSM not only hides PIM execution latency from the CPU, but also overlaps PIM execution with data transfer. Experiments on concurrent execution of LLMs and mobile workloads, including mobile applications and compute-intensive kernels, demonstrate that COSM improves PIM throughput by up to 2.8x compared to the baseline scheduling method with less than 2.0% CPU performance loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes COSM, a cooperative scheduling framework for concurrent PIM and CPU execution on mobile devices to support on-device LLMs. It introduces two key innovations: (1) a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses, and (2) an idleness-aware scheduling method that integrates PIM commands into idle time windows in the CPU's access sequence. The framework is claimed to hide PIM latency from the CPU and overlap PIM execution with data transfer. Experiments on concurrent LLMs and mobile workloads (applications and compute-intensive kernels) report up to 2.8x PIM throughput improvement over a baseline scheduling method with less than 2.0% CPU performance loss.

Significance. If the central claims hold under realistic evaluation on unmodified mobile hardware, the work would be significant for enabling practical PIM deployment in mobile SoCs. It directly addresses energy costs of data movement for on-device LLMs by allowing concurrent PIM/CPU operation in a shared memory space without requiring hardware modifications, which is a key constraint for mobile platforms.

major comments (2)
  1. [Abstract / Proposed mechanism] The central feasibility claim—that the low-interference PIM control interface and idleness-aware scheduler can safely interleave PIM commands with CPU traffic on unmodified commodity LPDDR memory controllers without new contention, bank conflicts, or bus congestion—lacks a concrete mechanism. The abstract states these are 'software innovations' but provides no description of how PIM commands are issued, recognized, or routed separately while preserving CPU access ordering (e.g., via existing command queues or without controller extensions). This is load-bearing for the no-hardware-change assertion and the reported 2.8x throughput gain.
  2. [Experiments / Evaluation] The experimental results (2.8x PIM throughput, <2% CPU loss) cannot be assessed for validity because the abstract supplies no methodology details: no description of the simulation or hardware platform, baseline scheduling method, workload specifics (LLM models, mobile apps, kernels), error bars, or whether the evaluation models real unmodified controllers versus an idealized PIM-capable controller. This directly impacts whether the gains demonstrate feasibility on actual mobile hardware.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight areas where the abstract could better convey the paper's contributions. The full manuscript already contains the requested technical details in dedicated sections, but we are happy to revise the abstract and add cross-references to improve clarity. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract / Proposed mechanism] The central feasibility claim—that the low-interference PIM control interface and idleness-aware scheduler can safely interleave PIM commands with CPU traffic on unmodified commodity LPDDR memory controllers without new contention, bank conflicts, or bus congestion—lacks a concrete mechanism. The abstract states these are 'software innovations' but provides no description of how PIM commands are issued, recognized, or routed separately while preserving CPU access ordering (e.g., via existing command queues or without controller extensions). This is load-bearing for the no-hardware-change assertion and the reported 2.8x throughput gain.

    Authors: Section 3.2 of the full manuscript describes the low-interference interface: PIM commands are generated in software by mapping to standard LPDDR command encodings and inserted into the memory controller's existing command queue during detected idle windows (via performance counter monitoring of bank and bus utilization). No controller extensions are used; ordering is preserved because PIM commands only occupy slots that the controller would otherwise leave empty, avoiding new bank conflicts or congestion. The idleness-aware scheduler (Section 4) explicitly tracks CPU access sequences to find these windows. We agree the abstract is too terse on this point and will expand it with a one-sentence mechanism summary plus a pointer to Section 3. revision: yes

  2. Referee: [Experiments / Evaluation] The experimental results (2.8x PIM throughput, <2% CPU loss) cannot be assessed for validity because the abstract supplies no methodology details: no description of the simulation or hardware platform, baseline scheduling method, workload specifics (LLM models, mobile apps, kernels), error bars, or whether the evaluation models real unmodified controllers versus an idealized PIM-capable controller. This directly impacts whether the gains demonstrate feasibility on actual mobile hardware.

    Authors: Section 5 of the manuscript details the evaluation: a cycle-accurate simulator modeling unmodified LPDDR5 controllers (no PIM extensions), baseline as a simple round-robin PIM/CPU interleaver, workloads including Llama-7B/13B inference, mobile apps (Chrome, YouTube), and kernels (GEMM, convolution), with results averaged over 10 runs and error bars shown. All experiments use the unmodified-controller model. We will add a concise methodology paragraph to the abstract and ensure the evaluation section is explicitly referenced there. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical systems proposal with no derivation chain

full rationale

The paper presents a scheduling framework (COSM) with two proposed mechanisms: a low-interference PIM control interface and an idleness-aware scheduler. Performance claims (up to 2.8x PIM throughput, <2% CPU loss) rest on experimental measurements of concurrent LLM and mobile workloads. No equations, mathematical derivations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the abstract or described structure. The work is self-contained against external benchmarks via simulation or hardware experiments; no reduction of outputs to inputs by construction exists.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the work is a systems scheduling proposal rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5787 in / 1105 out tokens · 55879 ms · 2026-06-30T03:05:01.588544+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

134 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue, ” arXiv e-prints, p. arXiv:2410.00037, Sep. 2024

  2. [2]

    MinMo: A Multimodal Large Language Model for Seamless Voice Interaction,

    Q. Chen, Y. Chen, Y. Chen, M. Chen, Y. Chen, C. Deng, Z. Du, R. Gao, C. Gao, Z. Gao, Y. Li, X. Lv, J. Liu, H. Luo, B. Ma, C. Ni, X. Shi, J. Tang, H. Wang, H. Wang, W. Wang, Y. Wang, Y. Xu, F. Yu, Z. Yan, Y. Yang, B. Yang, X. Yang, G. Yang, T. Zhao, Q. Zhang, S. Zhang, N. Zhao, P. Zhang, C. Zhang, and J. Zhou, “MinMo: A Multimodal Large Language Model for ...

  3. [3]

    V oila: V oice-language foundation models for real-time autonomous interaction and voice role-play.arXiv preprint arXiv:2505.02707, 2025

    Y. Shi, Y. Shu, S. Dong, G. Liu, J. Sesay, J. Li, and Z. Hu, “Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play, ” arXiv e-prints, p. arXiv:2505.02707, May 2025

  4. [4]

    Simultaneous Ma- chine Translation with Large Language Models,

    M. Wang, J. Zhao, T.-T. Vu, F. Shiri, E. Shareghi, and G. Haffari, “Simultaneous Ma- chine Translation with Large Language Models, ”arXiv e-prints, p. arXiv:2309.06706, Sep. 2025

  5. [5]

    LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline,

    B. Fu, M. Liao, K. Fan, C. Li, L. Zhang, Y. Chen, and X. Shi, “LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline, ”arXiv e-prints, p. arXiv:2504.09570, Apr. 2025

  6. [6]

    MediaPipe: A Framework for Building Perception Pipelines

    C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee, W.-T. Chang, W. Hua, M. Georg, and M. Grundmann, “MediaPipe: A Framework for Building Perception Pipelines, ”arXiv e-prints, p. arXiv:1906.08172, Jun. 2019

  7. [7]

    MotionBridge: Dynamic Video Inbetweening with Flexible Controls,

    M. Tanveer, Y. Zhou, S. Niklaus, A. Mahdavi Amiri, H. Zhang, K. K. Singh, and N. Zhao, “MotionBridge: Dynamic Video Inbetweening with Flexible Controls, ” arXiv e-prints, p. arXiv:2412.13190, Dec. 2024

  8. [8]

    Apple Intelligence,

    Apple Inc., “Apple Intelligence, ” https://www.apple.com/apple-intelligence/, 2024, accessed: 2025-11-05

  9. [9]

    Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition,

    H. Chen, Y. Wang, K. Han, D. Li, L. Li, Z. Bi, J. Li, H. Wang, F. Mi, M. Zhu, B. Wang, K. Song, Y. Fu, X. He, Y. Luo, C. Zhu, Q. He, X. Wu, W. He, H. Hu, Y. Tang, D. Tao, X. Chen, and Y. Wang, “Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition, ”arXiv e-prints, p. arXiv:2505.22375, May 2025

  10. [10]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone,

    Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. Heet al., “MiniCPM-V: A GPT-4V Level MLLM on Your Phone, ”Nat Commun 16, 5509 (2025), 2025

  11. [11]

    Samsung Electronics,Galaxy.ai - The #1 All-in-One AI Platform, https://galaxy.ai/, 2024, accessed: 2025-11-05

  12. [12]

    BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices,

    X. Lu, Y. Chen, C. Chen, H. Tan, B. Chen, Y. Xie, R. Hu, G. Tan, R. Wu, Y. Hu, Y. Zeng, L. Wu, L. Bian, Z. Wang, L. Liu, Y. Yang, H. Xiao, A. Zhou, Y. Wen, X. Chen, S. Ren, and H. Li, “BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices, ” inProceedings of the IEEE/CVF Conference on Computer Vision and Patter...

  13. [13]

    Unlocking On-Device Generative AI with an NPU and Heterogeneous Computing,

    Qualcomm Technologies, Inc., “Unlocking On-Device Generative AI with an NPU and Heterogeneous Computing, ” Qualcomm, Tech. Rep., 2024, accessed: 2025-11-15

  14. [14]

    vivo Unveils New AI Strategy: BlueHeart Large Model Matrix and Major Upgrade to OriginOS 5,

    vivo, “vivo Unveils New AI Strategy: BlueHeart Large Model Matrix and Major Upgrade to OriginOS 5, ” https://www.vivo.com.cn/brand/news/detail?id=1271, 2024, accessed: 2025-11-15

  15. [15]

    The true Processing In Memory accelerator,

    F. Devaux, “The true Processing In Memory accelerator, ” in2019 IEEE Hot Chips 31 Symposium (HCS), Aug 2019, pp. 1–24

  16. [16]

    A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep- Learning Applications,

    S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho, I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and J. Cho, “A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC...

  17. [17]

    PUSHtap: PIM-based In-Memory HTAP with Unified Data Storage Format,

    Y. Zhao, M. Gao, H. Zhang, F. Liu, G. Chen, H. Xian, H. Guan, and L. Jiang, “PUSHtap: PIM-based In-Memory HTAP with Unified Data Storage Format, ” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’25, 2025, pp. 179–194

  18. [18]

    UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space,

    Y. Zhao, M. Gao, F. Liu, Y. Hu, Z. Wang, H. Lin, J. Li, H. Xian, H. Dong, T. Yang, N. Jing, X. Liang, and L. Jiang, “UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space, ” in2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), June 2024, pp. 644–659

  19. [19]

    PAPI: Exploiting Dynamic Parallelism in Large Language Model De- coding with a Processing-In-Memory-Enabled Computing System,

    Y. He, H. Mao, C. Giannoula, M. Sadrosadati, J. Gómez-Luna, H. Li, X. Li, Y. Wang, and O. Mutlu, “PAPI: Exploiting Dynamic Parallelism in Large Language Model De- coding with a Processing-In-Memory-Enabled Computing System, ” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vo...

  20. [20]

    NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing,

    G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, and J. Park, “NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing, ” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24, 2024, pp. 722–737

  21. [21]

    Unifying Two Operators with One PIM: Leveraging Hybrid Bonding for Efficient LLM Inference,

    J. Chen, Y. Qi, K. Sun, Z. Lin, T. Wang, C. Ma, and Y. Wang, “Unifying Two Operators with One PIM: Leveraging Hybrid Bonding for Efficient LLM Inference, ” inAdvanced Parallel Processing Technologies, C. Li, X. Qian, D. Gizopoulos, and B. Grot, Eds., 2026, pp. 215–230

  22. [22]

    Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving,

    W. Kim, Y. Lee, Y. Kim, J. Hwang, S. Oh, J. Jung, A. Huseynov, W. G. Park, C. H. Park, D. Mahajan, and J. Park, “Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving, ” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25, 2025, pp. 292–307

  23. [23]

    AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference,

    J. Park, J. Choi, K. Kyung, M. J. Kim, Y. Kwon, N. S. Kim, and J. H. Ahn, “AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference, ” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’24, 2024, p. 103119

  24. [24]

    Pyramid: Accelerating LLM Infer- ence With Cross-Level Processing-in-Memory,

    L. Yan, X. Lu, X. Chen, Y. Han, and X.-H. Sun, “Pyramid: Accelerating LLM Infer- ence With Cross-Level Processing-in-Memory, ”IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 121–124, Jan 2025

  25. [25]

    A scalable processing-in-memory ac- celerator for parallel graph processing,

    J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory ac- celerator for parallel graph processing, ” in2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 105–117

  26. [26]

    PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,

    J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture, ” in2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 336–348

  27. [27]

    Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks,

    A. Boroumand, S. Ghose, B. Akin, R. Narayanaswami, G. F. Oliveira, X. Ma, E. Shiu, and O. Mutlu, “Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks, ” in2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), Sep. 2021, pp. 159–172

  28. [28]

    SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems,

    M. Besta, R. Kanakagiri, G. Kwasniewski, R. Ausavarungnirun, J. Beránek, K. Kanel- lopoulos, K. Janda, Z. Vonarburg-Shmaria, L. Gianinazzi, I. Stefan, J. G. Luna, J. Goli- nowski, M. Copik, L. Kapp-Schwoerer, S. Di Girolamo, N. Blach, M. Konieczny, O. Mutlu, and T. Hoefler, “SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-...

  29. [29]

    Google Workloads for Con- sumer Devices: Mitigating Data Movement Bottlenecks,

    A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google Workloads for Con- sumer Devices: Mitigating Data Movement Bottlenecks, ” inProceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLO...

  30. [30]

    PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference,

    Y. Gu, A. Khadem, S. Umesh, N. Liang, X. Servot, O. Mutlu, R. Iyer, and R. Das, “PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference, ” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25, 2025, pp. 862–881

  31. [31]

    Mutlu, S

    O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun,A Modern Primer on Processing in Memory, 2023, pp. 171–243

  32. [32]

    Processing data where it makes sense in modern computing systems: En- abling in-memory computation,

    O. Mutlu, “Processing data where it makes sense in modern computing systems: En- abling in-memory computation, ” in2018 7th Mediterranean Conference on Embedded Computing (MECO), June 2018, pp. 8–9

  33. [33]

    Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System,

    J. Gómez-Luna, I. E. Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System, ”IEEE Access, vol. 10, pp. 52 565–52 608, 2022

  34. [34]

    Evaluating Machine LearningWorkloads on Memory-Centric Com- puting Systems,

    J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu, “Evaluating Machine LearningWorkloads on Memory-Centric Com- puting Systems, ” in2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023, pp. 35–49

  35. [35]

    Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing- In-Memory Hardware,

    J. Gómez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing- In-Memory Hardware, ” in2021 12th International Green and Sustainable Computing Conference (IGSC), Oct 2021, pp. 1–7

  36. [36]

    Samsung PIM/PNM for Transfmer Based AI : Energy Efficiency on PIM/PNM Cluster,

    J. H. Kim, Y. Ro, J. So, S. Lee, S.-h. Kang, Y. Cho, H. Kim, B. Kim, K. Kim, S. Park, J.-S. Kim, S. Cha, W.-J. Lee, J. Jung, J.-G. Lee, J. Lee, J. Song, S. Lee, J. Cho, J. Yu, and K. Sohn, “Samsung PIM/PNM for Transfmer Based AI : Energy Efficiency on PIM/PNM Cluster, ” in2023 IEEE Hot Chips 35 Symposium (HCS), Aug 2023, pp. 1–31

  37. [37]

    PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems,

    D. Lee, B. Hyun, T. Kim, and M. Rhu, “PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems, ” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), Nov 2024, pp. 627–642

  38. [38]

    AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing,

    L. Chen, D. Lyu, J. Jiang, Q. Wang, Z. Mao, and N. Jing, “AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing, ” in2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2025, pp. 518–532

  39. [39]

    Near Data Acceleration with Concurrent Host Access,

    B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, “Near Data Acceleration with Concurrent Host Access, ” in2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), May 2020, pp. 818–831

  40. [40]

    Concurrent PIM and Load/Store Servicing in PIM-Enabled Memory,

    S. Gupta, N. Madan, S. Puthoor, N. Jayasena, and S. Dwarkadas, “Concurrent PIM and Load/Store Servicing in PIM-Enabled Memory, ” in2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), May 2025, pp. 320–334

  41. [41]

    ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration,

    S. Yu, H. Kim, K. Jeun, S. Hwang, S. Cho, and E. Lee, “ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration, ” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25, 2025, pp. 49–62

  42. [42]

    25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank- Level Parallelism, for Machine Learning Applications,

    Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, Y. Cho, J. G. Kim, J. Choi, H.-S. Shin, J. Kim, B. Phuah, H. Kim, M. J. Song, A. Choi, D. Kim, S. Kim, E.-B. Kim, D. Wang, S. Kang, Y. Ro, S. Seo, J. Song, J. Youn, K. Sohn, and N. S. Kim, “25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a...

  43. [43]

    JEDEC JESD209-5C: Low Power Double Data Rate 5 (LPDDR5),

    JEDEC, “JEDEC JESD209-5C: Low Power Double Data Rate 5 (LPDDR5), ” JEDEC Solid State Technology Association, Tech. Rep., 6 2023, revision of JESD209-5B (June 2021)

  44. [44]

    A case for exploiting subarray-level parallelism (SALP) in DRAM,

    Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A case for exploiting subarray-level parallelism (SALP) in DRAM, ” in2012 39th Annual International Symposium on Computer Architecture (ISCA), June 2012, pp. 368–379

  45. [45]

    Tiered-latency DRAM: A low latency and low cost DRAM architecture,

    D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency DRAM: A low latency and low cost DRAM architecture, ” in2013 IEEE 19th Interna- tional Symposium on High Performance Computer Architecture (HPCA), Feb 2013, pp. 615–626

  46. [46]

    PIM-AI: A Novel Architecture for High- Efficiency LLM Inference,

    C. Ortega, Y. Falevoz, and R. Ayrignac, “PIM-AI: A Novel Architecture for High- Efficiency LLM Inference, ”arXiv e-prints, p. arXiv:2411.17309, Nov. 2024

  47. [47]

    ALPHA-PIM: Analysis of Linear Algebraic Process- ing for High-Performance Graph Applications on a Real Processing-In-Memory Sys- tem,

    M. Barkhordar, A. Tabatabaeian, M. Sadrosadati, C. Giannoula, J. G. Luna, I. El Hajj, O. Mutlu, and A. R. Alameldeen, “ALPHA-PIM: Analysis of Linear Algebraic Process- ing for High-Performance Graph Applications on a Real Processing-In-Memory Sys- tem, ” in2025 IEEE International Symposium on Workload Characterization (IISWC), Oct 2025, pp. 257–271

  48. [48]

    PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures,

    C. Giannoula, P. Yang, I. Fernandez, J. Yang, S. Durvasula, Y. X. Li, M. Sadrosadati, J. G. Luna, O. Mutlu, and G. Pekhimenko, “PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures, ”Proc. ACM Meas. Anal. Comput. Syst., vol. 8, no. 3, Dec. 2024

  49. [49]

    SparseP: Efficient Sparse Matrix Vector Multiplication on Real Processing-In- Memory Architectures,

    C. Giannoula, I. Fernandez, J. Gómez-Luna, N. Koziris, G. Goumas, and O. Mutlu, “SparseP: Efficient Sparse Matrix Vector Multiplication on Real Processing-In- Memory Architectures, ” in2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2022, pp. 288–291

  50. [50]

    System Architecture and Software Stack for GDDR6-AiM,

    Y. Kwon, K. Vladimir, N. Kim, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, G. Kim, B. An, J. Kim, J. Lee, I. Kim, J. Park, C. Park, Y. Song, B. Yang, H. Lee, S. Kim, D. Kwon, S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, M. Lee, M. Shin, M. Shin, J. Cha, C. Jung, K. Chang, C. Jeong, E. Lim, I. Park, J. Chun, and...

  51. [51]

    Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product,

    S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product, ” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), June 2021, pp. 43–56

  52. [52]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Ding, H. Gao, H. Qu, H. Li, J. Gu...

  53. [53]

    Memory access scheduling,

    S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling, ” inProceedings of the 27th Annual International Symposium on Computer Architecture, ser. ISCA ’00, 2000, pp. 128–138

  54. [54]

    Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order,

    D. M. Zuravleff and J. I. Robinson, “Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order, ” May 1997, US Patent 5,630,096

  55. [55]

    Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,

    T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems, ” in16th USENIX Security Symposium (USENIX Security 07), Aug 2007

  56. [56]

    Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,

    O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors, ” in40th Annual IEEE/ACM International Symposium on Microar- chitecture (MICRO 2007), Dec 2007, pp. 146–160

  57. [57]

    Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,

    O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems, ” in2008 International Symposium on Computer Architecture, June 2008, pp. 63–74

  58. [58]

    The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost,

    L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost, ” in2014 IEEE 32nd International Conference on Computer Design (ICCD), Oct 2014, pp. 8–15

  59. [59]

    ATLAS: A scalable and high- performance scheduling algorithm for multiple memory controllers,

    Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A scalable and high- performance scheduling algorithm for multiple memory controllers, ” inHPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, Jan 2010, pp. 1–12

  60. [60]

    Ramu- lator 2.0: A Modern, Modular, and Extensible DRAM Simulator,

    H. Luo, Y. C. Tuğrul, F. N. Bostancı, A. Olgun, A. G. Yağlıkçı, and O. Mutlu, “Ramu- lator 2.0: A Modern, Modular, and Extensible DRAM Simulator, ”IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 112–116, Jan 2024

  61. [61]

    Ramulator V2.0a,

    CMU-SAFARI, “Ramulator V2.0a, ” https://github.com/CMU-SAFARI/ramulator2, 2023, accessed: 2025-11-05

  62. [62]

    Ramulator: A Fast and Extensible DRAM Simulator,

    Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator, ” IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016

  63. [63]

    DRAMPower 5: An Open-Source Power Simulator for Current Generation DRAM Standards,

    L. Steiner, T. Psota, M. Mörz, D. Christ, M. Jung, and N. Wehn, “DRAMPower 5: An Open-Source Power Simulator for Current Generation DRAM Standards, ” in Proceedings of the Rapid Simulation and Performance Evaluation for Design Workshop, ser. RAPIDO ’25, 2025, pp. 8–16

  64. [64]

    Snapdragon 888 5G Mobile Plat- form,

    Qualcomm Technologies, Inc., “Snapdragon 888 5G Mobile Plat- form, ” https://www.qualcomm.com/smartphones/products/8-series/ snapdragon-888-5g-mobile-platform, 2020, accessed: 2026-02-19

  65. [65]

    ZSim: fast and accurate microarchitectural simu- lation of thousand-core systems,

    D. Sanchez and C. Kozyrakis, “ZSim: fast and accurate microarchitectural simu- lation of thousand-core systems, ” inProceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA ’13, 2013, pp. 475–486

  66. [66]

    Xiaomi Mi 11 Pro - Technical Specifications,

    Xiaomi Corporation, “Xiaomi Mi 11 Pro - Technical Specifications, ” https://www. mi.com/mi11Pro/specs, 2021, accessed: 2026-02-19

  67. [67]

    Stalker — Frida Documentation,

    F. Developers, “Stalker — Frida Documentation, ” https://frida.re/docs/stalker/, 2025, accessed: 2025-11-15

  68. [68]

    Android 15,

    Google LLC, “Android 15, ” https://developer.android.google.cn/about/versions/15, 2024, accessed: 2025-11-05

  69. [69]

    G. Yeap, S. S. Lin, Y. M. Chen, H. L. Shang, P. W. Wang, H. C. Lin, Y. C. Peng, J. Y. Sheu, M. Wang, X. Chen, B. R. Yang, C. P. Lin, F. C. Yang, Y. K. Leung, D. W. Lin, C. P. Chen, K. F. Yu, D. H. Chen, C. Y. Chang, H. K. Chen, P. Hung, C. S. Hou, Y. K. Cheng, J. Chang, L. Yuan, C. K. Lin, C. C. Chen, Y. C. Yeo, M. H. Tsai, H. T. Lin, C. O. Chui, K. B. Hu...

  70. [70]

    Auto-tuning a high-level language targeted to GPU codes,

    S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, “Auto-tuning a high-level language targeted to GPU codes, ” in2012 Innovative Parallel Computing (InPar), May 2012, pp. 1–10

  71. [71]

    SPEC CPU2017,

    Standard Performance Evaluation Corporation (SPEC), “SPEC CPU2017, ” https: //www.spec.org/cpu2017/, 2017, accessed: 2025-11-15

  72. [72]

    BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model,

    BigScience, “BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model, ” https://huggingface.co/bigscience/bloom-1b1, 2022, accessed: 2025-11-05

  73. [73]

    Qwen2 Technical Report

    A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T...

  74. [74]

    FALA: Locality-Aware PIM- Host Cooperation for Graph Processing with Fine-Grained Column Access,

    C. Shin, J. Song, S. Na, J. Sung, H. Jang, and J. Lee, “FALA: Locality-Aware PIM- Host Cooperation for Graph Processing with Fine-Grained Column Access, ” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25, 2025, p. 15201534

  75. [75]

    Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,

    V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology, ” in2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2017, pp. 273–287

  76. [76]

    SIMDRAM: a framework for bit- serial SIMD processing using DRAM,

    N. Hajinazar, G. F. Oliveira, S. Gregorio, J. a. D. Ferreira, N. M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu, “SIMDRAM: a framework for bit- serial SIMD processing using DRAM, ” inProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’21, 2021, pp...

  77. [77]

    MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple- Instruction Multiple-Data Computing,

    G. F. Oliveira, A. Olgun, A. G. Yağlıkçı, F. N. Bostancı, J. Gómez-Luna, S. Ghose, and O. Mutlu, “MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple- Instruction Multiple-Data Computing, ” in2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), March 2...

  78. [78]

    Pro- teus: Achieving High-Performance Processing-Using-DRAM with Dynamic Bit- Precision, Adaptive Data Representation, and Flexible Arithmetic,

    G. F. de Oliveira Junior, M. Kabra, Y. Guo, K. Chen, A. G. Yaglikci, M. Soysal, M. Sadrosadati, J. Olivares Bueno, S. Ghose, J. Gómez-Luna, and O. Mutlu, “Pro- teus: Achieving High-Performance Processing-Using-DRAM with Dynamic Bit- Precision, Adaptive Data Representation, and Flexible Arithmetic, ” inProceedings of the 39th ACM International Conference o...

  79. [79]

    DRISA: A DRAM-based Reconfigurable In-Situ Accelerator,

    S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “DRISA: A DRAM-based Reconfigurable In-Situ Accelerator, ” in2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2017, pp. 288–301

  80. [80]

    Compute Caches,

    S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute Caches, ” in2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 481–492

Showing first 80 references.