
arxiv: 2606.30553 · v2 · pith:7P4GGESRnew · submitted 2026-06-29 · 💻 cs.AR · cs.DC

COSM: A Cooperative Scheduling Framework for Concurrent PIM and CPU Execution on Mobile Devices

Pith reviewed 2026-07-01 06:25 UTC · model grok-4.3

classification 💻 cs.AR cs.DC
keywords processing-in-memory · concurrent scheduling · mobile devices · large language models · DRAM access · PIM control interface · idleness-aware scheduling · throughput improvement

The pith

COSM lets PIM and CPU run concurrently on mobile devices by filling CPU idle times with PIM commands, achieving up to 2.8 times higher PIM throughput with less than 2 percent CPU slowdown.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that PIM can operate alongside the CPU in a shared memory space on mobile devices without major performance penalties. It proposes COSM, which pairs a control interface that avoids disrupting CPU memory accesses with a scheduler that issues PIM commands during CPU idle periods. This matters for on-device LLMs, where PIM can cut the energy cost of data movement but shared use risks bank conflicts and bus congestion that erode those benefits. If successful, the system lets both run at the same time, hiding PIM latency from the CPU and overlapping PIM execution with data transfer.

Core claim

The central discovery is that a low-interference PIM control interface, which produces the maximum number of PIM commands without affecting CPU memory accesses, combined with an idleness-aware scheduler that inserts PIM commands into idle windows in the CPU access sequence, enables concurrent execution. This approach not only conceals PIM latency from the CPU but also allows PIM execution to overlap with data transfers. On workloads involving LLMs and mobile applications, it delivers up to 2.8x PIM throughput improvement over baseline methods while incurring under 2.0% CPU performance loss.
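
To make the idle-window idea concrete, here is a minimal Python sketch of a greedy scheduler over a known CPU access timeline. It is a toy offline model and not the paper's algorithm: the names `idle_windows`, `schedule_pim`, `pim_cycles`, and `guard` are hypothetical, the CPU busy intervals are assumed to be known in advance for a single channel, and bank-level timing is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    start: int  # first idle cycle (inclusive)
    end: int    # first busy cycle after the gap (exclusive)


def idle_windows(busy, horizon):
    """Return the gaps between CPU busy intervals on one channel.

    `busy` is a list of half-open (start, end) cycle intervals, possibly
    unsorted or overlapping. `horizon` is the end of the observed timeline.
    """
    windows, cursor = [], 0
    for start, end in sorted(busy):
        if start > cursor:
            windows.append(Window(cursor, min(start, horizon)))
        cursor = max(cursor, end)
        if cursor >= horizon:
            break
    if cursor < horizon:
        windows.append(Window(cursor, horizon))
    return windows


def schedule_pim(busy, horizon, pim_cycles, guard=0):
    """Greedily pack fixed-length PIM commands into idle windows.

    Each command occupies `pim_cycles` cycles. `guard` cycles are kept free
    at the end of every window (an assumption of this sketch) so that a
    command never runs into the next CPU access.
    Returns the list of cycles at which PIM commands are issued.
    """
    issue_times = []
    for w in idle_windows(busy, horizon):
        t = w.start
        while t + pim_cycles + guard <= w.end:
            issue_times.append(t)
            t += pim_cycles
    return issue_times


# Made-up trace: three CPU bursts inside a 40-cycle horizon.
example = schedule_pim([(0, 4), (10, 12), (30, 34)], 40, pim_cycles=4, guard=1)
```

With this made-up trace, the sketch issues six commands, at cycles 4, 12, 16, 20, 24 and 34. A real memory controller cannot see future CPU accesses, so it would have to predict or bound window lengths rather than read them from a trace.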

What carries the argument

The idleness-aware scheduling method that places PIM commands into CPU idle time windows, supported by a low-interference PIM control interface.

If this is right

  • PIM latency can be hidden from the CPU, preserving its performance during concurrent runs.
  • PIM execution can overlap with data transfer for better overall efficiency.
  • Bank conflicts and bus congestion are minimized through careful command generation and timing.
  • Mobile devices can support both LLMs and other apps concurrently with PIM acceleration.
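
The benefit of overlapping PIM execution with data transfer can be illustrated with a small timing model. The sketch below is an assumption-laden illustration, not the paper's pipeline: it compares a fully sequential schedule with a lockstep two-stage pipeline in which the transfer for the next tile runs while the current tile executes. The function names and the per-tile cycle counts are hypothetical.

```python
def sequential_time(transfer, execute):
    """Total cycles when every transfer and every execution runs back to back."""
    return sum(transfer) + sum(execute)


def overlapped_time(transfer, execute):
    """Total cycles for a lockstep two-stage pipeline.

    While tile i executes, the transfer for tile i+1 proceeds in parallel.
    Both must finish before the pipeline advances (a simplifying assumption).
    """
    if len(transfer) != len(execute):
        raise ValueError("transfer and execute must have the same length")
    if not transfer:
        return 0
    total = transfer[0]  # the first transfer cannot be hidden
    for i, exec_cycles in enumerate(execute):
        next_transfer = transfer[i + 1] if i + 1 < len(transfer) else 0
        total += max(exec_cycles, next_transfer)
    return total


# Made-up per-tile costs in cycles.
seq = sequential_time([3, 3, 3], [5, 2, 4])
ovl = overlapped_time([3, 3, 3], [5, 2, 4])
```

For these made-up costs the sequential schedule takes 20 cycles and the overlapped one takes 15. In this model, the gain depends on how evenly transfer and execution times are balanced.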

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might apply to other shared-memory PIM systems if idle patterns are similar.
  • Hardware designers could add idle detection logic to memory controllers based on this idea.
  • Testing on a wider range of workloads could reveal whether the 2.8x gain holds across different access patterns.

Load-bearing premise

CPU memory access sequences must contain sufficient idle time windows, and the PIM interface must avoid creating extra bank conflicts or congestion not captured in the model.
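
One way to probe this premise on a trace is to measure how much of the timeline lies in idle gaps long enough to hold a PIM command. The following is a rough sketch under stated assumptions: `usable_idle_fraction` and `min_window` are hypothetical names, the trace is a made-up list of half-open busy intervals on one channel, and bank-level effects are not modeled.

```python
def usable_idle_fraction(busy, horizon, min_window):
    """Fraction of the timeline inside idle gaps of at least `min_window` cycles.

    `busy` holds half-open (start, end) CPU busy intervals, in any order.
    Gaps shorter than `min_window` are treated as unusable for PIM.
    """
    usable, cursor = 0, 0
    # A zero-length sentinel at the horizon closes the final gap.
    for start, end in sorted(busy) + [(horizon, horizon)]:
        start = min(start, horizon)
        gap = start - cursor
        if gap >= min_window:
            usable += gap
        cursor = max(cursor, min(end, horizon))
    return usable / horizon


# Made-up trace: idle gaps of 6, 18 and 6 cycles in a 40-cycle horizon.
trace = [(0, 4), (10, 12), (30, 34)]
fractions = [usable_idle_fraction(trace, 40, w) for w in (1, 8, 20)]
```

On this made-up trace the usable fraction drops from 0.75 to 0.45 to 0.0 as the minimum window grows from 1 to 8 to 20 cycles. The same raw idle time can therefore be much less useful when it is fragmented into short gaps.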

What would settle it

Measuring PIM throughput and CPU performance on actual mobile hardware executing the same LLM and workload traces, and checking whether the gains persist when idle windows are scarce or conflicts increase.
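
Such a check reduces to two ratios per workload: PIM throughput relative to the baseline scheduler, and CPU performance relative to a CPU-only run. The sketch below shows the arithmetic only. The `summarize` helper, the dictionary keys, and every number in `runs` are placeholders invented for illustration, not measurements from the paper.

```python
def summarize(runs):
    """Return (best PIM speedup, worst CPU loss) over a list of runs.

    Each run is a dict with:
      pim_base   - PIM throughput under the baseline scheduler
      pim_new    - PIM throughput under the scheduler being tested
      cpu_alone  - CPU performance with no PIM traffic
      cpu_shared - CPU performance while PIM runs concurrently
    Units are arbitrary as long as each pair uses the same one.
    """
    speedups = [r["pim_new"] / r["pim_base"] for r in runs]
    losses = [1 - r["cpu_shared"] / r["cpu_alone"] for r in runs]
    return max(speedups), max(losses)


# Placeholder values only, chosen to exercise the arithmetic.
runs = [
    {"pim_base": 10.0, "pim_new": 28.0, "cpu_alone": 1.00, "cpu_shared": 0.99},
    {"pim_base": 10.0, "pim_new": 15.0, "cpu_alone": 2.00, "cpu_shared": 1.97},
]
best_speedup, worst_loss = summarize(runs)
```

Reporting the maximum speedup next to the maximum CPU loss mirrors the "up to 2.8x" and "less than 2.0%" phrasing of the claim. A fuller check would also report the per-workload distribution.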

Figures

Figures reproduced from arXiv: 2606.30553 by Fangxin Liu, Haibing Guan, Jian Liu, Li Jiang, Mingyu Gao, Onur Mutlu, Yilong Zhao.

Figure 1. Internal/external DRAM bandwidth utilization of CPU and PIM work
Figure 2. (a) CPU workload performance under injected read latency. (b) PIM
Figure 3. (a) An example of FR-FCFS scheduling for a CPU-only workload and
Figure 4. COSM’s memory controller architecture and memory interface
Figure 5. Timing diagram of PIM unit, memory bus, Command Arbiter, and
Figure 6. The data, command, and address path of (a)
Figure 7. (a) Conventional software-controlled three-stage sequential scheduling.
Figure 8. Overall PIM & CPU performance of COSM and baselines for concurrent CPU and PIM execution.
Figure 9. Normalized CPU and PIM workload performance under fixed-length and preemptable PIM execution command. Cases where CPU performance degrades
Figure 10. Impact of CPU-mediated data transfers on CPU performance under different scheduling strategies. We test on the attention layers of the benchmarks.
Figure 11. (a) PIM performance and (b) Internal bandwidth usage during concur
Figure 12. PIM workload energy consumption per token (including PIM unit computation, PIM bank access, and CPU-mediated data transfer) of COSM and baselines
Figure 13. Sensitivity analysis on (a) nP T L (b) rank count per channel (c) KV cache size (d) scaled tRP (n×, relative to

Original abstract

The development of on-device large language models (LLMs) is driven by the need for privacy and fast response times. Energy-intensive data transfer on mobile devices makes Processing-in-Memory (PIM) an effective solution. Due to stringent DRAM cost constraints, limited physical footprint on circuit boards, and the interaction between applications and LLMs, it is imperative for the CPU and PIM to operate concurrently within a shared memory space. However, challenges such as bank conflicts and bus congestion can arise, potentially diminishing the performance and energy benefits of PIM. To address this challenge, we introduce COSM, a cooperative scheduling framework designed to facilitate the concurrent operation of PIM and CPU tasks on mobile platforms. Our key innovations include: 1) a low-interference PIM control interface that generates the maximum number of PIM commands without disrupting CPU memory accesses; 2) an idleness-aware scheduling method that integrates PIM commands into available idle time windows within the CPU's access sequence. COSM not only hides PIM execution latency from the CPU, but also overlaps PIM execution with data transfer. Experiments on concurrent execution of LLMs and mobile workloads, including mobile applications and compute-intensive kernels, demonstrate that COSM improves PIM throughput by up to 2.8x compared to the baseline scheduling method with less than 2.0% CPU performance loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces COSM, a cooperative scheduling framework for concurrent PIM and CPU execution on mobile devices. It proposes a low-interference PIM control interface that maximizes PIM commands without disrupting CPU accesses and an idleness-aware scheduler that inserts PIM commands into CPU idle time windows. On concurrent LLM and mobile workloads, it claims up to 2.8x PIM throughput improvement versus a baseline scheduler with under 2% CPU performance loss.

Significance. If the evaluation holds, the work addresses a practical barrier to PIM adoption on mobile platforms by enabling safe overlap of PIM and CPU traffic in a shared DRAM space, which could improve both performance and energy efficiency for on-device LLMs under tight cost and footprint constraints.

major comments (2)
  1. [Abstract] Abstract: the reported 2.8x PIM throughput gain and <2% CPU loss are presented without any description of the experimental setup, workload selection criteria, error bars, simulator fidelity (bank-level timing, bus model), or the method used to identify and measure CPU idle windows. These omissions are load-bearing because the central claim rests on the scheduler successfully locating usable idle slots and the interface issuing maximum commands without unmodeled contention.
  2. [Evaluation] Evaluation section (implied by the abstract's experimental claims): the paper does not demonstrate that the measured overlap survives when CPU access sequences are taken from production mobile DRAM traces rather than filtered or generated workloads, nor does it quantify residual bank conflicts or bus congestion that the low-interference interface might still induce. This directly affects whether the 2.8x figure is an artifact of idealized idle-window assumptions.
minor comments (1)
  1. [Abstract] Abstract: the baseline scheduler against which the 2.8x gain is measured is not named or briefly characterized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 2.8x PIM throughput gain and <2% CPU loss are presented without any description of the experimental setup, workload selection criteria, error bars, simulator fidelity (bank-level timing, bus model), or the method used to identify and measure CPU idle windows. These omissions are load-bearing because the central claim rests on the scheduler successfully locating usable idle slots and the interface issuing maximum commands without unmodeled contention.

    Authors: Abstracts are concise by design, but we agree that the numerical claims would benefit from additional context. The Evaluation section provides the simulator details (cycle-accurate model with bank-level timing and bus model), workload selection criteria (real mobile applications and compute-intensive kernels run concurrently with LLM inference), error bars on all figures, and the idleness-aware scheduling algorithm for detecting CPU idle windows. We will revise the abstract to add one sentence briefly noting that results come from detailed simulation of representative mobile workloads. revision: yes

  2. Referee: [Evaluation] Evaluation section (implied by the abstract's experimental claims): the paper does not demonstrate that the measured overlap survives when CPU access sequences are taken from production mobile DRAM traces rather than filtered or generated workloads, nor does it quantify residual bank conflicts or bus congestion that the low-interference interface might still induce. This directly affects whether the 2.8x figure is an artifact of idealized idle-window assumptions.

    Authors: The workloads are drawn from actual mobile applications and kernels executing alongside LLM inference, which capture realistic access patterns under mobile constraints. The low-interference interface and idleness-aware scheduler are evaluated for their ability to limit contention, with results showing <2% CPU slowdown. We did not evaluate against raw production DRAM traces. We will add a paragraph in the Evaluation section discussing workload representativeness and any observed residual bank or bus effects to address this concern. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical evaluation against baseline

Full rationale

The paper introduces COSM as a scheduling framework with two stated innovations (low-interference PIM interface and idleness-aware scheduling) and reports throughput and overhead numbers from direct experimental comparison on LLM + mobile workloads. No equations, fitted parameters, or self-citation chains are presented in the provided text that would reduce the 2.8x claim or the <2% loss figure to quantities defined by the inputs themselves. The central results are therefore independent empirical measurements rather than self-referential derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the framework description relies on standard hardware assumptions about memory access patterns and idle windows without stating fitted constants or new postulates.

pith-pipeline@v0.9.1-grok · 5787 in / 1179 out tokens · 23891 ms · 2026-07-01T06:25:42.420886+00:00 · methodology


Reference graph

Works this paper leans on

134 extracted references · 10 canonical work pages · 3 internal anchors

  1. [1]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue, ” arXiv e-prints, p. arXiv:2410.00037, Sep. 2024

  2. [2]

    MinMo: A Multimodal Large Language Model for Seamless Voice Interaction,

    Q. Chen, Y. Chen, Y. Chen, M. Chen, Y. Chen, C. Deng, Z. Du, R. Gao, C. Gao, Z. Gao, Y. Li, X. Lv, J. Liu, H. Luo, B. Ma, C. Ni, X. Shi, J. Tang, H. Wang, H. Wang, W. Wang, Y. Wang, Y. Xu, F. Yu, Z. Yan, Y. Yang, B. Yang, X. Yang, G. Yang, T. Zhao, Q. Zhang, S. Zhang, N. Zhao, P. Zhang, C. Zhang, and J. Zhou, “MinMo: A Multimodal Large Language Model for ...

  3. [3]

    Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play,

    Y. Shi, Y. Shu, S. Dong, G. Liu, J. Sesay, J. Li, and Z. Hu, “Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play, ” arXiv e-prints, p. arXiv:2505.02707, May 2025

  4. [4]

    Simultaneous Ma- chine Translation with Large Language Models,

    M. Wang, J. Zhao, T.-T. Vu, F. Shiri, E. Shareghi, and G. Haffari, “Simultaneous Ma- chine Translation with Large Language Models, ”arXiv e-prints, p. arXiv:2309.06706, Sep. 2025

  5. [5]

    LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline,

    B. Fu, M. Liao, K. Fan, C. Li, L. Zhang, Y. Chen, and X. Shi, “LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline, ”arXiv e-prints, p. arXiv:2504.09570, Apr. 2025

  6. [6]

    MediaPipe: A Framework for Building Perception Pipelines

    C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.-L. Chang, M. G. Yong, J. Lee, W.-T. Chang, W. Hua, M. Georg, and M. Grundmann, “MediaPipe: A Framework for Building Perception Pipelines, ”arXiv e-prints, p. arXiv:1906.08172, Jun. 2019

  7. [7]

    MotionBridge: Dynamic Video Inbetweening with Flexible Controls,

    M. Tanveer, Y. Zhou, S. Niklaus, A. Mahdavi Amiri, H. Zhang, K. K. Singh, and N. Zhao, “MotionBridge: Dynamic Video Inbetweening with Flexible Controls, ” arXiv e-prints, p. arXiv:2412.13190, Dec. 2024

  8. [8]

    Apple Intelligence,

    Apple Inc., “Apple Intelligence, ” https://www.apple.com/apple-intelligence/, 2024, accessed: 2025-11-05

  9. [9]

    Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition,

    H. Chen, Y. Wang, K. Han, D. Li, L. Li, Z. Bi, J. Li, H. Wang, F. Mi, M. Zhu, B. Wang, K. Song, Y. Fu, X. He, Y. Luo, C. Zhu, Q. He, X. Wu, W. He, H. Hu, Y. Tang, D. Tao, X. Chen, and Y. Wang, “Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition, ”arXiv e-prints, p. arXiv:2505.22375, May 2025

  10. [10]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone,

    Y. Yao, T. Yu, A. Zhang, C. Wang, J. Cui, H. Zhu, T. Cai, H. Li, W. Zhao, Z. Heet al., “MiniCPM-V: A GPT-4V Level MLLM on Your Phone, ”Nat Commun 16, 5509 (2025), 2025

  11. [11]

    Samsung Electronics,Galaxy.ai - The #1 All-in-One AI Platform, https://galaxy.ai/, 2024, accessed: 2025-11-05

  12. [12]

    BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices,

    X. Lu, Y. Chen, C. Chen, H. Tan, B. Chen, Y. Xie, R. Hu, G. Tan, R. Wu, Y. Hu, Y. Zeng, L. Wu, L. Bian, Z. Wang, L. Liu, Y. Yang, H. Xiao, A. Zhou, Y. Wen, X. Chen, S. Ren, and H. Li, “BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices, ” inProceedings of the IEEE/CVF Conference on Computer Vision and Patter...

  13. [13]

    Unlocking On-Device Generative AI with an NPU and Heterogeneous Computing,

    Qualcomm Technologies, Inc., “Unlocking On-Device Generative AI with an NPU and Heterogeneous Computing, ” Qualcomm, Tech. Rep., 2024, accessed: 2025-11-15

  14. [14]

    vivo Unveils New AI Strategy: BlueHeart Large Model Matrix and Major Upgrade to OriginOS 5,

    vivo, “vivo Unveils New AI Strategy: BlueHeart Large Model Matrix and Major Upgrade to OriginOS 5, ” https://www.vivo.com.cn/brand/news/detail?id=1271, 2024, accessed: 2025-11-15

  15. [15]

    The true Processing In Memory accelerator,

    F. Devaux, “The true Processing In Memory accelerator, ” in2019 IEEE Hot Chips 31 Symposium (HCS), Aug 2019, pp. 1–24

  16. [16]

    A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC Operation and Various Activation Functions for Deep- Learning Applications,

    S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, N. Kim, Y. Kwon, K. Vladimir, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, J. Lee, D. Ko, Y. Jun, K. Cho, I. Kim, C. Song, C. Jeong, D. Kwon, J. Jang, I. Park, J. Chun, and J. Cho, “A 1ynm 1.25V 8Gb, 16Gb/s/pin GDDR6-based Accelerator-in-Memory supporting 1TFLOPS MAC...

  17. [17]

    PUSHtap: PIM-based In-Memory HTAP with Unified Data Storage Format,

    Y. Zhao, M. Gao, H. Zhang, F. Liu, G. Chen, H. Xian, H. Guan, and L. Jiang, “PUSHtap: PIM-based In-Memory HTAP with Unified Data Storage Format, ” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’25, 2025, pp. 179–194

  18. [18]

    UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space,

    Y. Zhao, M. Gao, F. Liu, Y. Hu, Z. Wang, H. Lin, J. Li, H. Xian, H. Dong, T. Yang, N. Jing, X. Liang, and L. Jiang, “UM-PIM: DRAM-based PIM with Uniform & Shared Memory Space, ” in2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), June 2024, pp. 644–659

  19. [19]

    PAPI: Exploiting Dynamic Parallelism in Large Language Model De- coding with a Processing-In-Memory-Enabled Computing System,

    Y. He, H. Mao, C. Giannoula, M. Sadrosadati, J. Gómez-Luna, H. Li, X. Li, Y. Wang, and O. Mutlu, “PAPI: Exploiting Dynamic Parallelism in Large Language Model De- coding with a Processing-In-Memory-Enabled Computing System, ” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vo...

  20. [20]

    NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing,

    G. Heo, S. Lee, J. Cho, H. Choi, S. Lee, H. Ham, G. Kim, D. Mahajan, and J. Park, “NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing, ” in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ser. ASPLOS ’24, 2024, pp. 722–737

  21. [21]

    Unifying Two Operators with One PIM: Leveraging Hybrid Bonding for Efficient LLM Inference,

    J. Chen, Y. Qi, K. Sun, Z. Lin, T. Wang, C. Ma, and Y. Wang, “Unifying Two Operators with One PIM: Leveraging Hybrid Bonding for Efficient LLM Inference, ” inAdvanced Parallel Processing Technologies, C. Li, X. Qian, D. Gizopoulos, and B. Grot, Eds., 2026, pp. 215–230

  22. [22]

    Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving,

    W. Kim, Y. Lee, Y. Kim, J. Hwang, S. Oh, J. Jung, A. Huseynov, W. G. Park, C. H. Park, D. Mahajan, and J. Park, “Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving, ” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25, 2025, pp. 292–307

  23. [23]

    AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference,

    J. Park, J. Choi, K. Kyung, M. J. Kim, Y. Kwon, N. S. Kim, and J. H. Ahn, “AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference, ” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’24, 2024, p. 103119

  24. [24]

    Pyramid: Accelerating LLM Infer- ence With Cross-Level Processing-in-Memory,

    L. Yan, X. Lu, X. Chen, Y. Han, and X.-H. Sun, “Pyramid: Accelerating LLM Infer- ence With Cross-Level Processing-in-Memory, ”IEEE Computer Architecture Letters, vol. 24, no. 1, pp. 121–124, Jan 2025

  25. [25]

    A scalable processing-in-memory ac- celerator for parallel graph processing,

    J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory ac- celerator for parallel graph processing, ” in2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 105–117

  26. [26]

    PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture,

    J. Ahn, S. Yoo, O. Mutlu, and K. Choi, “PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture, ” in2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 336–348

  27. [27]

    Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks,

    A. Boroumand, S. Ghose, B. Akin, R. Narayanaswami, G. F. Oliveira, X. Ma, E. Shiu, and O. Mutlu, “Google Neural Network Models for Edge Devices: Analyzing and Mitigating Machine Learning Inference Bottlenecks, ” in2021 30th International Conference on Parallel Architectures and Compilation Techniques (PACT), Sep. 2021, pp. 159–172

  28. [28]

    SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-Memory Systems,

    M. Besta, R. Kanakagiri, G. Kwasniewski, R. Ausavarungnirun, J. Beránek, K. Kanel- lopoulos, K. Janda, Z. Vonarburg-Shmaria, L. Gianinazzi, I. Stefan, J. G. Luna, J. Goli- nowski, M. Copik, L. Kapp-Schwoerer, S. Di Girolamo, N. Blach, M. Konieczny, O. Mutlu, and T. Hoefler, “SISA: Set-Centric Instruction Set Architecture for Graph Mining on Processing-in-...

  29. [29]

    Google Workloads for Con- sumer Devices: Mitigating Data Movement Bottlenecks,

    A. Boroumand, S. Ghose, Y. Kim, R. Ausavarungnirun, E. Shiu, R. Thakur, D. Kim, A. Kuusela, A. Knies, P. Ranganathan, and O. Mutlu, “Google Workloads for Con- sumer Devices: Mitigating Data Movement Bottlenecks, ” inProceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLO...

  30. [30]

    PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference,

    Y. Gu, A. Khadem, S. Umesh, N. Liang, X. Servot, O. Mutlu, R. Iyer, and R. Das, “PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference, ” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ser. ASPLOS ’25, 2025, pp. 862–881

  31. [31]

    Mutlu, S

    O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun,A Modern Primer on Processing in Memory, 2023, pp. 171–243

  32. [32]

    Processing data where it makes sense in modern computing systems: En- abling in-memory computation,

    O. Mutlu, “Processing data where it makes sense in modern computing systems: En- abling in-memory computation, ” in2018 7th Mediterranean Conference on Embedded Computing (MECO), June 2018, pp. 8–9

  33. [33]

    Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System,

    J. Gómez-Luna, I. E. Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System, ”IEEE Access, vol. 10, pp. 52 565–52 608, 2022

  34. [34]

    Evaluating Machine LearningWorkloads on Memory-Centric Com- puting Systems,

    J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu, “Evaluating Machine LearningWorkloads on Memory-Centric Com- puting Systems, ” in2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2023, pp. 35–49

  35. [35]

    Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing- In-Memory Hardware,

    J. Gómez-Luna, I. El Hajj, I. Fernandez, C. Giannoula, G. F. Oliveira, and O. Mutlu, “Benchmarking Memory-Centric Computing Systems: Analysis of Real Processing- In-Memory Hardware, ” in2021 12th International Green and Sustainable Computing Conference (IGSC), Oct 2021, pp. 1–7

  36. [36]

    Samsung PIM/PNM for Transfmer Based AI : Energy Efficiency on PIM/PNM Cluster,

    J. H. Kim, Y. Ro, J. So, S. Lee, S.-h. Kang, Y. Cho, H. Kim, B. Kim, K. Kim, S. Park, J.-S. Kim, S. Cha, W.-J. Lee, J. Jung, J.-G. Lee, J. Lee, J. Song, S. Lee, J. Cho, J. Yu, and K. Sohn, “Samsung PIM/PNM for Transfmer Based AI : Energy Efficiency on PIM/PNM Cluster, ” in2023 IEEE Hot Chips 35 Symposium (HCS), Aug 2023, pp. 1–31

  37. [37]

    PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems,

    D. Lee, B. Hyun, T. Kim, and M. Rhu, “PIM-MMU: A Memory Management Unit for Accelerating Data Transfers in Commercial PIM Systems, ” in2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), Nov 2024, pp. 627–642

  38. [38]

    AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing,

    L. Chen, D. Lyu, J. Jiang, Q. Wang, Z. Mao, and N. Jing, “AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing, ” in2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2025, pp. 518–532

  39. [39]

    Near Data Acceleration with Concurrent Host Access,

    B. Y. Cho, Y. Kwon, S. Lym, and M. Erez, “Near Data Acceleration with Concurrent Host Access, ” in2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), May 2020, pp. 818–831

  40. [40]

    Concurrent PIM and Load/Store Servicing in PIM-Enabled Memory,

    S. Gupta, N. Madan, S. Puthoor, N. Jayasena, and S. Dwarkadas, “Concurrent PIM and Load/Store Servicing in PIM-Enabled Memory, ” in2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), May 2025, pp. 320–334

  41. [41]

    ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration,

    S. Yu, H. Kim, K. Jeun, S. Hwang, S. Cho, and E. Lee, “ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration, ” inProceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25, 2025, pp. 49–62

  42. [42]

    25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a 1.2TFLOPS Programmable Computing Unit Using Bank- Level Parallelism, for Machine Learning Applications,

    Y.-C. Kwon, S. H. Lee, J. Lee, S.-H. Kwon, J. M. Ryu, J.-P. Son, O. Seongil, H.-S. Yu, H. Lee, S. Y. Kim, Y. Cho, J. G. Kim, J. Choi, H.-S. Shin, J. Kim, B. Phuah, H. Kim, M. J. Song, A. Choi, D. Kim, S. Kim, E.-B. Kim, D. Wang, S. Kang, Y. Ro, S. Seo, J. Song, J. Youn, K. Sohn, and N. S. Kim, “25.4 A 20nm 6GB Function-In-Memory DRAM, Based on HBM2 with a...

  43. [43]

    JEDEC JESD209-5C: Low Power Double Data Rate 5 (LPDDR5),

    JEDEC, “JEDEC JESD209-5C: Low Power Double Data Rate 5 (LPDDR5), ” JEDEC Solid State Technology Association, Tech. Rep., 6 2023, revision of JESD209-5B (June 2021)

  44. [44]

    A case for exploiting subarray-level parallelism (SALP) in DRAM,

    Y. Kim, V. Seshadri, D. Lee, J. Liu, and O. Mutlu, “A case for exploiting subarray-level parallelism (SALP) in DRAM, ” in2012 39th Annual International Symposium on Computer Architecture (ISCA), June 2012, pp. 368–379

  45. [45]

    Tiered-latency DRAM: A low latency and low cost DRAM architecture,

    D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, and O. Mutlu, “Tiered-latency DRAM: A low latency and low cost DRAM architecture, ” in2013 IEEE 19th Interna- tional Symposium on High Performance Computer Architecture (HPCA), Feb 2013, pp. 615–626

  46. [46]

    PIM-AI: A Novel Architecture for High- Efficiency LLM Inference,

    C. Ortega, Y. Falevoz, and R. Ayrignac, “PIM-AI: A Novel Architecture for High- Efficiency LLM Inference, ”arXiv e-prints, p. arXiv:2411.17309, Nov. 2024

  47. [47]

    ALPHA-PIM: Analysis of Linear Algebraic Process- ing for High-Performance Graph Applications on a Real Processing-In-Memory Sys- tem,

    M. Barkhordar, A. Tabatabaeian, M. Sadrosadati, C. Giannoula, J. G. Luna, I. El Hajj, O. Mutlu, and A. R. Alameldeen, “ALPHA-PIM: Analysis of Linear Algebraic Process- ing for High-Performance Graph Applications on a Real Processing-In-Memory Sys- tem, ” in2025 IEEE International Symposium on Workload Characterization (IISWC), Oct 2025, pp. 257–271

  48. [48]

    PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures,

    C. Giannoula, P. Yang, I. Fernandez, J. Yang, S. Durvasula, Y. X. Li, M. Sadrosadati, J. G. Luna, O. Mutlu, and G. Pekhimenko, “PyGim: An Efficient Graph Neural Network Library for Real Processing-In-Memory Architectures, ”Proc. ACM Meas. Anal. Comput. Syst., vol. 8, no. 3, Dec. 2024

  49. [49]

    SparseP: Efficient Sparse Matrix Vector Multiplication on Real Processing-In- Memory Architectures,

    C. Giannoula, I. Fernandez, J. Gómez-Luna, N. Koziris, G. Goumas, and O. Mutlu, “SparseP: Efficient Sparse Matrix Vector Multiplication on Real Processing-In- Memory Architectures, ” in2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2022, pp. 288–291

  50. [50]

    System Architecture and Software Stack for GDDR6-AiM,

    Y. Kwon, K. Vladimir, N. Kim, W. Shin, J. Won, M. Lee, H. Joo, H. Choi, G. Kim, B. An, J. Kim, J. Lee, I. Kim, J. Park, C. Park, Y. Song, B. Yang, H. Lee, S. Kim, D. Kwon, S. Lee, K. Kim, S. Oh, J. Park, G. Hong, D. Ka, K. Hwang, J. Park, K. Kang, J. Kim, J. Jeon, M. Lee, M. Shin, M. Shin, J. Cha, C. Jung, K. Chang, C. Jeong, E. Lim, I. Park, J. Chun, and...

  51. [51]

    Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product,

    S. Lee, S.-h. Kang, J. Lee, H. Kim, E. Lee, S. Seo, H. Yoon, S. Lee, K. Lim, H. Shin, J. Kim, O. Seongil, A. Iyer, D. Wang, K. Sohn, and N. S. Kim, “Hardware Architecture and Software Stack for PIM Based on Commercial DRAM Technology : Industrial Product, ” in2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), June 2021, pp. 43–56

  52. [52]

    DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning,

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Xu, H. Ding, H. Gao, H. Qu, H. Li, J. Gu...

  53. [53]

    Memory access scheduling,

    S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, “Memory access scheduling,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, ser. ISCA ’00, 2000, pp. 128–138

  54. [54]

    Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order,

    D. M. Zuravleff and J. I. Robinson, “Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order,” May 1997, US Patent 5,630,096

  55. [55]

    Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,

    T. Moscibroda and O. Mutlu, “Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems,” in 16th USENIX Security Symposium (USENIX Security 07), Aug 2007

  56. [56]

    Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,

    O. Mutlu and T. Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” in 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), Dec 2007, pp. 146–160

  57. [57]

    Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,

    O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,” in 2008 International Symposium on Computer Architecture, June 2008, pp. 63–74

  58. [58]

    The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost,

    L. Subramanian, D. Lee, V. Seshadri, H. Rastogi, and O. Mutlu, “The Blacklisting Memory Scheduler: Achieving high performance and fairness at low cost,” in 2014 IEEE 32nd International Conference on Computer Design (ICCD), Oct 2014, pp. 8–15

  59. [59]

    ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers,

    Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, “ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers,” in HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, Jan 2010, pp. 1–12

  60. [60]

    Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator,

    H. Luo, Y. C. Tuğrul, F. N. Bostancı, A. Olgun, A. G. Yağlıkçı, and O. Mutlu, “Ramulator 2.0: A Modern, Modular, and Extensible DRAM Simulator,” IEEE Computer Architecture Letters, vol. 23, no. 1, pp. 112–116, Jan 2024

  61. [61]

    Ramulator V2.0a,

    CMU-SAFARI, “Ramulator V2.0a,” https://github.com/CMU-SAFARI/ramulator2, 2023, accessed: 2025-11-05

  62. [62]

    Ramulator: A Fast and Extensible DRAM Simulator,

    Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A Fast and Extensible DRAM Simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1, pp. 45–49, 2016

  63. [63]

    DRAMPower 5: An Open-Source Power Simulator for Current Generation DRAM Standards,

    L. Steiner, T. Psota, M. Mörz, D. Christ, M. Jung, and N. Wehn, “DRAMPower 5: An Open-Source Power Simulator for Current Generation DRAM Standards,” in Proceedings of the Rapid Simulation and Performance Evaluation for Design Workshop, ser. RAPIDO ’25, 2025, pp. 8–16

  64. [64]

    Snapdragon 888 5G Mobile Platform,

    Qualcomm Technologies, Inc., “Snapdragon 888 5G Mobile Platform,” https://www.qualcomm.com/smartphones/products/8-series/snapdragon-888-5g-mobile-platform, 2020, accessed: 2026-02-19

  65. [65]

    ZSim: fast and accurate microarchitectural simulation of thousand-core systems,

    D. Sanchez and C. Kozyrakis, “ZSim: fast and accurate microarchitectural simulation of thousand-core systems,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA ’13, 2013, pp. 475–486

  66. [66]

    Xiaomi Mi 11 Pro - Technical Specifications,

    Xiaomi Corporation, “Xiaomi Mi 11 Pro - Technical Specifications,” https://www.mi.com/mi11Pro/specs, 2021, accessed: 2026-02-19

  67. [67]

    Stalker — Frida Documentation,

    F. Developers, “Stalker — Frida Documentation,” https://frida.re/docs/stalker/, 2025, accessed: 2025-11-15

  68. [68]

    Android 15,

    Google LLC, “Android 15,” https://developer.android.google.cn/about/versions/15, 2024, accessed: 2025-11-05

  69. [69]

    G. Yeap, S. S. Lin, Y. M. Chen, H. L. Shang, P. W. Wang, H. C. Lin, Y. C. Peng, J. Y. Sheu, M. Wang, X. Chen, B. R. Yang, C. P. Lin, F. C. Yang, Y. K. Leung, D. W. Lin, C. P. Chen, K. F. Yu, D. H. Chen, C. Y. Chang, H. K. Chen, P. Hung, C. S. Hou, Y. K. Cheng, J. Chang, L. Yuan, C. K. Lin, C. C. Chen, Y. C. Yeo, M. H. Tsai, H. T. Lin, C. O. Chui, K. B. Hu...

  70. [70]

    Auto-tuning a high-level language targeted to GPU codes,

    S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos, “Auto-tuning a high-level language targeted to GPU codes,” in 2012 Innovative Parallel Computing (InPar), May 2012, pp. 1–10

  71. [71]

    SPEC CPU2017,

    Standard Performance Evaluation Corporation (SPEC), “SPEC CPU2017,” https://www.spec.org/cpu2017/, 2017, accessed: 2025-11-15

  72. [72]

    BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model,

    BigScience, “BigScience Language Open-science Open-access Multilingual (BLOOM) Language Model,” https://huggingface.co/bigscience/bloom-1b1, 2022, accessed: 2025-11-05

  73. [73]

    Qwen2 Technical Report

    A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Yang, J. Xu, J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, S. Tan, T. Zhu, T. Li, T...

  74. [74]

    FALA: Locality-Aware PIM-Host Cooperation for Graph Processing with Fine-Grained Column Access,

    C. Shin, J. Song, S. Na, J. Sung, H. Jang, and J. Lee, “FALA: Locality-Aware PIM-Host Cooperation for Graph Processing with Fine-Grained Column Access,” in Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’25, 2025, pp. 1520–1534

  75. [75]

    Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,

    V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” in 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2017, pp. 273–287

  76. [76]

    SIMDRAM: a framework for bit-serial SIMD processing using DRAM,

    N. Hajinazar, G. F. Oliveira, S. Gregorio, J. a. D. Ferreira, N. M. Ghiasi, M. Patel, M. Alser, S. Ghose, J. Gómez-Luna, and O. Mutlu, “SIMDRAM: a framework for bit-serial SIMD processing using DRAM,” in Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’21, 2021, pp...

  77. [77]

    MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing,

    G. F. Oliveira, A. Olgun, A. G. Yağlıkçı, F. N. Bostancı, J. Gómez-Luna, S. Ghose, and O. Mutlu, “MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Computing,” in 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), March 2...

  78. [78]

    Proteus: Achieving High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic,

    G. F. de Oliveira Junior, M. Kabra, Y. Guo, K. Chen, A. G. Yaglikci, M. Soysal, M. Sadrosadati, J. Olivares Bueno, S. Ghose, J. Gómez-Luna, and O. Mutlu, “Proteus: Achieving High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic,” in Proceedings of the 39th ACM International Conference o...

  79. [79]

    DRISA: A DRAM-based Reconfigurable In-Situ Accelerator,

    S. Li, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and Y. Xie, “DRISA: A DRAM-based Reconfigurable In-Situ Accelerator,” in 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2017, pp. 288–301

  80. [80]

    Compute Caches,

    S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das, “Compute Caches,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 481–492

Showing first 80 references.