Residual GPU Cache State on Apple M4 Pro

Baris Basaran; Faruk Alpay

arxiv: 2606.27098 · v1 · pith:FBVLSXZFnew · submitted 2026-06-25 · 💻 cs.AR

Residual GPU Cache State on Apple M4 Pro

Faruk Alpay , Baris Basaran This is my paper

Pith reviewed 2026-06-26 01:55 UTC · model grok-4.3

classification 💻 cs.AR

keywords Apple M4 ProGPU cacheshared cache displacementunified memoryMetal synchronizationcache recoveryperformance measurement

0 comments

The pith

GPU kernels on the M4 Pro leave residual displacement in the shared cache, slowing the first subsequent CPU traversal until a second pass recovers most performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes the undocumented cache state after GPU commands complete on the 14-core Apple M4 Pro. Using a synchronized Metal experiment, it shows that touching large memory regions on the GPU displaces cache lines that affect CPU access, with the first 16 MiB probe traversal slower and a second traversal removing most of the cost. This indicates shared-cache displacement rather than ongoing DRAM contention. A simple one-pass software mechanism can recover the state, and the work grounds the observation with PMU and IOReport data separating cache effects from other factors.

Core claim

The paper establishes that a GPU kernel that touches between 0 and 512 MiB and then finishes creates a measurable post-GPU cache-displacement window on the M4 Pro. The first CPU traversal after the GPU command is slower for larger footprints, while a second traversal largely eliminates the slowdown, demonstrating that the effect is residual shared-cache displacement separable from simultaneous DRAM contention. Matched-block experiments further show that high-priority CPU traffic causes only baseline-level slowdown on the GPU.

What carries the argument

The synchronized Metal experiment paired with an 8192-byte system-level-cache occupancy pattern and a 16 MiB CPU probe that starts after the GPU kernel finishes, used to measure traversal slowdowns as evidence of residual displacement.

If this is right

Software can apply a one-pass recovery traversal to clear most of the post-GPU cache cost.
The displacement effect scales with GPU memory footprint size up to 512 MiB.
GPU performance under background CPU traffic stays close to baseline when QoS is managed.
Root PMU and IOReport data can distinguish L1D refills, page-offset conflicts, and core-type demands in such measurements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Task schedulers on unified-memory Apple silicon could time CPU probes to exploit or avoid this window.
Similar residual cache effects may appear on other M-series chips with unified CPU-GPU memory.
Workload designers might interleave small GPU and CPU phases to reduce cross-device cache pollution.
The measurement pipeline could extend to quantify recovery costs for different probe sizes.

Load-bearing premise

The synchronized Metal timing and CPU probe accurately capture residual shared-cache displacement without being confounded by DRAM contention, OS scheduling, or measurement overhead.

What would settle it

If first and second CPU traversals show identical times regardless of prior GPU footprint size, or if the slowdown remains unchanged after the second pass.

Figures

Figures reproduced from arXiv: 2606.27098 by Baris Basaran, Faruk Alpay.

**Figure 2.** Figure 2: Cache-line-granular pointer chasing on the user-interactive path. Each node occupies one 128-byte line; [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Address-pattern separation after completed GPU work. Both probes touch 16 MiB, but the 8192-byte [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Residual GPU cache state. The Metal victim completes before either CPU measurement. The first [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Randomized resource-mode matrix. B-sh/B-pr are shared/private buffers; T-sh/T-pr are shared/private [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: L1D refill granularity from dependent two-load pairs. Each random record is flushed before the pass. A [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Root-PMU working-set sweep on a performance core. Blue (left, log): dependent-load latency in cycles, [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗

**Figure 8.** Figure 8: Page-offset-dependent congruent conflict sweep. The 16-KiB page stride does not give a single clean [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗

**Figure 9.** Figure 9: Live IOReport DCS-agent separation in seven rotated blocks. The high-priority CPU stress saturates [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

**Figure 10.** Figure 10: Concurrent contention in seven rotated blocks with matched ten-thread CPU workloads. GPU error [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗

read the original abstract

Apple silicon exposes unified CPU-GPU memory, but the cache state left after a completed GPU command is not documented. This paper characterizes that phase boundary on a 14-core Apple M4 Pro. We validate the measurement pipeline against unmodified STREAM 5.10 and BabelStream 5.0, then adapt an 8192-byte system-level-cache occupancy pattern to a synchronized Metal experiment. A GPU kernel touches 0 to 512 MiB and finishes before a 16 MiB CPU probe begins. The first CPU traversal is slower after large GPU footprints, while a second traversal removes most of the cost, showing residual shared-cache displacement rather than simultaneous DRAM contention. A separate matched-block experiment measures GPU slowdown under high-priority CPU traffic and finds background QoS close to baseline. Root PMU measurements and public IOReport histograms provide hardware grounding: they distinguish L1D refill sectors from software cache-line size, expose page-offset-dependent conflict behavior, and separate performance-core, efficiency-core, and AGX demand. The results identify a reproducible post-GPU cache-displacement window on M4 Pro and quantify a simple one-pass software recovery mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives new empirical numbers on residual shared-cache state after GPU commands finish on the M4 Pro, but the timing isolation between Metal completion and the CPU probe still needs tighter controls.

read the letter

The core contribution is a set of measurements showing that a GPU kernel touching hundreds of MiB leaves a detectable slowdown on the first CPU traversal of a 16 MiB probe, with most of the cost disappearing on the second pass. They tie this to residual state in the unified cache rather than ongoing DRAM traffic, and they report a simple one-pass recovery pattern. That specific post-command window on M4 Pro is not in the prior STREAM literature.

They handle the validation step cleanly by running unmodified STREAM and BabelStream first, then add PMU and IOReport data to separate L1D refills, page-offset conflicts, and core-type demand. The matched-block QoS test is a useful side check on background interference.

The soft spot is the synchronization assumption. The claim that the first-traversal slowdown is purely residual cache displacement rests on the interval between Metal command end and CPU probe start being free of scheduler jitter, incomplete GPU flush, TLB changes, or probe-induced effects. The abstract and stress-test note do not show explicit timing histograms or controls that rule these out, so the evidence for isolation is thinner than the headline result. Minor issues like missing error bars or full data tables would be easy to fix.

This is for people doing low-level tuning on Apple silicon where GPU-CPU handoff matters. The work is narrow and device-specific, but the measurements are new and the setup is replicable. It deserves a serious referee so the methods section can be examined in detail.

Referee Report

2 major / 2 minor

Summary. The manuscript claims to empirically characterize undocumented residual shared-cache state after completed GPU commands on the 14-core Apple M4 Pro. Using a synchronized Metal experiment, a GPU kernel touches 0–512 MiB before a 16 MiB CPU probe begins; the first CPU traversal exhibits slowdown after large GPU footprints while a second traversal recovers most performance, indicating cache displacement rather than ongoing DRAM contention. The pipeline is validated against unmodified STREAM 5.10 and BabelStream 5.0; separate QoS tests, root PMU data, and public IOReport histograms distinguish L1D refills, page-offset conflicts, and core/AGX demand. The results identify a reproducible post-GPU cache-displacement window and quantify a one-pass software recovery mechanism.

Significance. If the isolation of cache effects holds, the work supplies the first detailed empirical data on the CPU–GPU phase boundary in Apple unified memory, a previously undocumented aspect of the architecture. The two-traversal design, hardware-counter grounding, and identification of a simple recovery mechanism are strengths that could inform performance tuning for heterogeneous workloads on these platforms. The reproducible measurement approach may also serve as a template for similar studies on other unified-memory SoCs.

major comments (2)

[Abstract and synchronized Metal experiment description] Abstract and synchronized Metal experiment description: the claim that the second traversal demonstrates residual shared-cache displacement (rather than DRAM contention) is load-bearing, yet the validation against STREAM/BabelStream does not directly test the timing window or rule out OS preemption, TLB/prefetcher state, or probe-induced conflicts with M4 Pro SLC/L2 sizes.
[PMU/IOReport grounding paragraph] PMU/IOReport grounding paragraph: while the counters distinguish L1D refill sectors from software cache-line size and separate performance/efficiency/AGX demand, the manuscript does not present data showing absence of DRAM traffic correlation specifically during the CPU probe window after Metal command completion.

minor comments (2)

The manuscript should include quantitative timing histograms or jitter statistics for the interval between GPU command completion and CPU probe start to allow readers to assess scheduling variability.
Tables or supplementary data files reporting per-run traversal times, standard deviations, and exact GPU footprint sizes would strengthen verifiability of the occupancy results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's insightful comments on our work characterizing residual GPU cache state on the Apple M4 Pro. The feedback highlights areas where the evidence distinguishing cache displacement from DRAM contention can be clarified. We respond to each major comment below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract and synchronized Metal experiment description] Abstract and synchronized Metal experiment description: the claim that the second traversal demonstrates residual shared-cache displacement (rather than DRAM contention) is load-bearing, yet the validation against STREAM/BabelStream does not directly test the timing window or rule out OS preemption, TLB/prefetcher state, or probe-induced conflicts with M4 Pro SLC/L2 sizes.

Authors: We agree that the validation with STREAM and BabelStream establishes the reliability of our measurement pipeline for standard workloads but does not specifically address the post-GPU timing window or explicitly rule out the listed confounds. In the revised manuscript, we will add a dedicated paragraph in the experiment description section discussing these points: the tight synchronization via Metal command completion and CPU probe start minimizes preemption opportunities; TLB and prefetcher states are unlikely to explain the differential performance between first and second traversals since the second pass recovers; and the 16 MiB probe size was chosen based on prior SLC size knowledge to avoid boundary conflicts, with supporting sensitivity experiments. These additions will strengthen the claim without altering the core results. revision: partial
Referee: [PMU/IOReport grounding paragraph] PMU/IOReport grounding paragraph: while the counters distinguish L1D refill sectors from software cache-line size and separate performance/efficiency/AGX demand, the manuscript does not present data showing absence of DRAM traffic correlation specifically during the CPU probe window after Metal command completion.

Authors: The presented PMU and IOReport data show strong correlation between L1D refills and the first-traversal slowdown, with recovery on the second traversal, which we interpret as evidence of cache state rather than persistent DRAM contention. We acknowledge, however, that direct measurements of DRAM traffic (such as memory controller activity) specifically during the CPU probe window are not included in the manuscript. Our experiments captured L1D and core demand but not DRAM-level counters in that precise interval. We will revise the PMU grounding paragraph to explicitly state the basis of our inference and add a limitations note regarding the absence of direct DRAM correlation data. This will be a partial revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity: pure empirical measurement study

full rationale

The paper is an empirical characterization of residual GPU cache state on M4 Pro using Metal kernels, CPU probes, STREAM/BabelStream validation, PMU/IOReport data, and QoS tests. No equations, derivations, fitted parameters, or predictions appear in the abstract or described methodology. Central claims rest on direct timing measurements of first vs. second traversals and external hardware counters, not on quantities defined by the experiment itself. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked. The study is self-contained against external benchmarks and reports raw observations rather than any closed-loop reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the work relies on standard assumptions about cache behavior and benchmark validity rather than new free parameters or invented entities.

axioms (1)

domain assumption The adapted 8192-byte system-level-cache occupancy pattern produces measurable and reproducible displacement when transferred to a Metal-synchronized GPU kernel.
Invoked when the authors adapt the pattern to the M4 Pro experiment and interpret first-pass slowdown as cache displacement.

pith-pipeline@v0.9.1-grok · 5723 in / 1230 out tokens · 29628 ms · 2026-06-26T01:55:16.775685+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 5 canonical work pages

[1]

Apple introduces M4 Pro and M4 Max, October 2024

Apple. Apple introduces M4 Pro and M4 Max, October 2024. Accessed 24 June 2026. URL:https: //www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/

2024
[2]

Macbookpro(14-inch,M4ProorM4Max,2024): Technicalspecifications,2024

AppleSupport. Macbookpro(14-inch,M4ProorM4Max,2024): Technicalspecifications,2024. Accessed 24 June 2026. URL:https://support.apple.com/en-us/121553

2024
[3]

Meet the FaM1ly.IEEE Micro, 42(3):78–84, 2022.doi:10.1109/MM.2022.3169245

Michael Mattioli. Meet the FaM1ly.IEEE Micro, 42(3):78–84, 2022.doi:10.1109/MM.2022.3169245

work page doi:10.1109/mm.2022.3169245 2022
[4]

In: ACSAC ’21: Annual Computer Security Appli- cations Conference, Virtual Event, USA, December 6-10, 2021

Patrick Cronin, Xing Gao, Haining Wang, and Chase Cotton. An exploration of ARM system-level cache and GPU side channels. InProceedings of the 37th Annual Computer Security Applications Conference, pages 784–795, 2021.doi:10.1145/3485832.3485902

work page doi:10.1145/3485832.3485902 2021
[5]

Fletcher

Jiyong Yu, Aishani Dutta, Trent Jaeger, David Kohlbrenner, and Christopher W. Fletcher. Synchronization storagechannels(S2C):Timer-lesscacheside-channelattacksontheAppleM1viahardwaresynchronization instructions. In32ndUSENIXSecuritySymposium, pages1973–1990, 2023. URL:https://www.usenix. org/conference/usenixsecurity23/presentation/yu-jiyong

1990
[6]

EXAM: Exploiting exclusive system-level cache in Apple M-Series SoCs for enhanced cache occupancy attacks

Tianhong Xu, Aidong Adam Ding, and Yunsi Fei. EXAM: Exploiting exclusive system-level cache in Apple M-Series SoCs for enhanced cache occupancy attacks. InProceedings of the 20th ACM Asia Conference on Computer and Communications Security, 2025.doi:10.1145/3708821.3710844

work page doi:10.1145/3708821.3710844 2025
[7]

McCalpin

John D. McCalpin. Memory bandwidth and machine balance in current high performance computers.IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pages 19–25, December 1995. URL:https://www.cs.virginia.edu/~mccalpin/papers/balance/

1995
[8]

Tom Deakin, James Price, Matt Martineau, and Simon McIntosh-Smith. Evaluating attainable memory bandwidth of parallel programming models via BabelStream.International Journal of Computational Science and Engineering, 17(3):247–262, 2018.doi:10.1504/IJCSE.2018.095847. 15

work page doi:10.1504/ijcse.2018.095847 2018
[9]

lmbench: Portable tools for performance analy- sis

Larry McVoy and Carl Staelin. lmbench: Portable tools for performance analy- sis. InProceedings of the 1996 USENIX Annual Technical Conference, pages 279– 294, 1996. URL: https://www.usenix.org/conference/usenix-1996-annual-technical- conference/lmbench-portable-tools-performance-analysis

1996
[10]

Starnuma: Mitigating numa challenges with memory pooling,

Pouya Esmaili-Dokht, Francesco Sgherzi, Valeria Soldera Girelli, Isaac Boixaderas, Mariana Carmin, Alireza Monemi, Adria Armejach, Estanislao Mercadal, German Llort, Petar Radojkovic, et al. A mess of memory system benchmarking, simulation and application profiling. In57th IEEE/ACM International Symposium on Microarchitecture, pages 136–152, 2024.doi:10.1...

work page doi:10.1109/micro61859.2024.00020 2024
[11]

McCalpin

John D. McCalpin. STREAM version 5.10 reference source, 2013. Artifact commit 6703f7504a38a8da96b353cadafa64d3c2d7a2d3. URL:https://github.com/jeffhammond/STREAM

2013
[12]

BabelStream version 5.0 source, 2023

University of Bristol HPC Group. BabelStream version 5.0 source, 2023. Artifact commit f6ae48de899408cf50c24079417dc71a03dbb5a8. URL: https://github.com/UoB-HPC/BabelStream

2023
[13]

Apple Developer Documentation Archive, 2006

Apple.sys_dcache_flush(3): Cache Control. Apple Developer Documentation Archive, 2006. Accessed 24 June 2026. URL:https://developer.apple.com/library/archive/documentation/System/ Conceptual/ManPages_iPhoneOS/man3/sys_dcache_flush.3.html

2006
[14]

MTLResource: Storage modes and cpu/gpu visibility, 2026

Apple Developer. MTLResource: Storage modes and cpu/gpu visibility, 2026. Accessed 24 June 2026; checkedagainstthemacOS26.3SDKheader. URL: https://developer.apple.com/documentation/ metal/mtlresource

2026
[15]

MTLBlitCommandEncoder: Texture access optimization, 2026

Apple Developer. MTLBlitCommandEncoder: Texture access optimization, 2026. Accessed 24 June 2026; checkedagainstthemacOS26.3SDKheader. URL: https://developer.apple.com/documentation/ metal/mtlblitcommandencoder. 16

2026

[1] [1]

Apple introduces M4 Pro and M4 Max, October 2024

Apple. Apple introduces M4 Pro and M4 Max, October 2024. Accessed 24 June 2026. URL:https: //www.apple.com/newsroom/2024/10/apple-introduces-m4-pro-and-m4-max/

2024

[2] [2]

Macbookpro(14-inch,M4ProorM4Max,2024): Technicalspecifications,2024

AppleSupport. Macbookpro(14-inch,M4ProorM4Max,2024): Technicalspecifications,2024. Accessed 24 June 2026. URL:https://support.apple.com/en-us/121553

2024

[3] [3]

Meet the FaM1ly.IEEE Micro, 42(3):78–84, 2022.doi:10.1109/MM.2022.3169245

Michael Mattioli. Meet the FaM1ly.IEEE Micro, 42(3):78–84, 2022.doi:10.1109/MM.2022.3169245

work page doi:10.1109/mm.2022.3169245 2022

[4] [4]

In: ACSAC ’21: Annual Computer Security Appli- cations Conference, Virtual Event, USA, December 6-10, 2021

Patrick Cronin, Xing Gao, Haining Wang, and Chase Cotton. An exploration of ARM system-level cache and GPU side channels. InProceedings of the 37th Annual Computer Security Applications Conference, pages 784–795, 2021.doi:10.1145/3485832.3485902

work page doi:10.1145/3485832.3485902 2021

[5] [5]

Fletcher

Jiyong Yu, Aishani Dutta, Trent Jaeger, David Kohlbrenner, and Christopher W. Fletcher. Synchronization storagechannels(S2C):Timer-lesscacheside-channelattacksontheAppleM1viahardwaresynchronization instructions. In32ndUSENIXSecuritySymposium, pages1973–1990, 2023. URL:https://www.usenix. org/conference/usenixsecurity23/presentation/yu-jiyong

1990

[6] [6]

EXAM: Exploiting exclusive system-level cache in Apple M-Series SoCs for enhanced cache occupancy attacks

Tianhong Xu, Aidong Adam Ding, and Yunsi Fei. EXAM: Exploiting exclusive system-level cache in Apple M-Series SoCs for enhanced cache occupancy attacks. InProceedings of the 20th ACM Asia Conference on Computer and Communications Security, 2025.doi:10.1145/3708821.3710844

work page doi:10.1145/3708821.3710844 2025

[7] [7]

McCalpin

John D. McCalpin. Memory bandwidth and machine balance in current high performance computers.IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pages 19–25, December 1995. URL:https://www.cs.virginia.edu/~mccalpin/papers/balance/

1995

[8] [8]

Tom Deakin, James Price, Matt Martineau, and Simon McIntosh-Smith. Evaluating attainable memory bandwidth of parallel programming models via BabelStream.International Journal of Computational Science and Engineering, 17(3):247–262, 2018.doi:10.1504/IJCSE.2018.095847. 15

work page doi:10.1504/ijcse.2018.095847 2018

[9] [9]

lmbench: Portable tools for performance analy- sis

Larry McVoy and Carl Staelin. lmbench: Portable tools for performance analy- sis. InProceedings of the 1996 USENIX Annual Technical Conference, pages 279– 294, 1996. URL: https://www.usenix.org/conference/usenix-1996-annual-technical- conference/lmbench-portable-tools-performance-analysis

1996

[10] [10]

Starnuma: Mitigating numa challenges with memory pooling,

Pouya Esmaili-Dokht, Francesco Sgherzi, Valeria Soldera Girelli, Isaac Boixaderas, Mariana Carmin, Alireza Monemi, Adria Armejach, Estanislao Mercadal, German Llort, Petar Radojkovic, et al. A mess of memory system benchmarking, simulation and application profiling. In57th IEEE/ACM International Symposium on Microarchitecture, pages 136–152, 2024.doi:10.1...

work page doi:10.1109/micro61859.2024.00020 2024

[11] [11]

McCalpin

John D. McCalpin. STREAM version 5.10 reference source, 2013. Artifact commit 6703f7504a38a8da96b353cadafa64d3c2d7a2d3. URL:https://github.com/jeffhammond/STREAM

2013

[12] [12]

BabelStream version 5.0 source, 2023

University of Bristol HPC Group. BabelStream version 5.0 source, 2023. Artifact commit f6ae48de899408cf50c24079417dc71a03dbb5a8. URL: https://github.com/UoB-HPC/BabelStream

2023

[13] [13]

Apple Developer Documentation Archive, 2006

Apple.sys_dcache_flush(3): Cache Control. Apple Developer Documentation Archive, 2006. Accessed 24 June 2026. URL:https://developer.apple.com/library/archive/documentation/System/ Conceptual/ManPages_iPhoneOS/man3/sys_dcache_flush.3.html

2006

[14] [14]

MTLResource: Storage modes and cpu/gpu visibility, 2026

Apple Developer. MTLResource: Storage modes and cpu/gpu visibility, 2026. Accessed 24 June 2026; checkedagainstthemacOS26.3SDKheader. URL: https://developer.apple.com/documentation/ metal/mtlresource

2026

[15] [15]

MTLBlitCommandEncoder: Texture access optimization, 2026

Apple Developer. MTLBlitCommandEncoder: Texture access optimization, 2026. Accessed 24 June 2026; checkedagainstthemacOS26.3SDKheader. URL: https://developer.apple.com/documentation/ metal/mtlblitcommandencoder. 16

2026