
arxiv: 2606.22588 · v1 · pith:6L7JBHN6new · submitted 2026-06-21 · 💻 cs.AR · cs.CR

Non-Uniform L2 Cache Latency Across the Streaming Multiprocessors of an NVIDIA L40

Pith reviewed 2026-06-26 09:33 UTC · model grok-4.3

classification 💻 cs.AR cs.CR
keywords NVIDIA L40 · L2 cache latency · streaming multiprocessor · GPU architecture · non-uniform cache · hardware fingerprinting · performance measurement · AD102

The pith

L2 cache hit latency on the NVIDIA L40 varies by 52 percent depending on which streaming multiprocessor issues the load.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the 96 MiB L2 cache on the L40 cannot be treated as a single uniform pool. Hit latency instead depends on the physical SM that executes the load instruction, producing a reproducible spread from 222 to 339 cycles across all 142 SMs. A turn-serialized probe that resolves the %smid in one kernel launch captures this map with noise below 0.01 cycles per repetition. The variation follows the AD102 GPC layout and is physical, as independent access patterns agree perfectly per SM. If the claim holds, latency-bound kernels must account for SM placement rather than assuming a fixed 279-cycle cost.
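
The summary does not say what the 52 percent is measured against. A short arithmetic check, using the endpoints quoted in the Figure 2 caption (222.5 and 339.2 cycles), suggests one reading that fits: the spread divided by the minimum latency. This is an inference from the quoted numbers, not a definition taken from the paper.

```python
# Endpoints quoted in the Figure 2 caption, in cycles.
lo, hi = 222.5, 339.2
# The single "constant" latency the paper argues against.
nominal = 279.0

span = hi - lo
print(f"span = {span:.1f} cycles")
print(f"relative to minimum: {span / lo:.1%}")
print(f"relative to nominal: {span / nominal:.1%}")
```

Under this reading the spread is about 52 percent of the fastest SM's latency; measured against the 279-cycle figure it would be noticeably smaller.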

Core claim

L2-hit latency is not a constant near 279 cycles; it spans 222-339 cycles (a 52 percent range) depending on the issuing SM. An additive model L = μ + a(sm) + b(slice) explains 87 percent of the variance (R² = 0.87; 0.98 with one added rank-1 term), and the SM term is two-fold symmetric with r = 0.999. The pattern is stable and device-specific, and it enables 92 percent SM identification plus 100 percent separation of two identical-model L40 cards.
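
To make the additive model concrete, here is a minimal sketch of fitting L = μ + a(sm) + b(slice) and computing R². It assumes a complete SM × slice grid, in which case the least-squares fit reduces to row and column means. The function names and the synthetic cycle counts are illustrative assumptions, not the paper's artifact or data, and the rank-1 interaction term is not fitted here.

```python
from statistics import mean

def fit_additive(L):
    """Least-squares fit of L[i][j] ~ mu + a[i] + b[j] on a complete grid.

    On a balanced grid the solution is the grand mean plus row-mean and
    column-mean deviations (so that sum(a) == sum(b) == 0).
    """
    n_sm, n_slice = len(L), len(L[0])
    mu = mean(v for row in L for v in row)
    a = [mean(row) - mu for row in L]
    b = [mean(L[i][j] for i in range(n_sm)) - mu for j in range(n_slice)]
    return mu, a, b

def r_squared(L, mu, a, b):
    """Fraction of the variance in L explained by mu + a[i] + b[j]."""
    ss_tot = sum((v - mu) ** 2 for row in L for v in row)
    ss_res = sum((L[i][j] - (mu + a[i] + b[j])) ** 2
                 for i in range(len(L)) for j in range(len(L[0])))
    return 1.0 - ss_res / ss_tot

# Synthetic illustration only: 4 SMs x 3 slices, made-up cycle counts.
true_a = [-20.0, -5.0, 5.0, 20.0]
true_b = [-3.0, 0.0, 3.0]
grid = [[280.0 + ai + bj for bj in true_b] for ai in true_a]
grid[0][0] += 6.0  # an SM x slice interaction the additive model cannot capture

mu, a, b = fit_additive(grid)
r2 = r_squared(grid, mu, a, b)
print(f"mu={mu:.2f}  R^2={r2:.4f}")
```

Any residual left after this fit is SM × slice interaction, which is the part the paper's extra rank-1 term is reported to absorb.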

What carries the argument

The turn-serialized, %smid-resolved probe that resolves per-SM L2 hit latency across all 142 SMs in a single launch.

If this is right

  • Distributing latency-bound work according to the per-SM map reduces makespan by up to 11 percent.
  • A kernel can read its own SM placement inside the device at 92 percent accuracy from the latency map.
  • The same probe distinguishes two physically identical L40 cards at 100 percent accuracy despite near-identical mean latency.
  • The non-uniform pattern also appears on a Blackwell RTX 5090, so the effect is not limited to the L40.
  • The per-SM map remains unchanged after one hour of full device utilization.
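
The makespan claim can be illustrated with a toy model. The sketch below assumes each work item costs a fixed number of dependent L2 hits, so its run time on an SM is proportional to that SM's latency; it compares an even split with a greedy latency-aware split. The function names and the five latencies are hypothetical, and the printed reduction describes this toy input only, not the paper's 11 percent result.

```python
import heapq

def oblivious_makespan(latency, n_items):
    """Even split: every SM receives (almost) the same number of items."""
    n_sm = len(latency)
    base, extra = divmod(n_items, n_sm)
    counts = [base + (1 if s < extra else 0) for s in range(n_sm)]
    return max(c * lat for c, lat in zip(counts, latency))

def aware_makespan(latency, n_items):
    """Greedy split: each item goes to the SM that would finish it earliest."""
    # Heap entries: (finish time if this SM takes one more item, SM index).
    heap = [(lat, s) for s, lat in enumerate(latency)]
    heapq.heapify(heap)
    finish = [0.0] * len(latency)
    for _ in range(n_items):
        t, s = heapq.heappop(heap)
        finish[s] = t
        heapq.heappush(heap, (t + latency[s], s))
    return max(finish)

# Hypothetical per-item cost on each SM, in cycles (not the paper's map).
latency = [222.5, 250.0, 279.0, 310.0, 339.2]
base = oblivious_makespan(latency, 1000)
aware = aware_makespan(latency, 1000)
print(f"oblivious={base:.0f}  aware={aware:.0f}  reduction={1 - aware / base:.1%}")
```

The toy model ignores everything except L2-hit latency, which is consistent with the paper's note that the benefit vanishes once the workload becomes DRAM-bandwidth bound.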

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the map is physical and stable, similar SM-resolved probes could expose layout effects on other NVIDIA GPU families.
  • Schedulers could prefer lower-latency SMs when assigning threads from latency-sensitive kernels.
  • The device fingerprint could support lightweight hardware attestation that requires no secret extraction.
  • The two-fold symmetry aligned with GPC boundaries points to interconnect or memory-controller placement as the root cause.

Load-bearing premise

The probe method isolates true per-SM L2 hit latency without confounding from memory-controller arbitration, warp scheduling, or prefetch behavior.

What would settle it

Re-running the probe on the same L40 and obtaining per-SM latencies that differ by more than 1 cycle from the published map on any SM would falsify the reproducibility and physical-origin claims.
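
The falsification test above amounts to a per-SM comparison of two maps. A minimal sketch, assuming both maps are plain lists of mean latencies indexed by SM identifier; `deviating_sms` and the four-SM maps are made up for illustration.

```python
def deviating_sms(published, rerun, tol_cycles=1.0):
    """Return [(sm, delta)] for SMs whose latency moved by more than tol_cycles."""
    if len(published) != len(rerun):
        raise ValueError("maps must cover the same SMs")
    return [(sm, r - p)
            for sm, (p, r) in enumerate(zip(published, rerun))
            if abs(r - p) > tol_cycles]

# Made-up four-SM maps, in cycles.
published = [250.0, 262.4, 281.1, 305.7]
rerun_ok = [250.0, 262.5, 281.0, 305.7]
rerun_bad = [250.0, 262.4, 283.0, 305.7]

print(deviating_sms(published, rerun_ok))   # empty list: reproduced within tolerance
print(deviating_sms(published, rerun_bad))  # one entry, for SM index 2
```

An empty result means the rerun reproduces the map within the tolerance; any entry names an SM that would count against the reproducibility claim.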

Figures

Figures reproduced from arXiv: 2606.22588 by Baris Basaran, Faruk Alpay.

Figure 1. Mean L2-hit latency of every physical SM. Panel (a) plots latency against the SM identifier. The value is not constant: it ranges from 249.8 to 307.0 cycles when averaged over slices, and the curve has visible internal structure rather than noise. Panel (b) shows that this structure is two-fold symmetric. Splitting the SM-placement term a(sm) at identifier 72 yields two halves of 72 and 70 SMs who…
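
The two-fold symmetry check can be sketched as a Pearson correlation between the two halves of the SM term. The caption is truncated, so how the paper aligns halves of 72 and 70 SMs is not visible here; this sketch assumes an index-by-index comparison truncated to the shorter half, which may differ from the authors' procedure. The data is synthetic.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def half_symmetry(a_sm, split=72):
    """Correlate a_sm[:split] with a_sm[split:], truncated to the shorter half.

    Assumption: halves are compared in the same index order.
    """
    first, second = a_sm[:split], a_sm[split:]
    n = min(len(first), len(second))
    return pearson(first[:n], second[:n])

# Synthetic SM term for 142 SMs: the second half repeats the first with an offset.
pattern = [((i * 37) % 72) - 35.5 for i in range(72)]
a_sm = pattern + [v + 0.5 for v in pattern[:70]]
r = half_symmetry(a_sm)
print(f"r = {r:.3f}")
```

A value near 1 indicates that the second half repeats the shape of the first, which is what the reported r = 0.999 describes.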
Figure 2. L2-hit latency L(sm, slice) over all 142 SMs and 256 slice probes. SMs are ordered by mean latency and probes by slice term. Latency rises smoothly from 222.5 to 339.2 cycles; the vertical striping is the per-slice term with a 512 B interleave period.
Figure 3. (a) Additive model μ + a(sm) + b(slice) against measured latency (R² = 0.87); the diagonal banding is the SM×slice interaction. (b) Per-SM latency spectrum: a continuous 57-cycle spread with no discrete partition step. Two finer structures sharpen this picture. First, the map is nearly low-rank. Adding a single rank-1 interaction term c u(sm) v(slice) to the additive model raises R² from 0.87 to 0.98 on th…
Figure 4. Per-SM L2-hit latency on the Ada L40 and the Blackwell RTX 5090, measured with the same probe. (a) Latency by physical SM (normalized index); both are non-uniform, and the 5090 sits higher. (b) Per-SM latency spectra in nanoseconds; the two devices occupy disjoint bands. The fingerprint is therefore architecture-specific. The L40-trained oracle of Section 4, applied to 5090 fingerprints, drops to 0.6% accu…
Figure 5. Two physically distinct, identical-model L40s: their per-SM latency maps correlate at only r = 0.63 despite near-identical means, so each die has its own pattern. The two devices are separated at 100% from these fingerprints and the per-SM oracle does not transfer between dies (figures in text). 6.2 Physical-location inference Pooling the first L40 and the RTX 5090 into a 312-location label space (142 + 17…
Figure 6. Physical-location inference: the fingerprint space (first two principal components); the two devices occupy disjoint regions, each with an internal per-SM gradient. The 312-way location accuracy (64.6% from one probe, 92.1% from 32) is given in the text.
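
The gain from averaging several probes can be illustrated with a nearest-fingerprint matcher. The paper's "oracle" is a trained model whose form is not given in this summary, so the sketch below substitutes plain nearest-neighbour matching on synthetic fingerprints. All names, sizes, and the noise level are assumptions chosen for illustration, and the printed accuracies describe only this synthetic setup.

```python
import random

def identify(reference, observed):
    """Index of the reference fingerprint closest to `observed` (squared L2)."""
    def dist2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(reference)), key=lambda i: dist2(reference[i], observed))

def average(probes):
    """Element-wise mean of several probe vectors."""
    n = len(probes)
    return [sum(col) / n for col in zip(*probes)]

def accuracy(reference, noise, n_probes, trials, rng):
    """Fraction of trials in which the averaged noisy probe maps to its true location."""
    hits = 0
    for _ in range(trials):
        true = rng.randrange(len(reference))
        probes = [[v + rng.gauss(0.0, noise) for v in reference[true]]
                  for _ in range(n_probes)]
        hits += identify(reference, average(probes)) == true
    return hits / trials

rng = random.Random(0)
# Synthetic reference map: 16 locations, each an 8-slice latency vector (cycles).
reference = [[rng.uniform(222.0, 339.0) for _ in range(8)] for _ in range(16)]
acc_1 = accuracy(reference, noise=20.0, n_probes=1, trials=200, rng=rng)
acc_32 = accuracy(reference, noise=20.0, n_probes=32, trials=200, rng=rng)
print(f"1 probe: {acc_1:.1%}   32 probes: {acc_32:.1%}")
```

Averaging k independent probes shrinks the noise standard deviation by a factor of √k, which is one plausible reason accuracy would rise between one probe and 32.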
Figure 7. Makespan reduction over the oblivious baseline for a fixed latency-bound workload. The model-based static schedule (aware) matches or beats runtime work-stealing in the L2-resident regime and gives essentially nothing once the workload is DRAM-bandwidth bound.
Figure 8. The L40 per-SM map after one hour of full GPU and VRAM load matches the independently measured idle map at r = 1.000. Across the run the snapshot-to-snapshot correlation stays at 1.000 on both devices (in text). The per-SM map is essentially invariant. On the L40 the snapshot-to-snapshot correlation has median 1.000 and the per-SM mean drifts by at most 0.08 cycles over the hour at 100% utilization and 63 …
Figure 9. Dependent-load capacity sweep (128 B stride). The median rises from 301.3 cycles at 92 MiB to 631.9 cycles at 104 MiB, bracketing the nominal 96 MiB L2 boundary. Varying the stride separates the two
Figure 10. Sustained 40 GiB operating-regime sweep. Low arithmetic intensity is bandwidth-limited near 597 GB/s; high intensity becomes power-limited near 300 W and reaches 48.38 TFLOP/s. 11 Related Work and Discussion GPU memory hierarchies have been reverse-engineered by microbenchmarking for several generations, establishing cache sizes, line sizes, latencies, and associativities for Volta, Ampere, Hopper, and ear…
Original abstract

The NVIDIA L40 exposes a 96 MiB L2 cache usually modeled as one uniform pool with a single hit latency. We show this is wrong at the granularity a kernel sees: L2-hit latency depends strongly and reproducibly on which physical streaming multiprocessor (SM) issues the load. A turn-serialized, %smid-resolved probe maps the hit latency across all 142 SMs in one launch; it is not a constant near 279 cycles but spans 222-339 cycles (a 52% range), with per-repetition noise below 0.01 cycles. An additive model $L = \mu + a(\mathrm{sm}) + b(\mathrm{slice})$ explains $R^2 = 0.87$ (0.98 with one rank-1 term), and the SM term is two-fold symmetric (two halves of 72 SMs at correlation $r = 0.999$), following the AD102 GPC layout. Independent access patterns agree per SM at $r = 1.000$, so the effect is physical. The same probe on a Blackwell RTX 5090 shows it generalizes, while the per-die pattern is device-specific. Read as a fingerprint, a single user-level probe identifies the SM within a device at 92%, and two physically identical L40s are separated at 100% despite near-identical mean latency (per-SM map $r = 0.63$): a per-die hardware identity, not a clock artifact. This is a self-localization and fingerprinting primitive: a kernel reads its own placement and device, not a victim's, and extracts no secret data. The map is stable, unchanged after an hour at full utilization on both devices. As a consequence, distributing latency-bound work by the map cuts makespan by up to 11%. Single-thread capacity, line-tag, prefetch-modifier, and persisting-L2 results appear as controls. The artifact contains seeds, raw observations, the trained model, and regeneration scripts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents an empirical characterization of L2 cache hit latency on the NVIDIA L40 GPU, demonstrating that it varies significantly depending on the issuing streaming multiprocessor (SM). Using a turn-serialized probe resolved by the %smid register, the authors map latencies across all 142 SMs, finding a range of 222-339 cycles with per-repetition noise below 0.01 cycles. They fit an additive model L = μ + a(sm) + b(slice) achieving R² = 0.87 (0.98 with an additional rank-1 term), show two-fold symmetry following the AD102 GPC layout, and validate with independent access patterns (r = 1.000), cross-device tests on Blackwell, and controls for single-thread capacity, line-tag, prefetch, and persisting L2. The work also demonstrates applications in hardware fingerprinting (92% SM identification, 100% device separation) and performance improvement (up to 11% makespan reduction).

Significance. If the central empirical result holds, this work is significant for the field of computer architecture as it provides the first detailed evidence of per-SM non-uniformity in L2 cache latency on modern NVIDIA GPUs, contradicting the uniform pool assumption. The high-precision measurements, strong cross-validation (r=1.000), device-specific maps, and provision of the artifact containing seeds, raw observations, the trained model, and regeneration scripts are notable strengths that enhance reproducibility and allow falsification. This could impact performance modeling, workload scheduling, and hardware identification techniques.

minor comments (2)
  1. [Abstract] The parenthetical R² = 0.98 with one rank-1 term is stated without derivation; move the explanation of this term and its interpretation to the main text, near the model definition.
  2. [Probe method] Controls (single-thread capacity, line-tag, prefetch-modifier, persisting-L2) are listed, but the section should add a short paragraph confirming that the turn-serialized design was tested against memory-controller arbitration and warp-scheduling confounds, so that the isolation claim is self-contained.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report contains no specific major comments to address.

Circularity Check

0 steps flagged

Empirical measurement of hardware latency variation; no derivation reduces to inputs

full rationale

The paper's central claim is a direct empirical observation: L2-hit latency varies 222-339 cycles across 142 SMs, measured via a turn-serialized %smid-resolved probe in a single launch. This raw span, low noise (<0.01 cycles), and cross-validation (r=1.000 across patterns, GPC symmetry) are reported as observed facts independent of any model or prior equations. The additive model L = μ + a(sm) + b(slice) is fitted post-measurement to explain R²=0.87 and is not invoked to predict or derive the existence of the variation. No self-citations, uniqueness theorems, or ansatzes appear in the provided text as load-bearing for the primary result. Controls (single-thread capacity, prefetch-modifier, persisting-L2) and device-specific maps are presented as independent checks. The work is self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim rests on the assumption that the custom probe correctly isolates L2 hits and that the observed variation is physical rather than an artifact of the measurement technique or scheduling. The additive model introduces two sets of fitted per-SM and per-slice terms.

free parameters (2)
  • a(sm)
    Per-SM additive latency offset fitted to the measured data for each of the 142 SMs
  • b(slice)
    Per-slice additive latency offset fitted to the measured data
axioms (1)
  • domain assumption: The turn-serialized %smid-resolved probe measures true L2 hit latency without confounding effects from other GPU components
    Invoked when the authors interpret the 222-339 cycle range as physical L2 non-uniformity

pith-pipeline@v0.9.1-grok · 5913 in / 1500 out tokens · 26689 ms · 2026-06-26T09:33:36.485888+00:00 · methodology

