
arxiv: 2606.24934 · v1 · pith:MMJKNQDSnew · submitted 2026-06-22 · 💻 cs.CR · cs.AR

Unprivileged Topology Certificates for Cloud GPU Attestation

Pith reviewed 2026-06-26 07:59 UTC · model grok-4.3

classification 💻 cs.CR cs.AR
keywords cloud GPU attestation · latency fingerprint · CUDA probe · topology certificate · unprivileged attestation · physical fingerprint · network landmarks · HBM sweep

The pith

CUDA latency maps from ordinary code create certificates attesting cloud GPU identity, class, and coarse location without privileged access or vendor keys.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ordinary CUDA code running on cloud GPUs can generate certificates attesting to the physical identity of the accelerator, its hardware topology class, and a coarse geographic location. These certificates rely on a measured SM-by-memory-region latency matrix that acts as a stable fingerprint with very low temporal variation, combined with HBM sweep data for topology and public network probes for location. A verifier can check the committed statistics and hashes without needing access to a GPU. If these measurements cannot be forged undetectably, cloud tenants gain a way to confirm they are using the claimed hardware rather than a substitute. This addresses the lack of direct inspection in cloud environments where only model name and region are provided.
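A minimal sketch of what a GPU-free check of such a certificate could look like, assuming the certificate is a small JSON-like object holding summary statistics plus SHA-256 digests of the raw archive and the probe source. The field names (`stats`, `archive_sha256`, `code_sha256`) and the helpers `make_certificate` and `verify_certificate` are hypothetical and are not taken from the paper.

```python
import hashlib
import json
import math
import statistics


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_certificate(raw_rows, probe_source: bytes):
    """Prover side: commit summary statistics and hashes (hypothetical layout)."""
    archive = json.dumps(raw_rows, sort_keys=True).encode()
    cycles = [row["cycles"] for row in raw_rows]
    cert = {
        "stats": {"n": len(cycles), "mean": statistics.fmean(cycles)},
        "archive_sha256": sha256_hex(archive),
        "code_sha256": sha256_hex(probe_source),
    }
    return cert, archive


def verify_certificate(cert, archive: bytes, expected_code_sha256: str) -> bool:
    """Verifier side: CPU-only recomputation from the shipped archive."""
    if sha256_hex(archive) != cert["archive_sha256"]:
        return False
    if cert["code_sha256"] != expected_code_sha256:
        return False
    cycles = [row["cycles"] for row in json.loads(archive)]
    return (len(cycles) == cert["stats"]["n"]
            and math.isclose(statistics.fmean(cycles), cert["stats"]["mean"]))


# Invented timing rows and a placeholder probe source, for illustration only.
rows = [{"sm": 0, "region": 0, "cycles": 412},
        {"sm": 1, "region": 0, "cycles": 418}]
probe = b"// placeholder probe source"
cert, archive = make_certificate(rows, probe)
ok = verify_certificate(cert, archive, sha256_hex(probe))
tampered = verify_certificate(cert, archive.replace(b"412", b"413"),
                              sha256_hex(probe))
```

A check of this kind establishes only that the archive, statistics, and code hash agree with each other. It says nothing about whether the timings came from the claimed hardware.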

Core claim

The paper claims that a software-only CUDA probe measures an SM-by-memory-region latency matrix using physical SM labels and dependent global loads. A streaming reducer commits sufficient statistics, configuration, code hashes, network evidence, and a compressed raw data archive into a certificate that a verifier can check without a GPU. This supports three claims: the per-SM latency map is a stable physical fingerprint with median temporal jitter of 0.09 cycles over a six-hour full-load RTX 5090 run and 100.0% shape-only leave-one-out classification accuracy for distinct Blackwell dies; cache-bypassing HBM sweeps recover hardware-class topology across generations including a unified Volta V

What carries the argument

The per-SM latency matrix measured via dependent global loads using physical SM labels, which serves as a stable physical fingerprint.
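To make the object concrete, here is a minimal sketch of how raw timing rows might be reduced to an SM-by-memory-region matrix. It assumes each row is a `(sm_id, region_id, cycles)` tuple and uses the per-cell median as the summary. Both the row layout and the choice of median are assumptions for illustration, not the paper's stated reducer.

```python
from collections import defaultdict
from statistics import median


def build_latency_matrix(rows):
    """rows: iterable of (sm_id, region_id, cycles).

    Returns (sm_ids, region_ids, matrix), where matrix[i][j] is the median
    latency seen from sm_ids[i] to region_ids[j], or None if never measured.
    """
    cells = defaultdict(list)
    for sm, region, cycles in rows:
        cells[(sm, region)].append(cycles)
    sm_ids = sorted({sm for sm, _ in cells})
    region_ids = sorted({region for _, region in cells})
    matrix = [[median(cells[(sm, rg)]) if (sm, rg) in cells else None
               for rg in region_ids]
              for sm in sm_ids]
    return sm_ids, region_ids, matrix


# Invented timing rows, for illustration only.
rows = [(0, 0, 410), (0, 0, 412), (0, 0, 411), (0, 1, 440),
        (1, 0, 415), (1, 1, 470), (1, 1, 472)]
sm_ids, region_ids, matrix = build_latency_matrix(rows)
```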

If this is right

  • The per-SM latency map remains stable with median temporal jitter of 0.09 cycles over six-hour full-load runs on RTX 5090.
  • Shape-only leave-one-out classification separates distinct Blackwell dies with 100.0% accuracy.
  • Cache-bypassing HBM sweeps recover hardware-class topology across generations, including specific cross-die penalties in Blackwell B200.
  • 169 RIPE Atlas probes localize a B200 server within 44 km of its claimed datacentre and reject all 11 decoy sites.
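One way to read "shape-only leave-one-out classification" is sketched below: each latency vector is mean-centred and scaled to unit norm, so that only its pattern remains, and each sample is then labelled by its nearest neighbour among all other samples. The 1-nearest-neighbour rule, the normalisation, and the data are assumptions made for illustration. The paper's actual classifier is not described in this summary.

```python
import math


def shape_only(vec):
    """Remove the mean and scale to unit norm so only the pattern remains."""
    mu = sum(vec) / len(vec)
    centred = [v - mu for v in vec]
    norm = math.sqrt(sum(c * c for c in centred)) or 1.0
    return [c / norm for c in centred]


def loo_accuracy(samples):
    """samples: list of (label, latency_vector). 1-NN leave-one-out accuracy."""
    shaped = [(label, shape_only(vec)) for label, vec in samples]
    correct = 0
    for i, (label, vec) in enumerate(shaped):
        _, predicted = min((math.dist(vec, other), other_label)
                           for j, (other_label, other) in enumerate(shaped)
                           if j != i)
        correct += predicted == label
    return correct / len(shaped)


# Synthetic vectors for two hypothetical dies; a constant offset between
# runs (for example +3 cycles) is removed by the shape-only normalisation.
samples = [
    ("die_A", [400, 410, 405, 420]),
    ("die_A", [403, 413, 408, 423]),
    ("die_A", [400, 410.2, 405, 419.9]),
    ("die_B", [400, 405, 420, 410]),
    ("die_B", [402, 407, 422, 412]),
    ("die_B", [400, 405.1, 420, 410.2]),
]
accuracy = loo_accuracy(samples)
```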

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the fingerprint is unforgeable, tenants could continuously monitor jobs to detect any runtime hardware substitution.
  • The approach might extend to continuous attestation during long-running workloads by re-measuring the latency matrix periodically.
  • Third-party auditors without GPU access could use the certificates to verify cloud provider claims on hardware class and location.
  • Similar per-core or per-unit latency patterns could be explored for attestation on non-GPU accelerators with hierarchical memory.
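The periodic re-measurement idea can be sketched as a drift check against an enrolled baseline. The jitter definition used here (per-cell median absolute deviation over time, then the median over cells) and the threshold passed to `matches_baseline` are assumptions, since this page does not state the paper's exact definition.

```python
from statistics import median


def median_temporal_jitter(snapshots):
    """snapshots: list of flattened latency matrices taken at different times."""
    per_cell = []
    for cell_values in zip(*snapshots):
        centre = median(cell_values)
        per_cell.append(median(abs(v - centre) for v in cell_values))
    return median(per_cell)


def matches_baseline(baseline, fresh, max_median_shift):
    """True if the typical per-cell shift stays within the chosen threshold."""
    shift = median(abs(a - b) for a, b in zip(baseline, fresh))
    return shift <= max_median_shift


# Invented snapshots of a 2x2 matrix, flattened row by row.
snapshots = [
    [410.0, 440.0, 415.0, 470.0],
    [410.1, 440.0, 414.9, 470.1],
    [409.9, 440.1, 415.0, 470.0],
]
jitter = median_temporal_jitter(snapshots)
same_device = matches_baseline(snapshots[0], snapshots[1], max_median_shift=0.5)
other_device = matches_baseline(snapshots[0], [410.0, 470.0, 415.0, 440.0],
                                max_median_shift=0.5)
```

In practice the threshold would have to be chosen from the observed jitter of the enrolled device, which is why the stability figure matters for this extension.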

Load-bearing premise

The latency matrix and network landmarks measured by ordinary CUDA code cannot be forged or altered by the cloud provider or hypervisor without detectable changes to the reported statistics or hashes.

What would settle it

A hypervisor that intercepts CUDA calls, supplies a forged latency matrix and network responses matching the expected certificate hashes and statistics, yet runs on different hardware without producing detectable statistical deviations.

Figures

Figures reproduced from arXiv: 2606.24934 by Faruk Alpay, Taylan Alpay.

Figure 1: The attestation pipeline. The remote GPU produces raw timing rows; the artifact ships bounded summaries and certificates to verifiers, and the arXiv package carries the compressed raw data archive when it fits the source budget.
Figure 2: The on-chip network as a congestion-controlled fabric (RTX PRO 6000, 85 GiB resident). Achieved goodput (left axis) rises with offered concurrency and saturates at the dotted line, 1637 GB/s, past the knee marked by the dashed vertical line near 3008 concurrent warps; the effective per-line service time (right axis) falls from 11.6 to 0.08 ns as concurrency hides latency. The on-chip-network sweep supports…
Original abstract

Cloud GPU tenants receive a model name and a region, but cannot directly inspect the physical accelerator that runs their job. We present a software-only attestation primitive for this setting. A CUDA probe measures an SM-by-memory-region latency matrix using physical SM labels and dependent global loads. A streaming reducer commits sufficient statistics, configuration, code hashes, network evidence, and a compressed raw data archive into a certificate that a verifier can check without a GPU. The certificate supports three claims. First, the per-SM latency map is a stable physical fingerprint. Over a six-hour full-load RTX 5090 run, its median temporal jitter is 0.09 cycles, while shape-only leave-one-out classification separates distinct Blackwell dies with 100.0% accuracy. Second, cache-bypassing HBM sweeps recover hardware-class topology across generations, including a unified Volta V100 memory domain, a two-way Hopper H200 L2 split, and a Blackwell B200 two-die NV-HBI package whose 74/74 SM partition carries a 30-cycle, 15.5 ns cross-die penalty. Third, public network landmarks bind the same certificate to a coarse location. In the B200 run, 169 RIPE Atlas probes place the server within 44 km of its claimed datacentre and reject all 11 decoy sites. Together, these measurements check cloud-GPU identity, class, and coarse location without privileged access or a vendor key.
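The two-die partition can be illustrated with a small sketch that splits SMs at the largest gap in their latency to one memory region and reports the group sizes and the penalty between groups. The split rule and the synthetic latencies (around 700 and 730 cycles) are assumptions chosen only to mirror the 74/74 split and 30-cycle figure quoted above. The last lines note the clock implied if 30 cycles and 15.5 ns describe the same interval.

```python
from statistics import median


def split_at_largest_gap(latency_by_sm):
    """Split SMs into a near and a far group at the largest latency gap."""
    ordered = sorted(latency_by_sm.items(), key=lambda kv: kv[1])
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    near, far = ordered[:cut], ordered[cut:]
    penalty = median(v for _, v in far) - median(v for _, v in near)
    return [sm for sm, _ in near], [sm for sm, _ in far], penalty


# Synthetic per-SM latencies (cycles) to a single memory region, 148 SMs.
latency_by_sm = {sm: (700 if sm < 74 else 730) + sm % 5 for sm in range(148)}
near_sms, far_sms, penalty_cycles = split_at_largest_gap(latency_by_sm)

# Arithmetic on the quoted figures: 30 cycles in 15.5 ns implies this clock.
implied_clock_ghz = 30 / 15.5  # cycles per nanosecond, about 1.94 GHz
```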

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a software-only attestation primitive for cloud GPUs. A CUDA probe constructs an SM-by-memory-region latency matrix via physical SM labels and dependent global loads; a streaming reducer commits sufficient statistics, code hashes, configuration, network evidence, and a compressed archive into a verifiable certificate. The certificate supports three claims: (1) the per-SM latency map is a stable physical fingerprint (0.09-cycle median jitter over 6 h on RTX 5090; 100% shape-only leave-one-out classification on distinct Blackwell dies), (2) cache-bypassing HBM sweeps recover hardware-class topology across V100/H200/B200 generations (including a 30-cycle cross-die penalty on B200), and (3) RIPE Atlas landmarks bind the certificate to coarse location (169 probes place a B200 server within 44 km of its claimed datacentre and reject 11 decoys).
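For the location claim, a standard constraint-based feasibility test gives the flavour of how landmarks can accept one site and reject decoys. Each probe's round-trip time bounds the distance to the server, because added delay can only inflate the RTT. This is a generic technique and not necessarily the paper's procedure. The coordinates, RTTs, and the rough figure of 200 km per millisecond in fibre are illustrative assumptions.

```python
import math

KM_PER_MS_IN_FIBRE = 200.0  # roughly two thirds of the vacuum speed of light


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def site_is_feasible(site, probes):
    """probes: list of ((lat, lon), rtt_ms). True if no probe rules the site out."""
    return all(haversine_km(loc, site) <= rtt_ms / 2 * KM_PER_MS_IN_FIBRE
               for loc, rtt_ms in probes)


# Invented probe locations and RTTs, for illustration only.
probes = [((50.00, 8.50), 1.0), ((48.86, 2.35), 12.0), ((52.52, 13.40), 10.0)]
claimed_site = (50.11, 8.68)
decoy_site = (40.71, -74.01)
claimed_ok = site_is_feasible(claimed_site, probes)
decoy_ok = site_is_feasible(decoy_site, probes)
```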

Significance. If the unforgeability and stability claims hold, the work supplies a practical, vendor-key-free method for tenants to verify GPU identity, class, and location in shared cloud environments. The empirical components—long-duration jitter measurements, cross-generation topology recovery, and public-network landmark binding—are concrete strengths that could be directly reused or extended.

major comments (2)
  1. [Abstract, §3 (certificate construction), §5 (evaluation)] The central attestation claim (that the latency matrix and RIPE Atlas landmarks cannot be forged or altered by a hypervisor without detectable changes to hashes or statistics) is load-bearing yet unsupported by any adversarial evaluation. No section examines attacks such as memory-region remapping, scheduler virtualization, or controlled per-load delay injection that preserve the reported CUDA code hash, configuration, and streaming-reducer statistics.
  2. [§4.1, §4.2] §4.1 and §4.2 report concrete stability figures (0.09-cycle median jitter, 100% leave-one-out accuracy, 30-cycle cross-die penalty) without error bars, full exclusion criteria, or statistical tests on the underlying distributions; this directly affects the reliability of the fingerprint and topology claims.
minor comments (2)
  1. [Figures 3–5, Table 2] Figure captions and tables should explicitly state the number of independent runs and any filtering applied to the latency samples.
  2. [§2.2] Notation for SM labels and memory regions is introduced without a consolidated glossary; a short table mapping labels to hardware units would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the attestation claims and the statistical presentation of results. We address each major comment below, indicating planned revisions.

  1. Referee: [Abstract, §3 (certificate construction), §5 (evaluation)] The central attestation claim (that the latency matrix and RIPE Atlas landmarks cannot be forged or altered by a hypervisor without detectable changes to hashes or statistics) is load-bearing yet unsupported by any adversarial evaluation. No section examines attacks such as memory-region remapping, scheduler virtualization, or controlled per-load delay injection that preserve the reported CUDA code hash, configuration, and streaming-reducer statistics.

    Authors: We agree that the manuscript does not contain adversarial evaluations against hypervisor attacks such as memory remapping or delay injection. The work centers on the construction of the certificate from unprivileged CUDA measurements and its observed stability and topology properties under normal execution; the code hashes and reducer statistics are included to enable detection of gross tampering, but no claim of resistance to the specific attacks listed is supported by experiments. In revision we will add a limitations subsection to §5 that explicitly enumerates these attack vectors, clarifies the scope of the current empirical claims, and identifies them as directions for future adversarial analysis. revision: yes

  2. Referee: [§4.1, §4.2] §4.1 and §4.2 report concrete stability figures (0.09-cycle median jitter, 100% leave-one-out accuracy, 30-cycle cross-die penalty) without error bars, full exclusion criteria, or statistical tests on the underlying distributions; this directly affects the reliability of the fingerprint and topology claims.

    Authors: We accept that the reported figures would be strengthened by additional statistical detail. In the revised manuscript we will update §4.1 and §4.2 to include error bars (standard deviation and interquartile range) on the jitter and cross-die penalty measurements, provide an explicit account of sample exclusion criteria, and add statistical tests (e.g., two-sample Kolmogorov-Smirnov tests on latency distributions and bootstrap confidence intervals on classification accuracy) to support the leave-one-out results and generational topology distinctions. revision: yes
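The statistical additions promised in the second response can be sketched with the standard library alone: a two-sample Kolmogorov-Smirnov statistic computed from empirical CDFs, and a percentile bootstrap interval for a summary statistic. The latency samples are synthetic, and the resample count and seed are arbitrary choices for illustration.

```python
import random
from bisect import bisect_right
from statistics import median


def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    xs, ys = sorted(xs), sorted(ys)
    return max(abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
               for v in xs + ys)


def bootstrap_ci(values, stat=median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for stat(values)."""
    rng = random.Random(seed)
    n = len(values)
    stats = sorted(stat(rng.choices(values, k=n)) for _ in range(n_resamples))
    lower = stats[int(alpha / 2 * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic latency samples (cycles) for two hypothetical SM groups.
near = [700, 701, 702, 703, 704] * 4
far = [730, 731, 732, 733, 734] * 4
d_between = ks_statistic(near, far)   # fully separated samples
d_within = ks_statistic(near, near)   # identical samples
ci_low, ci_high = bootstrap_ci(near)
```

A statistic near 1 indicates distributions that barely overlap, while the bootstrap interval gives the kind of error bar the referee asked for on medians or accuracies.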

Circularity Check

0 steps flagged

No circularity: claims rest on direct empirical measurements


The paper presents empirical observations from CUDA probes (SM-by-memory latency matrices, temporal jitter of 0.09 cycles, leave-one-out classification accuracy, HBM topology sweeps, and RIPE Atlas network landmarks) without any derivation chain, equations, or first-principles predictions. No step reduces a claimed result to fitted parameters, self-definitions, or self-citations by construction; the reported fingerprints and topology recoveries are direct measurements rather than quantities defined by the same data. The central claims are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on empirical stability of latency patterns rather than formal axioms or derivations; no free parameters are explicitly fitted in the abstract, though classification thresholds are implicit.

axioms (1)
  • domain assumption: The latency matrix measured by user-level CUDA code reflects stable physical properties of the GPU die and memory system.
    Invoked to support the fingerprint and topology claims.

pith-pipeline@v0.9.1-grok · 5785 in / 1332 out tokens · 14344 ms · 2026-06-26T07:59:13.128694+00:00 · methodology

