Unprivileged Topology Certificates for Cloud GPU Attestation
Pith reviewed 2026-06-26 07:59 UTC · model grok-4.3
The pith
CUDA latency maps from ordinary code create certificates attesting cloud GPU identity, class, and coarse location without privileged access or vendor keys.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a software-only CUDA probe measures an SM-by-memory-region latency matrix using physical SM labels and dependent global loads. A streaming reducer commits sufficient statistics, configuration, code hashes, network evidence, and a compressed raw data archive into a certificate that a verifier can check without a GPU. This supports three claims: the per-SM latency map is a stable physical fingerprint with median temporal jitter of 0.09 cycles over a six-hour full-load RTX 5090 run and 100.0% shape-only leave-one-out classification accuracy for distinct Blackwell dies; cache-bypassing HBM sweeps recover hardware-class topology across generations including a unified Volta V
What carries the argument
The per-SM latency matrix measured via dependent global loads using physical SM labels, which serves as a stable physical fingerprint.
If this is right
- The per-SM latency map remains stable with median temporal jitter of 0.09 cycles over six-hour full-load runs on RTX 5090.
- Shape-only leave-one-out classification separates distinct Blackwell dies with 100.0% accuracy.
- Cache-bypassing HBM sweeps recover hardware-class topology across generations, including specific cross-die penalties in Blackwell B200.
- 169 RIPE Atlas probes localize a B200 server within 44 km of its claimed datacentre and reject all 11 decoy sites.
Where Pith is reading between the lines
- If the fingerprint is unforgeable, tenants could continuously monitor jobs to detect any runtime hardware substitution.
- The approach might extend to continuous attestation during long-running workloads by re-measuring the latency matrix periodically.
- Third-party auditors without GPU access could use the certificates to verify cloud provider claims on hardware class and location.
- Similar per-core or per-unit latency patterns could be explored for attestation on non-GPU accelerators with hierarchical memory.
Load-bearing premise
The latency matrix and network landmarks measured by ordinary CUDA code cannot be forged or altered by the cloud provider or hypervisor without detectable changes to the reported statistics or hashes.
What would settle it
A hypervisor that intercepts CUDA calls, supplies a forged latency matrix and network responses matching the expected certificate hashes and statistics, yet runs on different hardware without producing detectable statistical deviations.
Figures
read the original abstract
Cloud GPU tenants receive a model name and a region, but cannot directly inspect the physical accelerator that runs their job. We present a software-only attestation primitive for this setting. A CUDA probe measures an SM-by-memory-region latency matrix using physical SM labels and dependent global loads. A streaming reducer commits sufficient statistics, configuration, code hashes, network evidence, and a compressed raw data archive into a certificate that a verifier can check without a GPU. The certificate supports three claims. First, the per-SM latency map is a stable physical fingerprint. Over a six-hour full-load RTX 5090 run, its median temporal jitter is 0.09 cycles, while shape-only leave-one-out classification separates distinct Blackwell dies with 100.0% accuracy. Second, cache-bypassing HBM sweeps recover hardware-class topology across generations, including a unified Volta V100 memory domain, a two-way Hopper H200 L2 split, and a Blackwell B200 two-die NV-HBI package whose 74/74 SM partition carries a 30-cycle, 15.5 ns cross-die penalty. Third, public network landmarks bind the same certificate to a coarse location. In the B200 run, 169 RIPE Atlas probes place the server within 44 km of its claimed datacentre and reject all 11 decoy sites. Together, these measurements check cloud-GPU identity, class, and coarse location without privileged access or a vendor key.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a software-only attestation primitive for cloud GPUs. A CUDA probe constructs an SM-by-memory-region latency matrix via physical SM labels and dependent global loads; a streaming reducer commits sufficient statistics, code hashes, configuration, network evidence, and a compressed archive into a verifiable certificate. The certificate supports three claims: (1) the per-SM latency map is a stable physical fingerprint (0.09-cycle median jitter over 6 h on RTX 5090; 100% shape-only leave-one-out classification on distinct Blackwell dies), (2) cache-bypassing HBM sweeps recover hardware-class topology across V100/H200/B200 generations (including a 30-cycle cross-die penalty on B200), and (3) RIPE Atlas landmarks bind the certificate to coarse location (169 probes place a B200 server within 44 km of its claimed datacentre and reject 11 decoys).
Significance. If the unforgeability and stability claims hold, the work supplies a practical, vendor-key-free method for tenants to verify GPU identity, class, and location in shared cloud environments. The empirical components—long-duration jitter measurements, cross-generation topology recovery, and public-network landmark binding—are concrete strengths that could be directly reused or extended.
major comments (2)
- [Abstract, §3 (certificate construction), §5 (evaluation)] The central attestation claim (that the latency matrix and RIPE Atlas landmarks cannot be forged or altered by a hypervisor without detectable changes to hashes or statistics) is load-bearing yet unsupported by any adversarial evaluation. No section examines attacks such as memory-region remapping, scheduler virtualization, or controlled per-load delay injection that preserve the reported CUDA code hash, configuration, and streaming-reducer statistics.
- [§4.1, §4.2] §4.1 and §4.2 report concrete stability figures (0.09-cycle median jitter, 100% leave-one-out accuracy, 30-cycle cross-die penalty) without error bars, full exclusion criteria, or statistical tests on the underlying distributions; this directly affects the reliability of the fingerprint and topology claims.
minor comments (2)
- [Figures 3–5, Table 2] Figure captions and tables should explicitly state the number of independent runs and any filtering applied to the latency samples.
- [§2.2] Notation for SM labels and memory regions is introduced without a consolidated glossary; a short table mapping labels to hardware units would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the attestation claims and the statistical presentation of results. We address each major comment below, indicating planned revisions.
read point-by-point responses
-
Referee: [Abstract, §3 (certificate construction), §5 (evaluation)] The central attestation claim (that the latency matrix and RIPE Atlas landmarks cannot be forged or altered by a hypervisor without detectable changes to hashes or statistics) is load-bearing yet unsupported by any adversarial evaluation. No section examines attacks such as memory-region remapping, scheduler virtualization, or controlled per-load delay injection that preserve the reported CUDA code hash, configuration, and streaming-reducer statistics.
Authors: We agree that the manuscript does not contain adversarial evaluations against hypervisor attacks such as memory remapping or delay injection. The work centers on the construction of the certificate from unprivileged CUDA measurements and its observed stability and topology properties under normal execution; the code hashes and reducer statistics are included to enable detection of gross tampering, but no claim of resistance to the specific attacks listed is supported by experiments. In revision we will add a limitations subsection to §5 that explicitly enumerates these attack vectors, clarifies the scope of the current empirical claims, and identifies them as directions for future adversarial analysis. revision: yes
-
Referee: [§4.1, §4.2] §4.1 and §4.2 report concrete stability figures (0.09-cycle median jitter, 100% leave-one-out accuracy, 30-cycle cross-die penalty) without error bars, full exclusion criteria, or statistical tests on the underlying distributions; this directly affects the reliability of the fingerprint and topology claims.
Authors: We accept that the reported figures would be strengthened by additional statistical detail. In the revised manuscript we will update §4.1 and §4.2 to include error bars (standard deviation and interquartile range) on the jitter and cross-die penalty measurements, provide an explicit account of sample exclusion criteria, and add statistical tests (e.g., two-sample Kolmogorov-Smirnov tests on latency distributions and bootstrap confidence intervals on classification accuracy) to support the leave-one-out results and generational topology distinctions. revision: yes
Circularity Check
No circularity: claims rest on direct empirical measurements
full rationale
The paper presents empirical observations from CUDA probes (SM-by-memory latency matrices, temporal jitter of 0.09 cycles, leave-one-out classification accuracy, HBM topology sweeps, and RIPE Atlas network landmarks) without any derivation chain, equations, or first-principles predictions. No step reduces a claimed result to fitted parameters, self-definitions, or self-citations by construction; the reported fingerprints and topology recoveries are direct measurements rather than quantities defined by the same data. The central claims are therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Latency matrix measured by user-level CUDA code reflects stable physical properties of the GPU die and memory system.
Reference graph
Works this paper leans on
-
[1]
Remote ATtestation procedureS (RATS) Architecture
Henk Birkholz, Dave Thaler, Michael Richardson, Ned Smith, and Wei Pan. Remote ATtestation procedureS (RATS) Architecture. RFC 9334, Internet Engineering Task Force, 2023
2023
-
[2]
Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing
Zhongshu Gu, Enriquillo Valdez, Salman Ahmed, Julian James Stephen, Michael Le, Hani Jamjoom, Shixuan Zhao, and Zhiqiang Lin. NVIDIA GPU Confidential Computing Demystified.arXiv preprint arXiv:2507.02770, 2025. doi: 10.48550/arXiv.2507.02770
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.02770 2025
-
[3]
Eric Boniardi, Stanley Bishop, and Alison Haire. Validation of GPU Computation in Decentralized, Trustless Networks.arXiv preprint arXiv:2501.05374, 2025. doi: 10.48550 /arXiv.2501.05374
arXiv 2025
-
[4]
Towards Verifiable Network Telemetry without Special Purpose Hardware
Jaechan An, Zeying Zhu, Ian Miers, and Zaoxing Liu. Towards Verifiable Network Telemetry without Special Purpose Hardware. InProceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets), 2025. doi: 10.1145/3772356.3772392
-
[5]
Xinxin Mei and Xiaowen Chu. Dissecting GPU memory hierarchy through microbench- marking.IEEE Transactions on Parallel and Distributed Systems, 28(1):72–86, 2017. doi: 10.1109/TPDS.2016.2549523
-
[6]
Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. Dissecting the NVIDIA volta GPU architecture via microbenchmarking.arXiv preprint arXiv:1804.06826,
-
[7]
doi: 10.48550/arXiv.1804.06826
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1804.06826
-
[8]
Zhixian Jin, Christopher Rocca, Jiho Kim, Hans Kasan, Minsoo Rhu, Ali Bakhoda, Tor M. Aamodt, and John Kim. Uncovering real GPU NoC characteristics: Implications on interconnect architecture. InProceedings of the 57th Annual IEEE/ACM International Symposium on Microarchitecture, pages 885–898, 2024. doi: 10.1109/MICRO61859.2024. 00070
-
[9]
Aaron Jarmusch and Sunita Chandrasekaran. Microbenchmarking NVIDIA’s Blackwell Architecture: An in-depth Architectural Analysis.arXiv preprint arXiv:2512.02189, 2025. doi: 10.48550/arXiv.2512.02189
-
[10]
Rendered insecure: GPU side channel attacks are practical
Hoda Naghibijouybari, Ajaya Neupane, Zhiyun Qian, and Nael Abu-Ghazaleh. Rendered insecure: GPU side channel attacks are practical. InProceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 2139–2153, 2018. doi: 10.1145/3243734.3243831
-
[11]
Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems
Sankha Baran Dutta, Hoda Naghibijouybari, Arjun Gupta, Nael Abu-Ghazaleh, Andres Marquez, and Kevin Barker. Spy in the GPU-box: Covert and side channel attacks on multi-GPU systems. InProceedings of the 50th Annual International Symposium on Computer Architecture, 2023. doi: 10.1145/3579371.3589080
-
[12]
Yicheng Zhang, Ravan Nazaraliyev, Sankha Baran Dutta, Andres Marquez, Kevin Barker, andNaelAbu-Ghazaleh. NVBleed: Covertandside-channelattacksonNVIDIAmulti-GPU interconnect.arXiv preprint arXiv:2503.17847, 2025. doi: 10.48550/arXiv.2503.17847
-
[13]
Ben Du, Massimo Candela, Bradley Huffaker, Alex C. Snoeren, and kc claffy. RIPE IPmap Active Geolocation: Mechanism and Performance Evaluation. InACM SIGCOMM Computer Communication Review, volume 50, pages 3–10, 2020. doi: 10.1145/3402413.34 02415. 11
-
[14]
Dude, where’s that IP? circumventing measurement-based IP geolocation
Phillipa Gill, Yashar Ganjali, Bernard Wong, and David Lie. Dude, where’s that IP? circumventing measurement-based IP geolocation. InProceedings of the 19th USENIX Security Symposium, 2010
2010
-
[15]
Trust, But Verify, Operator-Reported Geolocation.arXiv preprint arXiv:2409.19109, 2024
Katherine Izhikevich, Ben Du, Sumanth Rao, Alisha Ukani, and Liz Izhikevich. Trust, But Verify, Operator-Reported Geolocation.arXiv preprint arXiv:2409.19109, 2024. doi: 10.48550/arXiv.2409.19109
-
[16]
Parallel Thread Execution ISA.https://docs.nvidia.com/cuda /parallel-thread-execution/, 2026
NVIDIA Corporation. Parallel Thread Execution ISA.https://docs.nvidia.com/cuda /parallel-thread-execution/, 2026. Accessed 2026-06-21
2026
-
[17]
How does Cloudflare’s Speed Test really work?https://blog.cloudflare
Cloudflare. How does Cloudflare’s Speed Test really work?https://blog.cloudflare. com/how-does-cloudflares-speed-test-really-work/, 2025. Accessed 2026-06-21
2025
-
[18]
ndt7 Protocol.https://www.measurementlab.net/tests/ndt/ndt7/,
Measurement Lab. ndt7 Protocol.https://www.measurementlab.net/tests/ndt/ndt7/,
-
[19]
RIPE Atlas REST API: Measurements.https://atlas.ripe.net/docs/ap is/rest-api-reference/measurements/, 2026
RIPE NCC. RIPE Atlas REST API: Measurements.https://atlas.ripe.net/docs/ap is/rest-api-reference/measurements/, 2026. Accessed 2026-06-21
2026
-
[20]
Secure, Governable Chips: Using On-Chip Mechanisms to Manage National Security Risks from AI and Advanced Computing
Tim Fist and Erich Grunewald. Secure, Governable Chips: Using On-Chip Mechanisms to Manage National Security Risks from AI and Advanced Computing. Center for a New American Security (CNAS) report, 2023. Accessed 2026-06-22
2023
-
[21]
Location Verification for AI Chips.https://www.ia ps.ai/research/location-verification-for-ai-chips, 2025
Institute for AI Policy and Strategy. Location Verification for AI Chips.https://www.ia ps.ai/research/location-verification-for-ai-chips, 2025. Accessed 2026-06-22
2025
-
[22]
Aaron Scher and Lisa Thiergart. Mechanisms to Verify International Agreements About AI Development.arXiv preprint arXiv:2506.15867, 2025. doi: 10.48550/arXiv.2506.15867
-
[23]
Distance-bounding protocols
Stefan Brands and David Chaum. Distance-bounding protocols. InAdvances in Cryptology — EUROCRYPT ’93, volume 765 ofLNCS, pages 344–359. Springer, 1994. doi: 10.1007/ 3-540-48285-7_30
1994
-
[24]
Understanding GPU resource interference one level deeper
Paul Elvinger, Foteini Strati, Natalie Enright Jerger, and Ana Klimovic. Understanding GPU resource interference one level deeper. InProceedings of the 2025 ACM Symposium on Cloud Computing (SoCC), 2025. doi: 10.1145/3772052.3772270
-
[25]
Policies for Format Requirements.https://info.arxiv.org/help/policies/f ormat_requirements.html, 2026
arXiv. Policies for Format Requirements.https://info.arxiv.org/help/policies/f ormat_requirements.html, 2026. Accessed 2026-06-21
2026
-
[26]
Oversized Submissions
arXiv. Oversized Submissions. https://info.arxiv.org/help/sizes.html , 2026. Accessed 2026-06-21
2026
-
[27]
Ancillary Files (data, code, images).https://info.arxiv.org/help/ancillary_ files.html, 2026
arXiv. Ancillary Files (data, code, images).https://info.arxiv.org/help/ancillary_ files.html, 2026. Accessed 2026-06-21
2026
-
[28]
Support for data sets associated with arXiv articles.https://info.arxiv.org/h elp/datasets.html, 2026
arXiv. Support for data sets associated with arXiv articles.https://info.arxiv.org/h elp/datasets.html, 2026. Accessed 2026-06-21. 12
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.