AgileOS: A GPU Operating System Layer for Protected CUDA Services

Alex Jones; Peipei Zhou; Yiyu Shi; Zhuoping Yang

arxiv: 2606.06697 · v1 · pith:LK7PSMNMnew · submitted 2026-06-04 · 💻 cs.CR · cs.OS

AgileOS: A GPU Operating System Layer for Protected CUDA Services

Zhuoping Yang , Yiyu Shi , Alex Jones , Peipei Zhou This is my paper

Pith reviewed 2026-06-28 00:14 UTC · model grok-4.3

classification 💻 cs.CR cs.OS

keywords CUDA virtualizationGPU securityoperating system layerPTX injectionmemory protectiontrusted executionlibrary interception

0 comments

The pith

AgileOS virtualizes CUDA at the library boundary so a trusted worker can mediate access and protect service state from applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AgileOS as a GPU operating system layer to provide protection around CUDA services. Modern applications need isolation for device queues, memory-mapped regions, and library state, but standard CUDA exposes these directly to each application. AgileOS achieves this by having applications link to client shims while a trusted worker handles the real CUDA context and uses pointer validation with PTX guards for memory separation. This allows support for services and libraries without ad hoc mechanisms.

Core claim

AgileOS virtualizes CUDA at the library boundary: applications link against client-side CUDA Runtime, Driver, and selected library shims, while a trusted runtime worker owns the real CUDA context and mediates supported operations. To protect service state and module interfaces, AgileOS defines a GPU memory-management model that separates user allocations from protected module/MMIO ranges, using pointer validation and memory access guards via PTX injection.

What carries the argument

The library-boundary virtualization with a trusted runtime worker owning the CUDA context and PTX-injected memory access guards for separating user and protected allocations.

If this is right

Applications can interact with protected GPU services without direct access to context or device pointers.
Library shims enable compatibility with existing code while routing calls to the worker.
The memory model prevents exposure of protected ranges to untrusted kernels.
Modular design supports various services and libraries such as cuFFT and PyTorch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This design could enable more secure sharing of GPU resources in multi-user environments.
PTX injection might be adaptable for other forms of runtime enforcement in GPU code.
Future extensions could include support for additional device interactions like storage or networking.

Load-bearing premise

That applications cannot bypass the client-side shims to access the real CUDA context directly, and that PTX-level guards can enforce memory separation without breaking library compatibility.

What would settle it

An application that successfully accesses protected module state or MMIO regions without mediation by the trusted worker would disprove the isolation claim.

Figures

Figures reproduced from arXiv: 2606.06697 by Alex Jones, Peipei Zhou, Yiyu Shi, Zhuoping Yang.

**Figure 2.** Figure 2: PTX-level kernel memory guard; (a) a CUDA kernel in the user program; (b) received PTX code from the compiled user program in AgileOS; [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

read the original abstract

Modern GPU applications increasingly interact with storage systems, network devices, vendor libraries, and GPU-resident services rather than executing only isolated compute kernels. This shift creates a need for operating-system-like protection around GPU services, where service metadata, device queues, memory-mapped I/O regions, and library-internal state should not be directly exposed to untrusted application kernels. However, today's CUDA programming model, by default, still gives each application direct ownership of its CUDA context, device pointers, runtime handles, module loading path, and kernel launches, leaving protected GPU services to build their own ad hoc interfaces and isolation mechanisms. This paper presents the initial design and prototype scope of AgileOS, a GPU operating-system layer for protected CUDA services. AgileOS virtualizes CUDA at the library boundary: applications link against client-side CUDA Runtime, Driver, and selected library shims, while a trusted runtime worker owns the real CUDA context and mediates supported operations. To protect service state and module interfaces, AgileOS also defines a GPU memory-management model that separates user allocations from protected module/MMIO ranges, using pointer validation and memory access guards via PTX injection. AgileOS is modularized and flexible, supporting a range of protected services and existing libraries such as cuFFT and PyTorch. The prototype includes client-side interceptors, worker-side CUDA handlers, virtualized CUDA object tables, protected AgileOS modules, a GPU memory manager that separates user allocations from protected module/MMIO ranges, selected trusted library adapters, and the PTX-level kernel memory guard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgileOS sketches a library-shim plus PTX-guard model for CUDA service isolation but gives no data or analysis to show it holds up.

read the letter

The core of this paper is a design for virtualizing CUDA at the library boundary so that protected services do not hand direct context ownership to untrusted apps. Applications link to client shims for the runtime, driver, and libraries while a trusted worker owns the real context and mediates calls. The other new piece is a memory model that keeps user allocations separate from protected module and MMIO ranges, enforced with pointer checks and PTX injection.

The paper does a clear job stating the problem: modern workloads use vendor libraries and external devices, yet CUDA still exposes queues, handles, and memory maps directly. The modular breakdown into interceptors, handlers, object tables, and library adapters is straightforward and shows they thought about supporting existing code like cuFFT and PyTorch.

The soft spots are exactly where the stress-test note points. Nothing in the description shows that the shims catch every entry point or that PTX guards block all loads and stores to protected ranges without missing cases or breaking compatibility. There are also no performance numbers, no attack tests, and no coverage analysis. The design rests on unverified assumptions about bypass paths and guard reliability.

This is for systems people working on GPU security or accelerator OS layers. A reader who wants concrete ideas for isolation mechanisms can get value from the architecture sketch. It deserves a serious referee because the problem is real and the proposed mechanisms are specific enough to review, even though the current version is still at the proposal stage and will need substantial added evidence to be convincing.

Referee Report

2 major / 1 minor

Summary. The paper presents the initial design and prototype scope of AgileOS, a GPU operating-system layer for protected CUDA services. It virtualizes CUDA at the library boundary by having applications link against client-side CUDA Runtime, Driver, and library shims, while a trusted runtime worker owns the real CUDA context and mediates operations. Protection of service state uses a GPU memory-management model separating user allocations from protected module/MMIO ranges via pointer validation and PTX injection guards. The prototype includes client-side interceptors, worker-side handlers, virtualized object tables, protected modules, a memory manager, trusted library adapters, and the PTX-level kernel memory guard. It claims support for libraries such as cuFFT and PyTorch.

Significance. If the protection claims hold, AgileOS could address a real need for OS-like isolation around GPU services in modern applications interacting with storage, networks, and vendor libraries. However, the manuscript supplies no evaluation data, security analysis, performance numbers, or bypass testing, so the significance cannot be assessed beyond the architectural proposal itself.

major comments (2)

[Prototype description] Prototype description (abstract and full design section): The central protection claim rests on the assumption that linking against client-side shims prevents direct CUDA context ownership and that PTX injection reliably separates user allocations from protected ranges. No coverage analysis of CUDA entry points, direct driver handles, inline PTX, or module loading paths is provided, nor any attack surface evaluation.
[Memory management model] Memory management model (abstract): The PTX-level memory access guards are presented as enforcing separation, but the manuscript contains no formal argument, test cases, or compatibility analysis showing that guards catch all loads/stores to protected MMIO/module ranges without false negatives or breaking existing libraries.

minor comments (1)

The abstract and prototype list could benefit from explicit section numbering or a diagram of the client/worker boundary to improve readability of the modular design.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the scope of our initial design and prototype. The manuscript focuses on the architecture and implementation of the virtualization layer rather than a complete security evaluation. We respond to each major comment below.

read point-by-point responses

Referee: [Prototype description] Prototype description (abstract and full design section): The central protection claim rests on the assumption that linking against client-side shims prevents direct CUDA context ownership and that PTX injection reliably separates user allocations from protected ranges. No coverage analysis of CUDA entry points, direct driver handles, inline PTX, or module loading paths is provided, nor any attack surface evaluation.

Authors: We agree that the manuscript lacks a coverage analysis of CUDA entry points, direct driver handles, inline PTX, module loading paths, and attack surface evaluation. This is because the work presents an initial design and prototype scope; a full security analysis and bypass testing are outside the current contribution. We will revise the manuscript to explicitly discuss the assumptions, the limited set of supported entry points in the prototype, and the planned directions for comprehensive analysis. revision: partial
Referee: [Memory management model] Memory management model (abstract): The PTX-level memory access guards are presented as enforcing separation, but the manuscript contains no formal argument, test cases, or compatibility analysis showing that guards catch all loads/stores to protected MMIO/module ranges without false negatives or breaking existing libraries.

Authors: We acknowledge that the manuscript provides no formal argument, test cases, or compatibility analysis for the PTX guards. The current prototype implements the guards as part of the memory manager, but without the requested verification. We will add a revised section describing the guard insertion mechanism, its intended coverage for loads/stores, and any observed compatibility with the supported libraries (cuFFT, PyTorch), while noting the absence of exhaustive testing. revision: partial

Circularity Check

0 steps flagged

No circularity: architectural design proposal without derivations or fitted predictions

full rationale

The paper presents an architectural proposal for AgileOS, describing a virtualization approach at the CUDA library boundary with PTX injection for memory guards. No equations, parameter fittings, predictions, or derivation chains are present in the abstract or described content. The work relies on design choices and prototype implementation rather than any self-referential mathematical reductions or self-citation load-bearing claims. This is a standard non-finding for systems papers that do not claim first-principles derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current CUDA exposes service state directly and that mediation plus PTX guards can enforce isolation. No free parameters or new physical entities are introduced.

axioms (1)

domain assumption Today's CUDA programming model gives each application direct ownership of its CUDA context, device pointers, runtime handles, module loading path, and kernel launches.
Stated explicitly in the opening paragraph as the motivation for the work.

pith-pipeline@v0.9.1-grok · 5809 in / 1264 out tokens · 31314 ms · 2026-06-28T00:14:37.324001+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Deep Learning Workload Scheduling in GPU Datacenters: A Survey,

Z. Ye, W. Gao, Q. Hu, P. Sun, X. Wang, Y . Luo, T. Zhang, and Y . Wen, “Deep Learning Workload Scheduling in GPU Datacenters: A Survey,” ACM Computing Surveys, vol. 56, no. 6, pp. 1–38, 2024

2024
[2]

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better,

G. Menghani, “Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better,”ACM Computing Surveys, vol. 55, no. 12, pp. 1–37, 2023

2023
[3]

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems,

X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems,”ACM Computing Surveys, vol. 58, no. 1, pp. 1–37, 2025

2025
[4]

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing,

T. Liu, Y . Chen, D. Li, C. Wu, Y . Zhu, J. He, Y . Peng, H. Chen, H. Chen, and C. Guo, “BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing,” in20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 103–118

2023
[5]

GraphBLAST: A High- Performance Linear Algebra-based Graph Framework on the GPU,

C. Yang, A. Buluc ¸, and J. D. Owens, “GraphBLAST: A High- Performance Linear Algebra-based Graph Framework on the GPU,” ACM Transactions on Mathematical Software (TOMS), vol. 48, no. 1, pp. 1–51, 2022

2022
[6]

gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning,

P. Gong, R. Liu, Z. Mao, Z. Cai, X. Yan, C. Li, M. Wang, and Z. Li, “gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning,” inProceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 562–578

2023
[7]

Efficient and Scalable Graph Pattern Mining on GPUs,

X. Chenet al., “Efficient and Scalable Graph Pattern Mining on GPUs,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 857–877

2022
[8]

The Transformational Role of GPU Computing and Deep Learning in Drug Discovery,

M. Pandey, M. Fernandez, F. Gentile, O. Isayev, A. Tropsha, A. C. Stern, and A. Cherkasov, “The Transformational Role of GPU Computing and Deep Learning in Drug Discovery,”Nature Machine Intelligence, vol. 4, no. 3, pp. 211–221, 2022

2022
[9]

Jax-fem: A differentiable gpu-accelerated 3d finite element solver for automatic inverse design and mechanistic data science,

T. Xue, S. Liao, Z. Gan, C. Park, X. Xie, W. K. Liu, and J. Cao, “Jax-fem: A differentiable gpu-accelerated 3d finite element solver for automatic inverse design and mechanistic data science,”Computer Physics Communications, vol. 291, p. 108802, 2023

2023
[10]

cuquantum sdk: A high-performance library for accelerating quantum science,

H. Bayraktar, A. Charara, D. Clark, S. Cohen, T. Costa, Y .-L. L. Fang, Y . Gao, J. Guan, J. Gunnels, A. Haidaret al., “cuquantum sdk: A high-performance library for accelerating quantum science,” in2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 1. IEEE, 2023, pp. 1050–1061

2023
[11]

GPU-Initiated On-Demand High- Throughput Storage Access in the BaM System Architecture,

Z. Qureshi, V . S. Mailthody, I. Gelado, S. Min, A. Masood, J. Park, J. Xiong, C. J. Newburn, D. Vainbrand, I.-H. Chung, M. Gar- land, W. Dally, and W.-m. Hwu, “GPU-Initiated On-Demand High- Throughput Storage Access in the BaM System Architecture,” inPro- ceedings of the 28th ACM International Conference on Architectural Support for Programming Languages...

2023
[12]

GMT: GPU Orchestrated Memory Tiering for the Big Data Era,

C.-H. Chang, J. Han, A. Sivasubramaniam, V . Sharma Mailthody, Z. Qureshi, and W.-M. Hwu, “GMT: GPU Orchestrated Memory Tiering for the Big Data Era,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 464–478

2024
[13]

AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration,

Z. Yang, J. Zhuang, X. Chen, A. Jones, and P. Zhou, “AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 1028–1042

2025
[14]

Asynchrony and GPUs: Bridging this Dichotomy for I/O with AGIO,

J. Han, A. Sivasubramaniam, C.-H. Chang, V . S. Mailthody, Z. Qureshi, and W.-M. Hwu, “Asynchrony and GPUs: Bridging this Dichotomy for I/O with AGIO,” inProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026, pp. 208–222

2026
[15]

GeminiFS: A Companion File System for GPUs,

S. Qiu, W. Liu, Y . Hu, J. Yan, Z. Shen, X. Yao, R. Chen, G. Zhang, and Y . Zhang, “GeminiFS: A Companion File System for GPUs,” in 23rd USENIX Conference on File and Storage Technologies (FAST 25). Santa Clara, CA: USENIX Association, Feb. 2025, pp. 221–236

2025
[16]

Managing Scalable Direct Storage Accesses for GPUs with GoFS,

S. Li, Y . E. Zhou, Y . Xue, Y . Xu, and J. Huang, “Managing Scalable Direct Storage Accesses for GPUs with GoFS,” inProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 2025, pp. 979–995

2025
[17]

FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs,

Z. Wang, H. Huang, J. Zhang, F. Wu, and G. Alonso, “FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs,” in2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 967–986

2022
[18]

Enabling Efficient GPU Communication over Multiple NICs with FuseLink,

Z. Ren, Y . Li, Z. Wang, X. Huang, W. Li, K. Xu, X. Liao, Y . Sun, B. Liu, H. Tian, J. Zhang, M. Wang, Z. Zhong, G. Liu, Y . Zhang, and K. Chen, “Enabling Efficient GPU Communication over Multiple NICs with FuseLink,” in19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), 2025, pp. 91–108

2025
[19]

GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion

Y . Yang, X. Gao, Y . Zhou, Y . Gan, Y . Zheng, and A. Quinn, “GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion,” arXiv preprint arXiv:2604.17861, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[20]

LithOS: An Operating System for Efficient Machine Learning on GPUs,

P. H. Coppock, B. Zhang, E. H. Solomon, V . Kypriotis, L. Yang, B. Sharma, D. Schatzberg, T. C. Mowry, and D. Skarlatos, “LithOS: An Operating System for Efficient Machine Learning on GPUs,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 2025, pp. 1–17

2025
[21]

Ai and memory wall,

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, “Ai and memory wall,”IEEE Micro, vol. 44, no. 3, pp. 33–39, 2024

2024
[22]

Zero- infinity: Breaking the gpu memory wall for extreme scale deep learning,

S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y . He, “Zero- infinity: Breaking the gpu memory wall for extreme scale deep learning,” inProceedings of the international conference for high performance computing, networking, storage and analysis, 2021, pp. 1–14

2021
[23]

Mlp- offload: Multi-level, multi-path offloading for llm pre-training to break the gpu memory wall,

A. K. Maurya, M. M. Rafique, F. Cappello, and B. Nicolae, “Mlp- offload: Multi-level, multi-path offloading for llm pre-training to break the gpu memory wall,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 1381–1394

2025
[24]

G10: Enabling an efficient unified gpu memory and storage architecture with smart tensor migrations,

H. Zhang, Y . Zhou, Y . Xue, Y . Liu, and J. Huang, “G10: Enabling an efficient unified gpu memory and storage architecture with smart tensor migrations,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 395–410

2023
[25]

Overcoming the memory wall with{CXL- Enabled}{SSDs},

S.-P. Yang, M. Kim, S. Nam, J. Park, J.-Y . Choi, E. H. Nam, E. Lee, S. Lee, and B. S. Kim, “Overcoming the memory wall with{CXL- Enabled}{SSDs},” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 601–617

2023
[26]

Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms,

Z. Hu, S. Shen, T. Bonato, S. Jeaugey, C. Alexander, E. Spada, J. Dinan, J. Hammond, and T. Hoefler, “Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms,” in2025 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 2025, pp. 48–59

2025
[27]

CUDA Graphs,

NVIDIA Corporation, “CUDA Graphs,”CUDA Programming Guide, Version 13.3, 2026, accessed: 2026-06-03. [Online]. Available: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topic s/cuda-graphs.html

2026
[28]

Breakable CUDA Graph,

SGLang Team, “Breakable CUDA Graph,”SGLang Documentation, 2026, last updated: Jun. 4, 2026. Accessed: 2026-06-03. [Online]. Available: https://sgl-project.github.io/advanced features/breakable cud a graph.html

2026
[29]

Medusa: Accelerating Serverless LLM Inference with Materialization,

S. Zeng, M. Xie, S. Gao, Y . Chen, and Y . Lu, “Medusa: Accelerating Serverless LLM Inference with Materialization,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 653–668

2025
[30]

Multi-Process Service,

NVIDIA Corporation, “Multi-Process Service,”NVIDIA GPU Management and Deployment Documentation, 2026, accessed: 2026- 06-03. [Online]. Available: https://docs.nvidia.com/deploy/mps/latest/in dex.html

2026
[31]

NVIDIA Multi-Instance GPU,

——, “NVIDIA Multi-Instance GPU,”NVIDIA Technologies, 2026, accessed: 2026-06-03. [Online]. Available: https://www.nvidia.com/e n-us/technologies/multi-instance-gpu/

2026
[32]

Transparent GPU Sharing in Container Clouds for Deep Learning Workloads,

B. Wu, Z. Zhang, Z. Bai, X. Liu, and X. Jin, “Transparent GPU Sharing in Container Clouds for Deep Learning Workloads,” in20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 69–85

2023
[33]

Characterizing network requirements for gpu api remoting in ai applications,

T. Wang, Z. Chen, X. Wei, J. Gu, R. Chen, and H. Chen, “Characterizing network requirements for gpu api remoting in ai applications,”arXiv preprint arXiv:2401.13354, 2024

work page arXiv 2024
[34]

{gVulkan}: Scalable{GPU}pooling for{Pixel-Grained}rendering in ray tracing,

Y . Gu, Y . Wang, Y . Sun, Y . Xiang, Y . Jiang, X. Hu, Z. Qi, and H. Guan, “{gVulkan}: Scalable{GPU}pooling for{Pixel-Grained}rendering in ray tracing,” in2024 USENIX Annual Technical Conference (USENIX ATC 24), 2024, pp. 1151–1165

2024
[35]

Krisp: Enabling kernel-wise right-sizing for spatial partitioned gpu inference servers,

M. Chow, A. Jahanshahi, and D. Wong, “Krisp: Enabling kernel-wise right-sizing for spatial partitioned gpu inference servers,” in2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 624–637

2023
[36]

Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception,

S. Zhang, A. Xu, Q. Chen, H. Zhao, W. Cui, Z. Wang, Y . Li, L. Xiao, and M. Guo, “Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 1003–1019

2025
[37]

GPU Memory Exploitation for Fun and Profit,

Y . Guo, Z. Zhang, and J. Yang, “GPU Memory Exploitation for Fun and Profit,” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 4033–4050

2024
[38]

Virtual Memory Management,

NVIDIA Corporation, “Virtual Memory Management,”CUDA Programming Guide, Version 13.3, 2026, accessed: 2026-06-03. [Online]. Available: https://docs.nvidia.com/cuda/cuda-programming-g uide/04-special-topics/virtual-memory-management.html

2026
[39]

DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS,

Wong, Linus Y and Zhang, Jialiang and Li, Jing, “DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS,” inProceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2023, pp. 3–13

2023
[40]

DONGLE 2.0: Direct FPGA- Orchestrated NVMe Storage for HLS,

L. Y . Wong, J. Zhang, and J. Li, “DONGLE 2.0: Direct FPGA- Orchestrated NVMe Storage for HLS,”ACM Transactions on Recon- figurable Technology and Systems, vol. 17, no. 3, pp. 1–32, 2024

2024
[41]

HiLFS: FPGA-Orchestrated File System for High-Level Synthesis,

Y . Na, L. Y . Wong, A. DeHon, and J. Li, “HiLFS: FPGA-Orchestrated File System for High-Level Synthesis,” inProceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2026, pp. 126–136

2026

[1] [1]

Deep Learning Workload Scheduling in GPU Datacenters: A Survey,

Z. Ye, W. Gao, Q. Hu, P. Sun, X. Wang, Y . Luo, T. Zhang, and Y . Wen, “Deep Learning Workload Scheduling in GPU Datacenters: A Survey,” ACM Computing Surveys, vol. 56, no. 6, pp. 1–38, 2024

2024

[2] [2]

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better,

G. Menghani, “Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better,”ACM Computing Surveys, vol. 55, no. 12, pp. 1–37, 2023

2023

[3] [3]

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems,

X. Miao, G. Oliaro, Z. Zhang, X. Cheng, H. Jin, T. Chen, and Z. Jia, “Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems,”ACM Computing Surveys, vol. 58, no. 1, pp. 1–37, 2025

2025

[4] [4]

BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing,

T. Liu, Y . Chen, D. Li, C. Wu, Y . Zhu, J. He, Y . Peng, H. Chen, H. Chen, and C. Guo, “BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing,” in20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 103–118

2023

[5] [5]

GraphBLAST: A High- Performance Linear Algebra-based Graph Framework on the GPU,

C. Yang, A. Buluc ¸, and J. D. Owens, “GraphBLAST: A High- Performance Linear Algebra-based Graph Framework on the GPU,” ACM Transactions on Mathematical Software (TOMS), vol. 48, no. 1, pp. 1–51, 2022

2022

[6] [6]

gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning,

P. Gong, R. Liu, Z. Mao, Z. Cai, X. Yan, C. Li, M. Wang, and Z. Li, “gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning,” inProceedings of the 29th Symposium on Operating Systems Principles, 2023, pp. 562–578

2023

[7] [7]

Efficient and Scalable Graph Pattern Mining on GPUs,

X. Chenet al., “Efficient and Scalable Graph Pattern Mining on GPUs,” in16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), 2022, pp. 857–877

2022

[8] [8]

The Transformational Role of GPU Computing and Deep Learning in Drug Discovery,

M. Pandey, M. Fernandez, F. Gentile, O. Isayev, A. Tropsha, A. C. Stern, and A. Cherkasov, “The Transformational Role of GPU Computing and Deep Learning in Drug Discovery,”Nature Machine Intelligence, vol. 4, no. 3, pp. 211–221, 2022

2022

[9] [9]

Jax-fem: A differentiable gpu-accelerated 3d finite element solver for automatic inverse design and mechanistic data science,

T. Xue, S. Liao, Z. Gan, C. Park, X. Xie, W. K. Liu, and J. Cao, “Jax-fem: A differentiable gpu-accelerated 3d finite element solver for automatic inverse design and mechanistic data science,”Computer Physics Communications, vol. 291, p. 108802, 2023

2023

[10] [10]

cuquantum sdk: A high-performance library for accelerating quantum science,

H. Bayraktar, A. Charara, D. Clark, S. Cohen, T. Costa, Y .-L. L. Fang, Y . Gao, J. Guan, J. Gunnels, A. Haidaret al., “cuquantum sdk: A high-performance library for accelerating quantum science,” in2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 1. IEEE, 2023, pp. 1050–1061

2023

[11] [11]

GPU-Initiated On-Demand High- Throughput Storage Access in the BaM System Architecture,

Z. Qureshi, V . S. Mailthody, I. Gelado, S. Min, A. Masood, J. Park, J. Xiong, C. J. Newburn, D. Vainbrand, I.-H. Chung, M. Gar- land, W. Dally, and W.-m. Hwu, “GPU-Initiated On-Demand High- Throughput Storage Access in the BaM System Architecture,” inPro- ceedings of the 28th ACM International Conference on Architectural Support for Programming Languages...

2023

[12] [12]

GMT: GPU Orchestrated Memory Tiering for the Big Data Era,

C.-H. Chang, J. Han, A. Sivasubramaniam, V . Sharma Mailthody, Z. Qureshi, and W.-M. Hwu, “GMT: GPU Orchestrated Memory Tiering for the Big Data Era,” inProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, 2024, pp. 464–478

2024

[13] [13]

AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration,

Z. Yang, J. Zhuang, X. Chen, A. Jones, and P. Zhou, “AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 1028–1042

2025

[14] [14]

Asynchrony and GPUs: Bridging this Dichotomy for I/O with AGIO,

J. Han, A. Sivasubramaniam, C.-H. Chang, V . S. Mailthody, Z. Qureshi, and W.-M. Hwu, “Asynchrony and GPUs: Bridging this Dichotomy for I/O with AGIO,” inProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2026, pp. 208–222

2026

[15] [15]

GeminiFS: A Companion File System for GPUs,

S. Qiu, W. Liu, Y . Hu, J. Yan, Z. Shen, X. Yao, R. Chen, G. Zhang, and Y . Zhang, “GeminiFS: A Companion File System for GPUs,” in 23rd USENIX Conference on File and Storage Technologies (FAST 25). Santa Clara, CA: USENIX Association, Feb. 2025, pp. 221–236

2025

[16] [16]

Managing Scalable Direct Storage Accesses for GPUs with GoFS,

S. Li, Y . E. Zhou, Y . Xue, Y . Xu, and J. Huang, “Managing Scalable Direct Storage Accesses for GPUs with GoFS,” inProceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 2025, pp. 979–995

2025

[17] [17]

FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs,

Z. Wang, H. Huang, J. Zhang, F. Wu, and G. Alonso, “FpgaNIC: An FPGA-based Versatile 100Gb SmartNIC for GPUs,” in2022 USENIX Annual Technical Conference (USENIX ATC 22), 2022, pp. 967–986

2022

[18] [18]

Enabling Efficient GPU Communication over Multiple NICs with FuseLink,

Z. Ren, Y . Li, Z. Wang, X. Huang, W. Li, K. Xu, X. Liao, Y . Sun, B. Liu, H. Tian, J. Zhang, M. Wang, Z. Zhong, G. Liu, Y . Zhang, and K. Chen, “Enabling Efficient GPU Communication over Multiple NICs with FuseLink,” in19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 25), 2025, pp. 91–108

2025

[19] [19]

GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion

Y . Yang, X. Gao, Y . Zhou, Y . Gan, Y . Zheng, and A. Quinn, “GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion,” arXiv preprint arXiv:2604.17861, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[20] [20]

LithOS: An Operating System for Efficient Machine Learning on GPUs,

P. H. Coppock, B. Zhang, E. H. Solomon, V . Kypriotis, L. Yang, B. Sharma, D. Schatzberg, T. C. Mowry, and D. Skarlatos, “LithOS: An Operating System for Efficient Machine Learning on GPUs,” in Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, 2025, pp. 1–17

2025

[21] [21]

Ai and memory wall,

A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer, “Ai and memory wall,”IEEE Micro, vol. 44, no. 3, pp. 33–39, 2024

2024

[22] [22]

Zero- infinity: Breaking the gpu memory wall for extreme scale deep learning,

S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y . He, “Zero- infinity: Breaking the gpu memory wall for extreme scale deep learning,” inProceedings of the international conference for high performance computing, networking, storage and analysis, 2021, pp. 1–14

2021

[23] [23]

Mlp- offload: Multi-level, multi-path offloading for llm pre-training to break the gpu memory wall,

A. K. Maurya, M. M. Rafique, F. Cappello, and B. Nicolae, “Mlp- offload: Multi-level, multi-path offloading for llm pre-training to break the gpu memory wall,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2025, pp. 1381–1394

2025

[24] [24]

G10: Enabling an efficient unified gpu memory and storage architecture with smart tensor migrations,

H. Zhang, Y . Zhou, Y . Xue, Y . Liu, and J. Huang, “G10: Enabling an efficient unified gpu memory and storage architecture with smart tensor migrations,” inProceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023, pp. 395–410

2023

[25] [25]

Overcoming the memory wall with{CXL- Enabled}{SSDs},

S.-P. Yang, M. Kim, S. Nam, J. Park, J.-Y . Choi, E. H. Nam, E. Lee, S. Lee, and B. S. Kim, “Overcoming the memory wall with{CXL- Enabled}{SSDs},” in2023 USENIX Annual Technical Conference (USENIX ATC 23), 2023, pp. 601–617

2023

[26] [26]

Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms,

Z. Hu, S. Shen, T. Bonato, S. Jeaugey, C. Alexander, E. Spada, J. Dinan, J. Hammond, and T. Hoefler, “Demystifying nccl: An in-depth analysis of gpu communication protocols and algorithms,” in2025 IEEE Symposium on High-Performance Interconnects (HOTI). IEEE, 2025, pp. 48–59

2025

[27] [27]

CUDA Graphs,

NVIDIA Corporation, “CUDA Graphs,”CUDA Programming Guide, Version 13.3, 2026, accessed: 2026-06-03. [Online]. Available: https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topic s/cuda-graphs.html

2026

[28] [28]

Breakable CUDA Graph,

SGLang Team, “Breakable CUDA Graph,”SGLang Documentation, 2026, last updated: Jun. 4, 2026. Accessed: 2026-06-03. [Online]. Available: https://sgl-project.github.io/advanced features/breakable cud a graph.html

2026

[29] [29]

Medusa: Accelerating Serverless LLM Inference with Materialization,

S. Zeng, M. Xie, S. Gao, Y . Chen, and Y . Lu, “Medusa: Accelerating Serverless LLM Inference with Materialization,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 653–668

2025

[30] [30]

Multi-Process Service,

NVIDIA Corporation, “Multi-Process Service,”NVIDIA GPU Management and Deployment Documentation, 2026, accessed: 2026- 06-03. [Online]. Available: https://docs.nvidia.com/deploy/mps/latest/in dex.html

2026

[31] [31]

NVIDIA Multi-Instance GPU,

——, “NVIDIA Multi-Instance GPU,”NVIDIA Technologies, 2026, accessed: 2026-06-03. [Online]. Available: https://www.nvidia.com/e n-us/technologies/multi-instance-gpu/

2026

[32] [32]

Transparent GPU Sharing in Container Clouds for Deep Learning Workloads,

B. Wu, Z. Zhang, Z. Bai, X. Liu, and X. Jin, “Transparent GPU Sharing in Container Clouds for Deep Learning Workloads,” in20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), 2023, pp. 69–85

2023

[33] [33]

Characterizing network requirements for gpu api remoting in ai applications,

T. Wang, Z. Chen, X. Wei, J. Gu, R. Chen, and H. Chen, “Characterizing network requirements for gpu api remoting in ai applications,”arXiv preprint arXiv:2401.13354, 2024

work page arXiv 2024

[34] [34]

{gVulkan}: Scalable{GPU}pooling for{Pixel-Grained}rendering in ray tracing,

Y . Gu, Y . Wang, Y . Sun, Y . Xiang, Y . Jiang, X. Hu, Z. Qi, and H. Guan, “{gVulkan}: Scalable{GPU}pooling for{Pixel-Grained}rendering in ray tracing,” in2024 USENIX Annual Technical Conference (USENIX ATC 24), 2024, pp. 1151–1165

2024

[35] [35]

Krisp: Enabling kernel-wise right-sizing for spatial partitioned gpu inference servers,

M. Chow, A. Jahanshahi, and D. Wong, “Krisp: Enabling kernel-wise right-sizing for spatial partitioned gpu inference servers,” in2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2023, pp. 624–637

2023

[36] [36]

Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception,

S. Zhang, A. Xu, Q. Chen, H. Zhao, W. Cui, Z. Wang, Y . Li, L. Xiao, and M. Guo, “Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception,” in2025 USENIX Annual Technical Conference (USENIX ATC 25), 2025, pp. 1003–1019

2025

[37] [37]

GPU Memory Exploitation for Fun and Profit,

Y . Guo, Z. Zhang, and J. Yang, “GPU Memory Exploitation for Fun and Profit,” in33rd USENIX Security Symposium (USENIX Security 24), 2024, pp. 4033–4050

2024

[38] [38]

Virtual Memory Management,

NVIDIA Corporation, “Virtual Memory Management,”CUDA Programming Guide, Version 13.3, 2026, accessed: 2026-06-03. [Online]. Available: https://docs.nvidia.com/cuda/cuda-programming-g uide/04-special-topics/virtual-memory-management.html

2026

[39] [39]

DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS,

Wong, Linus Y and Zhang, Jialiang and Li, Jing, “DONGLE: Direct FPGA-Orchestrated NVMe Storage for HLS,” inProceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2023, pp. 3–13

2023

[40] [40]

DONGLE 2.0: Direct FPGA- Orchestrated NVMe Storage for HLS,

L. Y . Wong, J. Zhang, and J. Li, “DONGLE 2.0: Direct FPGA- Orchestrated NVMe Storage for HLS,”ACM Transactions on Recon- figurable Technology and Systems, vol. 17, no. 3, pp. 1–32, 2024

2024

[41] [41]

HiLFS: FPGA-Orchestrated File System for High-Level Synthesis,

Y . Na, L. Y . Wong, A. DeHon, and J. Li, “HiLFS: FPGA-Orchestrated File System for High-Level Synthesis,” inProceedings of the 2026 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2026, pp. 126–136

2026