GPUOS: A GPU Operating System Primitive for Transparent Operation Fusion

2026 · cs.DC · arXiv 2604.17861

1 Pith paper cites this work. Polarity classification is still being indexed.

abstract

Modern deep learning workloads often consist of many small tensor operations, especially in inference, attention, and micro-batched training. In these settings, kernel launch overhead can become a major bottleneck, sometimes exceeding the actual computation time. We present GPUOS, a GPU runtime JIT system that reduces launch overhead using a persistent kernel architecture with runtime operator injection. GPUOS runs a single long-lived GPU kernel that continuously processes tasks from a host-managed work queue, eliminating repeated kernel launches. To support diverse operations, GPUOS uses NVIDIA NVRTC to just-in-time compile operators at runtime and inject them into the running kernel through device function pointer tables. This design enables operator updates without restarting the kernel or recompiling the system. GPUOS introduces four key ideas: (1) a persistent worker kernel with atomic task queues, (2) a runtime operator injection mechanism based on NVRTC and relocatable device code, (3) a dual-slot aliasing scheme for safe concurrent operator updates, and (4) transparent PyTorch integration through TorchDispatch that batches micro-operations into unified submissions. The system supports arbitrary tensor shapes, strides, data types, and broadcasting through a generic tensor abstraction. Experiments show that GPUOS achieves up to 15.3x speedup over standard PyTorch on workloads dominated by small operations, including micro-batched inference and attention patterns. GPUOS improves utilization while remaining compatible with the PyTorch ecosystem.
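The persistent-worker idea in the abstract can be illustrated with a CPU-side analogy. The sketch below is plain Python, not GPUOS code: a thread stands in for the long-lived kernel, a `queue.Queue` for the host-managed work queue, and a dictionary of callables for the device function pointer table. The names `OperatorTable`, `persistent_worker`, and `run_demo` are hypothetical, and the two-slot update shown here is only one plausible reading of "dual-slot aliasing", which the abstract does not spell out.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker loop to exit


class OperatorTable:
    """Hypothetical stand-in for a function pointer table with two slots per operator."""

    def __init__(self):
        self._slots = {}   # op_id -> [callable or None, callable or None]
        self._active = {}  # op_id -> index (0 or 1) of the slot readers should use

    def install(self, op_id, fn):
        slots = self._slots.setdefault(op_id, [None, None])
        spare = 1 - self._active.get(op_id, 1)  # the first install lands in slot 0
        slots[spare] = fn                        # fill the inactive slot first
        self._active[op_id] = spare              # then publish it by flipping the index

    def lookup(self, op_id):
        return self._slots[op_id][self._active[op_id]]


def persistent_worker(tasks, table, results):
    """Long-lived loop: started once, then fed work through the queue."""
    while True:
        task = tasks.get()
        try:
            if task is _STOP:
                return
            op_id, args = task
            results.append(table.lookup(op_id)(*args))
        finally:
            tasks.task_done()


def run_demo():
    table, tasks, results = OperatorTable(), queue.Queue(), []
    table.install("scale", lambda x: 2 * x)

    worker = threading.Thread(target=persistent_worker, args=(tasks, table, results))
    worker.start()  # the only "launch" in this sketch

    for x in (1, 2, 3):
        tasks.put(("scale", (x,)))
    tasks.join()  # wait until the three queued tasks have been processed

    # Replace the operator while the worker keeps running.
    table.install("scale", lambda x: 10 * x)
    tasks.put(("scale", (4,)))

    tasks.put(_STOP)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_demo())  # [2, 4, 6, 40]
```

The worker is started once and every later operation costs only a queue insertion, which is the analogue of avoiding repeated kernel launches. A real GPU implementation would additionally need device-visible memory for the queue and appropriate memory fences around the slot flip; neither is modeled here.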

representative citing papers

AgileOS: A GPU Operating System Layer for Protected CUDA Services

cs.CR · 2026-06-04 · unverdicted · novelty 7.0

AgileOS virtualizes CUDA at the library boundary using client shims and a trusted worker that owns the real context, plus PTX-injected guards to separate user and protected memory ranges.

citing papers explorer

AgileOS: A GPU Operating System Layer for Protected CUDA Services

cs.CR · 2026-06-04 · unverdicted · none · ref 19 · internal anchor
