Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers
Pith reviewed 2026-05-16 05:09 UTC · model grok-4.3
The pith
The sequential-dispatch methodology shows that WebGPU dispatch overhead for LLM inference is 24-36 microseconds on Vulkan.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a sequential-dispatch methodology, the authors establish that naive single-operation benchmarks overestimate WebGPU dispatch cost by about 20 times. The true per-dispatch cost of WebGPU API overhead alone is 24-36 μs on Vulkan and 32-71 μs on Metal, while the total per-operation overhead including Python cost is ~95 μs. This overhead is a primary differentiator: kernel fusion improves throughput by 53% on Vulkan but provides no benefit on CUDA. The characterization covers four GPU vendors, three backends, three browsers, and two model sizes at batch size one.
What carries the argument
The sequential-dispatch methodology, which chains operations in sequence to isolate WebGPU API overhead from Python noise, OS scheduling, and browser timing artifacts.
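As a rough illustration of why chaining matters, the sketch below contrasts timing one dispatch per synchronization with timing many dispatches followed by a single synchronization. It is not the authors' harness: `dispatch`, `sync`, and `SimulatedDevice` are hypothetical stand-ins, and the 30 μs and 600 μs costs are invented for the example.

```python
import time


def single_op_cost(dispatch, sync, clock=time.perf_counter):
    """Naive benchmark: one dispatch, then a sync, all charged to that dispatch."""
    start = clock()
    dispatch()
    sync()
    return clock() - start


def sequential_cost(dispatch, sync, n, clock=time.perf_counter):
    """Sequential benchmark: n chained dispatches, one sync, averaged over n."""
    start = clock()
    for _ in range(n):
        dispatch()
    sync()  # sync latency is amortized across all n dispatches
    return (clock() - start) / n


class SimulatedDevice:
    """Fake device whose clock advances by made-up costs, in microseconds."""

    def __init__(self, dispatch_us=30, sync_us=600):
        self.t = 0
        self.dispatch_us = dispatch_us
        self.sync_us = sync_us

    def now(self):
        return self.t

    def dispatch(self):
        self.t += self.dispatch_us

    def sync(self):
        self.t += self.sync_us


dev = SimulatedDevice()
naive = single_op_cost(dev.dispatch, dev.sync, dev.now)          # 30 + 600
chained = sequential_cost(dev.dispatch, dev.sync, 1000, dev.now)  # (30000 + 600) / 1000
print(naive, chained)
```

In this toy setup the naive figure is dominated by the synchronization cost, while the chained figure approaches the per-dispatch cost as `n` grows. Whether real synchronization behaves this way on a given backend is exactly what the paper's measurements are meant to establish.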
If this is right
- Kernel fusion improves throughput by 53% on Vulkan but provides no benefit on CUDA.
- The torch-webgpu backend achieves 11-12% of CUDA performance on the reference platform.
- Backend choice is the dominant factor in dispatch overhead, although implementation choice still matters within a backend (2.2x variation on Metal).
- Per-operation overhead dominates kernel compute efficiency at batch size 1 regardless of kernel quality.
- At float32, an RTX PRO 2000 achieves 1.4x the throughput of WebGPU despite having roughly 6x less compute than an RTX 5090.
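The overhead-dominated regime these points describe can be sketched with simple arithmetic. The ~95 μs per-operation figure is taken from the paper; the dispatch counts per token below are hypothetical placeholders, and the model ignores kernel compute time entirely, so it gives an upper bound on throughput, not a prediction.

```python
def overhead_bound_tokens_per_s(dispatches_per_token, per_op_overhead_us):
    """Upper bound on tokens/s if per-operation overhead were the only cost."""
    per_token_us = dispatches_per_token * per_op_overhead_us
    return 1e6 / per_token_us


# Hypothetical dispatch counts; only the 95 us figure comes from the paper.
unfused = overhead_bound_tokens_per_s(dispatches_per_token=1000, per_op_overhead_us=95.0)
fused = overhead_bound_tokens_per_s(dispatches_per_token=500, per_op_overhead_us=95.0)
print(f"unfused bound: {unfused:.2f} tok/s, fused bound: {fused:.2f} tok/s")
```

Under this simplified model, halving the dispatch count doubles the bound, which is why fusion helps where dispatch is expensive and does little where it is cheap. The paper's 53% figure is a measurement and is not derived from this model.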
Where Pith is reading between the lines
- Reducing per-dispatch validation costs inside WebGPU implementations could narrow the performance gap to native APIs for small-batch inference workloads.
- The same sequential chaining approach could be used to measure overhead in other web compute APIs or frameworks.
- Browser and runtime developers might achieve larger gains by targeting dispatch costs than by further tuning individual shader kernels.
Load-bearing premise
The sequential-dispatch methodology fully isolates WebGPU API overhead from Python interpreter noise, OS scheduling, or browser-specific timing artifacts.
What would settle it
Direct measurement of equivalent dispatch sequences using native Vulkan or Metal API calls without the WebGPU abstraction or Python layer to compare against the reported 24-36 μs and 32-71 μs values.
Original abstract
WebGPU's security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. We present a systematic characterization of WebGPU dispatch overhead for LLM inference at batch size 1, spanning four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native) and three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). Our primary contribution is a sequential-dispatch methodology that reveals naive single-operation benchmarks overestimate dispatch cost by ~20×. The true per-dispatch cost of WebGPU API overhead alone is 24-36 μs on Vulkan and 32-71 μs on Metal, while the total per-operation overhead including Python cost is ~95 μs, which turns out to be a distinction critical for optimization. On Vulkan, kernel fusion improves throughput by 53%, while CUDA fusion provides no benefit, confirming that per-operation overhead is a primary differentiator. LLM inference was tested across three major operating systems (Linux, Windows, macOS). We built torch-webgpu, a PrivateUse1-based out-of-tree PyTorch backend and an FX-to-WebGPU compiler, which on our reference platform achieves 11-12% of CUDA performance. At dtype-matched float32, RTX PRO 2000 achieves 1.4× WebGPU's throughput despite ~6× less compute than RTX 5090. For dispatch overhead, backend choice is the dominant factor, although implementation choice also matters substantially within a backend (2.2× for Metal). In terms of dispatch vs kernel compute efficiency, we conclude that at batch=1 with the current dispatch-heavy pipeline, per-operation overhead dominates regardless of kernel quality. All code, benchmarks, and raw data are open source.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical characterization of WebGPU dispatch overhead for batch-1 LLM inference across four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native), three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B, 1.5B). The central claim is that a sequential-dispatch methodology reveals naive single-operation benchmarks overestimate per-dispatch cost by ~20×, yielding WebGPU API overheads of 24-36 μs on Vulkan and 32-71 μs on Metal, with total per-operation overhead (including Python) of ~95 μs; kernel fusion improves throughput by 53% on Vulkan but not CUDA. The authors also introduce torch-webgpu (a PrivateUse1 PyTorch backend) and an FX-to-WebGPU compiler that reaches 11-12% of CUDA performance, concluding that per-operation overhead dominates at batch=1.
Significance. If the overhead isolation holds, the work supplies concrete, cross-vendor data on a key bottleneck for browser-based LLM inference and demonstrates that fusion is a primary lever on certain backends. The open release of code, benchmarks, and raw data supports reproducibility and follow-on engineering. The finding that backend choice dominates dispatch cost while implementation details matter within a backend (2.2× on Metal) is actionable for WebGPU runtime developers.
major comments (2)
- [Methodology / Results] The sequential-dispatch methodology (described in the abstract and results) is load-bearing for the ~20× overestimate claim and the 24-36 μs / 32-71 μs API-only figures. No control experiment is described that uses a native-language harness or places high-resolution monotonic timing directly around the WebGPU call site to subtract Python interpreter, GIL, browser event-loop, or OS-timer contributions; any residual per-iteration cost would scale directly into the reported per-dispatch values.
- [Results] The 11-12% of CUDA performance claim for torch-webgpu on the reference platform (abstract) is presented without a side-by-side description of the CUDA baseline configuration, including whether the same FX compilation pipeline, kernel fusion, or dtype handling was applied; this makes it impossible to isolate how much of the gap is attributable to dispatch overhead versus other implementation differences.
minor comments (2)
- [Abstract] The abstract states that all code, benchmarks, and raw data are open source but does not include a repository URL or access instructions.
- [Abstract / Results] Reported overhead ranges (24-36 μs, 32-71 μs, ~95 μs) would be strengthened by explicit mention of the number of repetitions, timing method (e.g., monotonic clock), and any observed variability or error bars.
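The kind of reporting the last comment asks for can be sketched in a few lines. This is only an illustration of repetition count, a monotonic clock, and spread statistics; the function names and the default of 100 repetitions are assumptions, not details from the paper.

```python
import statistics
import time


def summarize_us(samples_ns):
    """Summarize nanosecond samples as microsecond statistics."""
    us = sorted(s / 1000 for s in samples_ns)
    return {
        "n": len(us),
        "median_us": statistics.median(us),
        "min_us": us[0],
        "max_us": us[-1],
        "stdev_us": statistics.stdev(us) if len(us) > 1 else 0.0,
    }


def time_repeated(fn, repetitions=100):
    """Time fn with a monotonic nanosecond clock and report the spread."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter_ns()
        fn()
        samples.append(time.perf_counter_ns() - start)
    return summarize_us(samples)
```

Reporting the median together with the minimum, maximum, and standard deviation makes it easier to judge whether a range such as 24-36 μs reflects differences between platforms or run-to-run noise.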
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We have carefully considered the points raised and revised the paper to provide additional methodological details and clarifications on the baseline comparisons. Our point-by-point responses are provided below.
- Referee: [Methodology / Results] The sequential-dispatch methodology (described in the abstract and results) is load-bearing for the ~20× overestimate claim and the 24-36 μs / 32-71 μs API-only figures. No control experiment is described that uses a native-language harness or places high-resolution monotonic timing directly around the WebGPU call site to subtract Python interpreter, GIL, browser event-loop, or OS-timer contributions; any residual per-iteration cost would scale directly into the reported per-dispatch values.
  Authors: We agree that explicit controls are important for validating the overhead isolation. In the revised manuscript, we have added a dedicated subsection in the Methodology describing the timing procedure: we employ high-resolution monotonic timers (via performance.now() in JavaScript and time.perf_counter in Python) placed directly around the WebGPU dispatch invocations within the sequential-dispatch loop. A separate control measures the overhead of the Python/JS loop by executing equivalent iterations with stubbed no-op dispatches, allowing us to subtract interpreter, GIL, and event-loop contributions. We did not implement a full native-language (C++) harness for direct WebGPU calls, because our primary interest is the overhead in the torch-webgpu user-facing path; we have included this as a noted limitation and argue that the reported values reflect practical costs for LLM inference workloads. This revision ensures the per-dispatch figures are robust against residual costs.
  Revision: yes
- Referee: [Results] The 11-12% of CUDA performance claim for torch-webgpu on the reference platform (abstract) is presented without a side-by-side description of the CUDA baseline configuration, including whether the same FX compilation pipeline, kernel fusion, or dtype handling was applied; this makes it impossible to isolate how much of the gap is attributable to dispatch overhead versus other implementation differences.
  Authors: We thank the referee for highlighting this ambiguity. The revised manuscript now includes an expanded description of the CUDA baseline in the Results section, along with a new comparison table. The CUDA experiments use the standard PyTorch CUDA backend with equivalent model loading, the same FX graph compilation where possible (via torch.compile), dtype matching (float32), and kernel fusion applied consistently for both backends. The 11-12% figure represents end-to-end throughput for the full inference pipeline on the reference hardware. We clarify that the performance gap arises from multiple factors including dispatch overhead, kernel optimization maturity, and memory management differences, with dispatch being a dominant contributor at batch size 1 as shown in our other experiments. This allows readers to better contextualize the result.
  Revision: yes
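The stubbed no-op control described in the first response above can be sketched as a baseline subtraction. This is a guess at the general shape of such a control, not the authors' code: `real_dispatch` and `stub_dispatch` are hypothetical callables, and the 100 μs and 70 μs costs in the simulated usage are invented.

```python
import time


def loop_time_per_iter(step, n, clock=time.perf_counter):
    """Average wall time per iteration of a loop that calls step() n times."""
    start = clock()
    for _ in range(n):
        step()
    return (clock() - start) / n


def api_only_overhead(real_dispatch, stub_dispatch, n, clock=time.perf_counter):
    """Per-dispatch cost with the harness (loop + interpreter) cost subtracted."""
    with_api = loop_time_per_iter(real_dispatch, n, clock)
    loop_only = loop_time_per_iter(stub_dispatch, n, clock)
    return with_api - loop_only


# Simulated usage with a fake microsecond clock and made-up costs.
fake_time = [0]


def real_dispatch():
    fake_time[0] += 100  # hypothetical: harness cost plus API cost


def stub_dispatch():
    fake_time[0] += 70   # hypothetical: harness cost only


estimate = api_only_overhead(real_dispatch, stub_dispatch, 50, lambda: fake_time[0])
print(estimate)
```

The subtraction assumes the harness cost and the API cost add linearly and that the stub exercises the same interpreter path as the real call. If either assumption fails, some residual cost stays in the estimate, which is the referee's underlying concern.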
Circularity Check
Empirical measurement study with no circular derivations or self-referential reductions
The paper is a characterization study that reports measured dispatch overhead values (24-36 μs on Vulkan, 32-71 μs on Metal, ~95 μs total) obtained via a sequential-dispatch timing methodology across hardware and backends. No equations, fitted parameters, or derivations appear in the provided text; the reported numbers are direct empirical outputs rather than quantities constructed by re-using the same inputs or self-citations. The methodology is presented as an experimental technique for isolating overhead, with no load-bearing step that reduces the final figures to definitions or prior self-referential results. This is a standard self-contained empirical paper whose central claims rest on open-source benchmarks and raw data rather than any circular chain.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: timing measurements capture only the intended dispatch overhead, without significant interference from system noise or measurement tools.