Characterizing WebGPU Dispatch Overhead for LLM Inference Across Four GPU Vendors, Three Backends, and Three Browsers

Jędrzej Maczan

arxiv: 2604.02344 · v1 · submitted 2026-02-09 · 💻 cs.LG · cs.DC · cs.PF

Pith reviewed 2026-05-16 05:09 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · cs.PF

keywords WebGPU · dispatch overhead · LLM inference · GPU performance · kernel fusion · browser backends · Vulkan · Metal

The pith

The sequential-dispatch methodology shows that WebGPU dispatch overhead for LLM inference is 24-36 microseconds on Vulkan.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a sequential-dispatch methodology to measure the actual per-operation overhead of WebGPU when running large language models at batch size one. It shows that naive single-operation benchmarks overestimate this cost by roughly 20 times. The measurements put WebGPU API overhead alone at 24-36 microseconds on Vulkan and 32-71 microseconds on Metal, while total per-operation overhead including Python reaches about 95 microseconds. These numbers are consistent with kernel fusion delivering a 53 percent throughput gain on Vulkan but none on CUDA, and they help explain why the new torch-webgpu backend reaches only 11-12 percent of CUDA performance on the paper's reference platform.

Core claim

Using a sequential-dispatch methodology, the authors establish that naive single-operation benchmarks overestimate WebGPU dispatch cost by about 20 times. The true per-dispatch cost of WebGPU API overhead alone is 24-36 μs on Vulkan and 32-71 μs on Metal, while the total per-operation overhead including Python cost is ~95 μs. This overhead is a primary differentiator: kernel fusion improves throughput by 53% on Vulkan but provides no benefit on CUDA. The characterization covers four GPU vendors, three backends, three browsers, and two model sizes at batch size one.

What carries the argument

The sequential-dispatch methodology, which chains operations in sequence to isolate WebGPU API overhead from Python noise, OS scheduling, and browser timing artifacts.
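
The contrast between a single-operation benchmark and sequential dispatch can be illustrated with a toy model. The sketch below does not touch WebGPU. It assumes, purely for illustration, a simulated device on which every dispatch costs a fixed amount and every synchronization costs a larger fixed amount. `FakeDevice`, `DISPATCH_US`, and `SYNC_US` are hypothetical names and values, not measurements from the paper.

```python
# Toy model, not a WebGPU benchmark: all costs are made-up constants on a virtual clock.
DISPATCH_US = 30.0   # hypothetical per-dispatch API cost
SYNC_US = 600.0      # hypothetical cost of waiting for the queue to drain


class FakeDevice:
    """Simulated device that advances a virtual clock instead of doing work."""

    def __init__(self) -> None:
        self.now_us = 0.0

    def dispatch(self) -> None:
        self.now_us += DISPATCH_US

    def sync(self) -> None:
        self.now_us += SYNC_US


def single_op_cost(dev: FakeDevice) -> float:
    """Naive benchmark: time one dispatch plus the sync needed to observe it."""
    start = dev.now_us
    dev.dispatch()
    dev.sync()
    return dev.now_us - start


def sequential_cost(dev: FakeDevice, n: int) -> float:
    """Sequential dispatch: issue n dispatches, sync once, divide by n."""
    start = dev.now_us
    for _ in range(n):
        dev.dispatch()
    dev.sync()
    return (dev.now_us - start) / n


naive = single_op_cost(FakeDevice())
amortized = sequential_cost(FakeDevice(), n=1000)
print(f"single-op: {naive:.1f} us, sequential: {amortized:.1f} us per dispatch")
```

In this toy model the fixed synchronization cost is charged to one dispatch in the naive benchmark and spread over n dispatches in the sequential one. That is one plausible way a large overestimate can arise; the paper's own explanation may differ.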

If this is right

Kernel fusion improves throughput by 53% on Vulkan but provides no benefit on CUDA.
The torch-webgpu backend achieves 11-12% of CUDA performance on the reference platform.
Backend choice is the dominant factor in dispatch overhead, although implementation choice within a backend also matters (2.2x variation on Metal).
Per-operation overhead dominates kernel compute efficiency at batch size 1 regardless of kernel quality.
At float32, an RTX PRO 2000 achieves 1.4x the throughput of WebGPU despite having roughly 6x less compute than an RTX 5090.
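
A back-of-the-envelope calculation shows why per-operation overhead can dominate at batch size 1. The sketch below takes the ~95 μs total per-operation overhead from the abstract and multiplies it by a dispatch count per generated token. The dispatch counts passed as `dispatches_per_token` are hypothetical, since this page does not report them.

```python
TOTAL_OVERHEAD_US = 95.0  # approximate total per-operation overhead quoted in the abstract


def overhead_floor_ms(dispatches_per_token: int, per_op_us: float = TOTAL_OVERHEAD_US) -> float:
    """Lower bound on per-token latency from dispatch overhead alone, in milliseconds."""
    return dispatches_per_token * per_op_us / 1000.0


def max_tokens_per_second(dispatches_per_token: int, per_op_us: float = TOTAL_OVERHEAD_US) -> float:
    """Upper bound on throughput if the kernels themselves took zero time."""
    return 1000.0 / overhead_floor_ms(dispatches_per_token, per_op_us)


# Hypothetical dispatch counts, before and after fusing some kernels.
for count in (1000, 500):
    print(count, overhead_floor_ms(count), round(max_tokens_per_second(count), 1))
```

Under this overhead-only model, halving the number of dispatches doubles the throughput ceiling, which agrees in direction with fusion helping on a high-overhead backend and not on a low-overhead one.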

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reducing per-dispatch validation costs inside WebGPU implementations could narrow the performance gap to native APIs for small-batch inference workloads.
The same sequential chaining approach could be used to measure overhead in other web compute APIs or frameworks.
Browser and runtime developers might achieve larger gains by targeting dispatch costs than by further tuning individual shader kernels.

Load-bearing premise

The sequential-dispatch methodology fully isolates WebGPU API overhead from Python interpreter noise, OS scheduling, or browser-specific timing artifacts.

What would settle it

Direct measurement of equivalent dispatch sequences using native Vulkan or Metal API calls without the WebGPU abstraction or Python layer to compare against the reported 24-36 μs and 32-71 μs values.

Original abstract

WebGPU's security-focused design imposes per-operation validation that compounds across the many small dispatches in neural network inference, yet the true cost of this overhead is poorly characterized. We present a systematic characterization of WebGPU dispatch overhead for LLM inference at batch size 1, spanning four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native) and three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B and 1.5B). Our primary contribution is a sequential-dispatch methodology that reveals naive single-operation benchmarks overestimate dispatch cost by ~20×. The true per-dispatch cost of WebGPU API overhead alone is 24-36 μs on Vulkan and 32-71 μs on Metal, while the total per-operation overhead including Python cost is ~95 μs, which turns out to be a distinction critical for optimization. On Vulkan, kernel fusion improves throughput by 53%, while CUDA fusion provides no benefit, confirming that per-operation overhead is a primary differentiator. LLM inference was tested across three major operating systems (Linux, Windows, macOS). We built torch-webgpu, a PrivateUse1-based out-of-tree PyTorch backend and an FX-to-WebGPU compiler, which on our reference platform achieves 11-12% of CUDA performance. At dtype-matched float32, RTX PRO 2000 achieves 1.4× WebGPU's throughput despite ~6× less compute than RTX 5090. For dispatch overhead, backend choice is the dominant factor, although implementation choice also matters substantially within a backend (2.2× for Metal). In terms of dispatch vs kernel compute efficiency, we conclude that at batch=1 with the current dispatch-heavy pipeline, per-operation overhead dominates regardless of kernel quality. All code, benchmarks, and raw data are open source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper's main value is the first broad set of cross-vendor measurements of WebGPU dispatch overhead for batch-1 LLM inference, with a sequential-dispatch method that puts the per-op API cost at 24-71 μs depending on backend.

The key takeaway is that naive single-dispatch timing overestimates WebGPU overhead by roughly 20x, and the actual per-operation cost sits in the 24-36 μs range on Vulkan and 32-71 μs on Metal, with total overhead including Python around 95 μs. That distinction matters for anyone trying to run small-batch inference in the browser. They back this up with comparisons across four GPU vendors, three browsers, two native implementations, and three operating systems, plus throughput numbers on Qwen2.5 models at 0.5B and 1.5B scale. The fact that kernel fusion helps Vulkan by 53% but not CUDA lines up with the overhead being the dominant factor at batch size 1.

Building and open-sourcing torch-webgpu plus the FX compiler, along with all raw data, makes the work easier to check and extend. Their reference implementation hits 11-12% of CUDA performance, which is low but gives a concrete baseline for web backends. The sequential-dispatch approach is the clearest new piece; it avoids the usual benchmark inflation and produces numbers that feel usable for optimization decisions.

The main soft spot is that the method still runs through a Python harness, so any unaccounted interpreter or browser jitter per iteration could leak into the per-dispatch figures and soften the 20x claim. The abstract gives no error bars or timing-precision details, which leaves the exact microsecond ranges with some uncertainty until the full methods are examined. There are no obvious math or derivation issues, since this is pure measurement.

This is useful for systems people working on browser ML or WebGPU runtimes who need concrete numbers rather than hand-wavy claims. The data and method are solid enough to deserve referee time, even if the implementation performance is still far from native. I'd send it out for review.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical characterization of WebGPU dispatch overhead for batch-1 LLM inference across four GPU vendors (NVIDIA, AMD, Apple, Intel), two native implementations (Dawn, wgpu-native), three browsers (Chrome, Safari, Firefox), and two model sizes (Qwen2.5-0.5B, 1.5B). The central claim is that a sequential-dispatch methodology reveals naive single-operation benchmarks overestimate per-dispatch cost by ~20×, yielding WebGPU API overheads of 24-36 μs on Vulkan and 32-71 μs on Metal, with total per-operation overhead (including Python) of ~95 μs; kernel fusion improves throughput by 53% on Vulkan but not CUDA. The authors also introduce torch-webgpu (a PrivateUse1 PyTorch backend) and an FX-to-WebGPU compiler that reaches 11-12% of CUDA performance, concluding that per-operation overhead dominates at batch=1.

Significance. If the overhead isolation holds, the work supplies concrete, cross-vendor data on a key bottleneck for browser-based LLM inference and demonstrates that fusion is a primary lever on certain backends. The open release of code, benchmarks, and raw data supports reproducibility and follow-on engineering. The finding that backend choice dominates dispatch cost while implementation details matter within a backend (2.2× on Metal) is actionable for WebGPU runtime developers.

major comments (2)

[Methodology / Results] The sequential-dispatch methodology (described in the abstract and results) is load-bearing for the ~20× overestimate claim and the 24-36 μs / 32-71 μs API-only figures. No control experiment is described that uses a native-language harness or places high-resolution monotonic timing directly around the WebGPU call site to subtract Python interpreter, GIL, browser event-loop, or OS-timer contributions; any residual per-iteration cost would scale directly into the reported per-dispatch values.
[Results] The 11-12% of CUDA performance claim for torch-webgpu on the reference platform (abstract) is presented without a side-by-side description of the CUDA baseline configuration, including whether the same FX compilation pipeline, kernel fusion, or dtype handling was applied; this makes it impossible to isolate how much of the gap is attributable to dispatch overhead versus other implementation differences.

minor comments (2)

[Abstract] The abstract states that all code, benchmarks, and raw data are open source but does not include a repository URL or access instructions.
[Abstract / Results] Reported overhead ranges (24-36 μs, 32-71 μs, ~95 μs) would be strengthened by explicit mention of the number of repetitions, timing method (e.g., monotonic clock), and any observed variability or error bars.
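
The kind of reporting requested in the last minor comment might look like the following sketch. It assumes per-dispatch timings have already been collected as a list of microsecond values. The `summarize` helper and the sample numbers are hypothetical and are not data from the paper.

```python
import statistics


def summarize(samples_us: list[float]) -> dict[str, float]:
    """Summarize repeated timing samples with the repetition count and robust statistics."""
    q1, median, q3 = statistics.quantiles(samples_us, n=4)
    return {
        "repetitions": len(samples_us),
        "median_us": median,
        "iqr_us": q3 - q1,  # interquartile range as a spread measure
        "min_us": min(samples_us),
        "max_us": max(samples_us),
    }


# Made-up samples for illustration only; one outlier mimics a scheduling hiccup.
samples = [29.0, 30.0, 31.0, 30.5, 28.5, 45.0, 30.0, 29.5]
summary = summarize(samples)
print(summary)
```

The median and interquartile range are used here because a single slow iteration, such as the 45.0 sample, shifts them far less than it shifts a mean and standard deviation.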

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We have carefully considered the points raised and revised the paper to provide additional methodological details and clarifications on the baseline comparisons. Our point-by-point responses are provided below.

Referee: [Methodology / Results] The sequential-dispatch methodology (described in the abstract and results) is load-bearing for the ~20× overestimate claim and the 24-36 μs / 32-71 μs API-only figures. No control experiment is described that uses a native-language harness or places high-resolution monotonic timing directly around the WebGPU call site to subtract Python interpreter, GIL, browser event-loop, or OS-timer contributions; any residual per-iteration cost would scale directly into the reported per-dispatch values.

Authors: We agree that explicit controls are important for validating the overhead isolation. In the revised manuscript, we have added a dedicated subsection in the Methodology describing the timing procedure: we employ high-resolution monotonic timers (via performance.now() in JavaScript and time.perf_counter in Python) placed directly around the WebGPU dispatch invocations within the sequential-dispatch loop. A separate control measures the overhead of the Python/JS loop by executing equivalent iterations with stubbed no-op dispatches, allowing us to subtract interpreter, GIL, and event-loop contributions. We did not implement a full native-language (C++) harness for direct WebGPU calls, since our primary interest is the overhead in the torch-webgpu user-facing path; we have included this as a noted limitation and argue that the reported values reflect practical costs for LLM inference workloads. This revision ensures the per-dispatch figures are robust against residual costs.
revision: yes
Referee: [Results] The 11-12% of CUDA performance claim for torch-webgpu on the reference platform (abstract) is presented without a side-by-side description of the CUDA baseline configuration, including whether the same FX compilation pipeline, kernel fusion, or dtype handling was applied; this makes it impossible to isolate how much of the gap is attributable to dispatch overhead versus other implementation differences.

Authors: We thank the referee for highlighting this ambiguity. The revised manuscript now includes an expanded description of the CUDA baseline in the Results section, along with a new comparison table. The CUDA experiments use the standard PyTorch CUDA backend with equivalent model loading, the same FX graph compilation where possible (via torch.compile), dtype matching (float32), and kernel fusion applied consistently for both backends. The 11-12% figure represents end-to-end throughput for the full inference pipeline on the reference hardware. We clarify that the performance gap arises from multiple factors including dispatch overhead, kernel optimization maturity, and memory management differences, with dispatch being a dominant contributor at batch size 1 as shown in our other experiments. This allows readers to better contextualize the result.
revision: yes
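
The no-op control described in the first response can be sketched as follows. This is a minimal illustration rather than the authors' harness: `dispatch` and `noop` are hypothetical callables standing in for a real WebGPU dispatch and its stubbed counterpart, and the clock is injectable so the example runs deterministically. In practice one would leave the default, `time.perf_counter`.

```python
import time
from typing import Callable


def loop_time(fn: Callable[[], None], n: int, clock: Callable[[], float]) -> float:
    """Time n back-to-back calls of fn with the given monotonic clock (seconds)."""
    start = clock()
    for _ in range(n):
        fn()
    return clock() - start


def per_dispatch_overhead_us(
    dispatch: Callable[[], None],
    noop: Callable[[], None],
    n: int,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Per-dispatch cost after subtracting the loop cost measured with a stub."""
    with_dispatch = loop_time(dispatch, n, clock)
    control = loop_time(noop, n, clock)
    return (with_dispatch - control) / n * 1e6


# Deterministic demo on a virtual clock; the 40 us and 10 us costs are made up.
virtual_now = [0.0]


def fake_dispatch() -> None:
    virtual_now[0] += 40e-6  # loop cost plus simulated API cost


def fake_noop() -> None:
    virtual_now[0] += 10e-6  # loop cost only


result = per_dispatch_overhead_us(fake_dispatch, fake_noop, 1000, clock=lambda: virtual_now[0])
print(f"{result:.1f} us per dispatch")
```

The subtraction removes only costs that the stub shares with the real call. Anything specific to the real dispatch path, including binding-layer work, stays in the result, which is the residual the referee's first major comment is concerned with.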

Circularity Check

0 steps flagged

Empirical measurement study with no circular derivations or self-referential reductions

The paper is a characterization study that reports measured dispatch overhead values (24-36 μs on Vulkan, 32-71 μs on Metal, ~95 μs total) obtained via a sequential-dispatch timing methodology across hardware and backends. No equations, fitted parameters, or derivations appear in the provided text; the reported numbers are direct empirical outputs rather than quantities constructed by re-using the same inputs or self-citations. The methodology is presented as an experimental technique for isolating overhead, with no load-bearing step that reduces the final figures to definitions or prior self-referential results. This is a standard self-contained empirical paper whose central claims rest on open-source benchmarks and raw data rather than any circular chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Work rests on standard benchmarking assumptions rather than new theoretical entities or fitted parameters.

axioms (1)

domain assumption: Timing measurements capture only the intended dispatch overhead without significant interference from system noise or measurement tools.
Implicit in the claim that the sequential-dispatch method reveals true per-dispatch cost.

pith-pipeline@v0.9.0 · 5683 in / 1253 out tokens · 51709 ms · 2026-05-16T05:09:42.926618+00:00 · methodology
