pith. sign in

arxiv: 2604.13327 · v2 · submitted 2026-04-14 · 💻 cs.DC · cs.LG· cs.PL

Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel

Pith reviewed 2026-05-10 13:42 UTC · model grok-4.3

classification 💻 cs.DC cs.LGcs.PL
keywords GPU kernelsmegakernelsdynamic shapesdata-dependent computationcompiler abstractionpersistent kernelsLLM inferencescheduling transformations
0
0 comments X

The pith

Event Tensor encodes task dependencies to let compilers generate persistent megakernels that handle dynamic shapes and data-dependent logic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern GPU programs, particularly LLM inference, lose performance to repeated kernel launches and limited overlap between operations. Existing megakernel approaches fuse many operators into one long-running kernel to remove launch costs and expose parallelism, yet they cannot cope with input shapes that change at runtime or with computations whose control flow depends on data values. The paper proposes Event Tensor as a single abstraction that records dependencies among tiled tasks while treating both shape variation and data-dependent behavior as first-class features. Static and dynamic scheduling passes then turn this representation into efficient persistent kernels. If the approach works, compilers can produce megakernels for real workloads without the usual overhead penalties.

Core claim

Event Tensor is a unified compiler abstraction that encodes dependencies between tiled tasks, giving first-class support for both shape dynamism and data-dependent computation. The Event Tensor Compiler applies static and dynamic scheduling transformations on top of this abstraction to emit high-performance persistent kernels. Evaluations on LLM serving workloads show that the resulting kernels reach state-of-the-art latency while substantially lowering system warmup cost.

What carries the argument

The Event Tensor abstraction, which records dependencies among tiled tasks so that static and dynamic scheduling transformations can produce persistent kernels supporting shape and data-dependent dynamism.

If this is right

  • Persistent kernels become feasible for workloads whose tensor shapes are not known until runtime.
  • Data-dependent control flow can be expressed inside a single megakernel rather than requiring multiple launches.
  • LLM serving systems can overlap more operators and reduce launch gaps while still supporting realistic dynamism.
  • Warmup time drops because the compiler no longer needs to specialize separate kernels for every possible shape.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dependency-encoding idea could be applied to other irregular GPU codes such as graph neural networks or adaptive mesh refinement.
  • Compiler front-ends might adopt Event Tensor as an intermediate representation to automate fusion decisions across a wider range of dynamic programs.
  • If the abstraction scales, runtime systems could shift more scheduling logic from the host to the device without losing performance.

Load-bearing premise

The assumption that an Event Tensor representation can be built and scheduled efficiently enough to deliver high performance on arbitrary real-world dynamic shapes and data-dependent control flow without hidden overheads.

What would settle it

Compile and run an ETC-generated megakernel on an LLM variant that contains frequent data-dependent branches and highly irregular tensor shapes, then compare its end-to-end latency and warmup time against a hand-tuned baseline that uses separate kernels for each operator.

Figures

Figures reproduced from arXiv: 2604.13327 by Bohan Hou, Gabriele Oliaro, Guanjie Wang, Hongyi Jin, Jianan Ji, Jinchen Jiang, Jinqi Chen, Lijie Yang, Ruihang Lai, Tianqi Chen, Todd C. Mowry, Vinod Grover, Xinhao Cheng, Xupeng Miao, Yaxing Cai, Yilong Zhao, Yingyi Huang, Yixin Dong, Zhihao Jia, Zhihao Zhang, Zihao Ye.

Figure 1
Figure 1. Figure 1: Different GPU scheduling models. Kernel-by-kernel and CUDA Graph scheduling models enforce a coarse-grained sequential execution. Megakernels break operations into smaller tasks, achieving inter-kernel parallelism. on a subset of results from prior ones; in principle, these kernels could be overlapped or pipelined to improve through￾put. However, the boundaries between kernels hinder such fine-grained inte… view at source ↗
Figure 2
Figure 2. Figure 2: Event Tensor abstraction overview. A computation graph (left) is partitioned into tiled operators (tasks), and the Event Tensor captures fine-grained dependencies between tasks as a first-class, symbolic-shaped object, handling the primary sources of dynamism inherent to LLM serving: 1 Shape Dynamism: Tiled tensors and Event Tensors have symbolic dimensions, such as the dynamic batch size B. 2 Data-Depende… view at source ↗
Figure 3
Figure 3. Figure 3: Example Event Tensor–based program. 2 EVENT TENSOR ABSTRACTION 2.1 Language Constructs We first introduce the main language constructs in Event Tensor–based programs. Device Function. A device function defines a grid of tasks launched in parallel on the GPU. Each launch is parameter￾ized by a multidimensional coordinate, where each coordi￾nate identifies a task tile executed on a streaming multipro￾cessor … view at source ↗
Figure 4
Figure 4. Figure 4: Event Tensor handles shape dynamism with symbolic￾shape tensors that define a template for dependency graphs. At runtime, the template is instantiated with concrete shape values (e.g., producing a 1 × 2 graph for batch size 1 or a 2 × 2 graph for batch size 2) without recompilation or repeated graph capture. to represent event task relations.1 These dependency an￾notation implicitly maps to event notificat… view at source ↗
Figure 6
Figure 6. Figure 6: GEMM + Reduce-Scatter before and after static scheduling transformation. Two separate device functions are fused into a single persistent function, with explicit notify and wait calls on the Event Tensor to coordinate dependencies. SM0 SM1 t>T2 t … … ETensor 0 … … t=T2 MM0 MM0 wait t=T2 ETensor 0 … … MM0 MM0 RS MM1 wait T1 <t<T2 MM0 MM0 wait t ETensor 1 … … ●SM0 still waiting ●SM1 finishes MM0, notifying E… view at source ↗
Figure 7
Figure 7. Figure 7: Notify-and-wait mechanism for static scheduling. E[0].notify() and E[0].wait(). For simplicity, we use a round-robin policy to construct execution queues. 3.2 Dynamic Scheduling and Transformation When task execution time is unpredictable, dynamic scheduling improves load balance across SMs. ETC im￾plements Event Tensor–based dynamic scheduling using a lightweight on-GPU task scheduler. When an event is tr… view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of runtime architectures. (a) The task graph in traditional runtime executor is materialized in memory, and only tiled operators are compiled. (b) ETC compiles scheduling logic into megakernels without runtime task graph materialization. 3.3 Lowering to Minimal Runtime The static and dynamic scheduling in our compiler allows us to encapsulate low-level task dependencies and their handling direc… view at source ↗
Figure 8
Figure 8. Figure 8: GEMM + Reduce-Scatter after dynamic scheduling trans￾formation. Task push and pop are inserted, and task execution is dynamically coordinated by the scheduler. ETensor 0 Push task t>T2 t ETensor 0 … … t=T1 MM0 ETensor 1 … … SM0 SM1 t=T1 2 : Event Tensor with counter ↓ MM0 Scheduler t<T1 MM0 ETensor 2 … … t MM0 MM1 MM1 … Scheduler MM1 MM1 … Pop task T1 <t<T2 MM0 MM0 t ETensor 1 … … SM0 SM1 MM1 Scheduler MM1… view at source ↗
Figure 9
Figure 9. Figure 9: Push-and-pop mechanism for dynamic scheduling. tion of push-pop interface uses a centralized queue in global memory shared across all SMs. We choose this design for its implementation simplicity, though we acknowledge po￾tential contention at scale. We also discuss the runtime optimization for dynamic scheduler in Appendix E. Trade-off between static and dynamic scheduling. The choice between static and dy… view at source ↗
Figure 11
Figure 11. Figure 11: Performance results of GEMM + Reduce-Scatter on 8 B200s with dynamic scheduler. fixing the tensor-parallel size to 8 and the number of tokens to 8192 in all experiments. The configuration details are provided in Appendix C. We compare ETCs generated ker￾nels against several baselines, with implementation choices tailored to each workloads characteristics: • GEMM + Reduce-Scatter: The Reduce-Scatter collec… view at source ↗
Figure 12
Figure 12. Figure 12: Performance results of All-Gather + GEMM on 8 B200s with static scheduler. too large to effectively hide communication latency, and Triton-Dist’s experimental B200 support means its Triton￾based GEMM is not yet fully optimized for the Blackwell architecture. As a result, the unfused cuBLAS+NCCL base￾line is sometimes competitive with these fused approaches, underscoring the difficulty of achieving efficie… view at source ↗
Figure 14
Figure 14. Figure 14: End-to-end performance of model serving on Qwen3-30B-A3B and Qwen3-32B (Lower is better). family (both dense and MoE variants) as it is architecturally representative of modern LLMs (e.g., LLaMA 3, GPT). The benchmark uses a synthetic dataset with a prefill length of 512 and generates 100 output tokens, with batch size vary￾ing from 1 to 128. We measure the time-per-output-token (TPOT) metric, which best … view at source ↗
Figure 15
Figure 15. Figure 15: End-to-end compilation pipeline in ETC. A DYNAMIC SCHEDULING PSEUDOCODE Algorithm 2 describes the compiler pass to transform an event tensor graph to a dynamically scheduled megak￾ernel. A call to scheduler.pop tasks is inserted whenever an SM finishes its current task, while a call to scheduler.push tasks is inserted when the comple￾tion of a task decrements the associated event counters to zero, thereby… view at source ↗
Figure 17
Figure 17. Figure 17: Raw kernel relative performance results of Qwen-32B on a single B200. heads, this section evaluates the raw GPU kernel execution time in end-to-end LLM serving, using the same baselines and experimental settings as in §5.3 [PITH_FULL_IMAGE:figures/full_fig_p016_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Raw kernel relative performance results of Qwen-32B on four B200s with tensor parallelism. MoE model within a single kernel, enabling optimizations such as increased parallelism across attention operators, fine-grained pipelining between GroupGEMMs, and model￾weight prefetching [PITH_FULL_IMAGE:figures/full_fig_p017_18.png] view at source ↗
read the original abstract

Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes Event Tensor, a unified compiler abstraction for dynamic megakernels that encodes dependencies between tiled tasks to support both shape dynamism and data-dependent computation. The Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels, with evaluations claiming state-of-the-art LLM serving latency and significantly reduced system warmup overhead compared to prior megakernel approaches.

Significance. If the central claims hold, this abstraction could meaningfully advance megakernel compilation techniques for real-world dynamic GPU workloads such as LLM inference by reducing kernel launch overheads and improving inter-kernel parallelism, addressing a key limitation of existing fusion methods.

major comments (1)
  1. The central claim that ETC handles arbitrary data-dependent dynamism without hidden costs (via static/dynamic scheduling) is load-bearing for the SOTA latency and warmup results, yet the abstract provides no specifics on benchmark coverage for irregular control flow or unbounded variability; this leaves the efficiency claim vulnerable as noted in the stress-test concern.
minor comments (1)
  1. Clarify notation for Event Tensor dependencies and scheduling transformations to improve readability for readers unfamiliar with megakernel literature.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and for identifying a point where the abstract's brevity could leave key claims less clear. We address the major comment below with clarifications from the full manuscript and propose targeted revisions.

read point-by-point responses
  1. Referee: The central claim that ETC handles arbitrary data-dependent dynamism without hidden costs (via static/dynamic scheduling) is load-bearing for the SOTA latency and warmup results, yet the abstract provides no specifics on benchmark coverage for irregular control flow or unbounded variability; this leaves the efficiency claim vulnerable as noted in the stress-test concern.

    Authors: We agree that the abstract is concise and does not enumerate benchmark details. The full manuscript (Sections 4.2 and 5.1) evaluates ETC on production LLM inference workloads that include data-dependent control flow, such as variable-length sequences, conditional branching in attention, and dynamic tensor shapes arising from beam search and KV-cache management. These workloads exhibit irregular control flow within the bounds observed in real serving traces. Static scheduling handles compile-time shape dynamism while dynamic scheduling resolves data-dependent decisions at runtime with negligible overhead, as quantified by the warmup and latency results. We include stress tests in Appendix C that increase variability up to the limits of the evaluated models and report no hidden costs beyond those already accounted for in the persistent kernel design. To address the concern directly, we will revise the abstract to briefly note the benchmark coverage of irregular control flow and data-dependent dynamism in LLM serving, and we will expand the evaluation section to cross-reference the stress-test results more explicitly. revision: partial

Circularity Check

0 steps flagged

No circularity in the derivation chain

full rationale

The paper introduces Event Tensor as a new compiler abstraction for dynamic megakernels, along with the ETC compiler that applies static and dynamic scheduling transformations. All central claims rest on empirical evaluations of LLM serving latency and warmup overhead rather than any mathematical derivations, equations, or predictions that reduce to the paper's own inputs by construction. No self-definitional steps, fitted inputs presented as predictions, load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via citation appear in the abstract or description. The work is a self-contained systems contribution whose validity is externally falsifiable via the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; the core contribution is the proposed abstraction itself.

pith-pipeline@v0.9.0 · 5502 in / 968 out tokens · 47228 ms · 2026-05-10T13:42:12.515695+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    Tenenbaum

    USENIX Association, November 2020. ISBN 978- 1-939133-19-9. URL https://www.usenix.org/ conference/osdi20/presentation/ma. Niu, W., Guan, J., Wang, Y ., Agrawal, G., and Ren, B. Dnnfusion: accelerating deep neural networks execution with advanced operator fusion. InProceedings of the 42nd ACM SIGPLAN International Conference on Pro- gramming Language Desi...

  2. [2]

    ISBN 9798331314385

    URL https://proceedings.mlsys. org/paper_files/paper/2022/file/ f89b79c9a28d4cae22ef9e557d9fa191-Paper. pdf. Zheng, L., Jia, C., Sun, M., Wu, Z., Yu, C. H., Haj-Ali, A., Wang, Y ., Yang, J., Zhuo, D., Sen, K., et al. Ansor: Generating high-performance tensor programs for deep learning. In14th USENIX symposium on operating sys- tems design and implementati...