pith. machine review for the scientific record.

arxiv: 2604.15379 · v1 · submitted 2026-04-15 · 💻 cs.AR

Recognition: unknown

Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:32 UTC · model grok-4.3

classification 💻 cs.AR
keywords: chiplet tasks · hierarchical task model · multi-die GPUs · megakernels · LLM inference · cache locality · persistent kernels · L2 reuse

The pith

Fleet adds Chiplet-tasks to bind megakernel work to GPU chiplets and exploit their shared L2 caches for lower memory traffic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern GPUs use chiplet designs with separate cache hierarchies, yet standard programming models treat the device as flat and force redundant data movement in memory-bound jobs such as LLM inference. Fleet supplies a four-level task hierarchy that adds Chiplet-tasks as the missing middle layer between compute-unit tasks and device-wide tasks. These Chiplet-tasks tie work and data to one chiplet so that workers can reuse the chiplet's private L2 without leaving the scope. The model is realized through a persistent-kernel runtime that schedules tasks per chiplet, letting threads cooperate on weight tiles. The resulting higher cache reuse directly reduces high-bandwidth memory fetches and improves end-to-end latency and throughput on real multi-die hardware.
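To make the four levels concrete, here is a minimal sketch of how such a task descriptor could be laid out; the type and field names are editorial assumptions, not Fleet's actual interface.

```cpp
// Minimal sketch of a four-level task descriptor; names and fields are
// editorial assumptions, not Fleet's actual interface.
#include <cstdint>

// The four memory scopes the hierarchy maps onto, narrowest to widest.
enum class TaskScope : uint8_t {
    Wavefront,  // one wavefront: register/LDS locality
    CU,         // one compute unit: L1/LDS locality
    Chiplet,    // one XCD: shared 4 MB L2 locality (the level Fleet adds)
    Device      // whole GPU: MALL/HBM scope
};

// What a per-chiplet scheduler might pull out of device memory.
struct TaskDescriptor {
    TaskScope scope;       // which level this task binds to
    uint16_t  chiplet_id;  // for Chiplet-tasks: which XCD owns the bound data
    uint32_t  op_id;       // which fused operator (e.g. one GEMM) this task belongs to
    uint32_t  tile_begin;  // first output tile covered by the task
    uint32_t  tile_end;    // one past the last output tile
};
```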

Core claim

Fleet is a multi-level task model that maps computation to memory scopes by introducing Chiplet-tasks, which bind work and data to a chiplet and enable coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse.
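As an illustration of what a persistent kernel runtime with per-chiplet scheduling can look like, the HIP/CUDA-style C++ sketch below has every worker block loop over its chiplet's task list: Chiplet-tasks are broadcast to all workers on the chiplet, while CU-tasks go to one worker each. The queue layout, the block-to-chiplet pinning, and the run_* helpers are assumptions made for the sketch, not Fleet's runtime; it reuses the TaskDescriptor from the sketch above.

```cpp
// Sketch of a persistent worker with per-chiplet scheduling; all names and the
// block-to-chiplet pinning scheme are editorial assumptions, not Fleet's runtime.
struct ChipletQueue {
    const TaskDescriptor* tasks;  // task descriptors for this chiplet, in device memory
    unsigned int          count;  // how many tasks are bound to this chiplet
};

// Device-side helpers assumed to exist for the sketch.
__device__ void run_chiplet_task(const TaskDescriptor& t, int worker_id, int workers_per_chiplet);
__device__ void run_cu_task(const TaskDescriptor& t);

// One persistent block per worker: blocks stay resident for the whole decode step,
// so kernel-launch latency is paid once instead of once per operator.
__global__ void persistent_worker(const ChipletQueue* queues, int num_chiplets,
                                  int workers_per_chiplet)
{
    int chiplet   = blockIdx.x % num_chiplets;   // assumed pinning of blocks to XCDs
    int worker_id = blockIdx.x / num_chiplets;   // this worker's position on its XCD
    const ChipletQueue& q = queues[chiplet];

    for (unsigned int i = 0; i < q.count; ++i) {
        const TaskDescriptor t = q.tasks[i];
        if (t.scope == TaskScope::Chiplet) {
            // Broadcast: every worker on the XCD runs the same task and streams the
            // same weight partition, so misses by one worker become L2 hits for the rest.
            run_chiplet_task(t, worker_id, workers_per_chiplet);
        } else if (static_cast<int>(i % workers_per_chiplet) == worker_id) {
            // CU-tasks are dealt round-robin to individual workers.
            run_cu_task(t);
        }
    }
}
```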

What carries the argument

Chiplet-tasks, an abstraction that binds computation and data to a single chiplet so that persistent workers can coordinate through the chiplet's private L2 cache.

If this is right

  • Persistent kernel execution plus per-chiplet scheduling produces 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8.
  • Cooperative weight tiling inside Chiplet-tasks raises L2 hit rate from 12 percent to 54 percent at batch size 32 and from 39 percent to 61 percent at batch size 64.
  • The same tiling cuts HBM traffic by as much as 37 percent; the roofline relation sketched after this list shows how a higher L2 hit rate maps onto lower HBM traffic.
  • Overall, Fleet delivers 1.27-1.30x speedup over a chiplet-unaware megakernel baseline at larger batch sizes.
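Figure 7's roofline relation ties these numbers together: if h is the L2 hit rate, the HBM traffic a GEMM generates scales with (1 − h), so the effective arithmetic intensity rises to B/(1 − h). Worked through with the batch-32 values Figure 7 reports:

```latex
% Roofline relation and worked values taken from Figure 7 of the paper.
\[
\mathrm{AI}_{\mathrm{eff}} \;=\; \frac{B}{1 - h}
\qquad\Longrightarrow\qquad
\mathrm{AI}_{\mathrm{eff}}\big|_{B=32,\; h=0.51} \;=\; \frac{32}{1 - 0.51} \;\approx\; 65 .
\]
```

Since 65 is still well below the MI350 ridge point of 245 FLOP/byte, decode remains memory-bound even with Fleet's reuse, so the traffic saved by the higher hit rate maps almost directly onto the reported speedups.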

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same Chiplet-task binding could be applied to other memory-bound workloads such as graph analytics or scientific stencil codes that already use megakernels.
  • Future GPU programming interfaces may need to expose chiplet boundaries explicitly rather than leaving locality to opaque hardware schedulers.
  • Implementations on other vendors' multi-die GPUs would test whether the reported L2-hit-rate gains are portable or specific to AMD's cache geometry.
  • If the overhead of per-chiplet scheduling stays low, the approach could shift the performance limit for large-batch inference from memory bandwidth to on-chip reuse.

Load-bearing premise

LLM inference remains memory-bound enough that the gains from chiplet-level cache locality exceed the added cost of the persistent runtime and per-chiplet scheduler.

What would settle it

A run of the same Qwen3-8B decode workload at batch size 32 on the MI350 that shows no rise in L2 hit rate above the 12 percent baseline or no drop in HBM traffic when Fleet is enabled would falsify the core performance claim.

Figures

Figures reproduced from arXiv: 2604.15379 by Alexandru Dutu, Ganesh Dasika, Karthik Sangaiah, Muhammad Osama, Ryan Swann, Samuel Bayliss, Sandeepa Bhuyan, Sangeeta Chowdhary, Sean Siddens, Stephen Neuendorffer.

Figure 1. AMD Instinct MI350 memory hierarchy. Each of 8 XCDs has 32 CUs and a private 4 MB L2 cache (TCC). All XCDs share a 256 MB MALL (Infinity Cache) before HBM3.
Figure 2. Standard block scheduling (a) assigns workers on the same XCD to different GEMMs, thrashing L2. Fleet's scheduling (b) partitions weight columns across XCDs and coordinates all workers on each XCD to read the same partition, converting L2 misses into hits. For typical operation sizes (such as the gate-up projection in the Qwen3-8B FFN, where each XCD's partition is [4096, 3072] = 24 MB in bf16), the weigh…
Figure 3. Tile traversal order within one Chiplet-task (4×6 output tile grid, 4 workers illustrated). Cell hue indicates which weight column is accessed; darker shading indicates later timesteps. (a) N-major: consecutive tiles advance along weight columns, so concurrent workers load distinct weight data, increasing L2 cache pressure. (b) M-major: consecutive tiles advance down activation rows and share the same weig…
Figure 4. Fleet system overview. (a) Task graph for one transformer layer at bs=1: standard dispatch decomposes each GEMM into 96–256 independent CU-tasks (1,407 total); Fleet uses eight Chiplet-tasks per GEMM (543 total, 2.6× fewer), with SiLU fused into the gate+up Chiplet-task. (b) Runtime dispatch: each XCD has a scheduler that assigns CU-tasks round-robin to individual workers, or broadcasts Chiplet-tasks to al…
Figure 5. Hierarchical synchronization protocol. Workers increment XCD-local L2 counters without fences (green). Only the last worker on each XCD issues a single buffer_wbl2 fence and updates the global HBM counter (red). Schedulers poll the global counter to dispatch downstream tasks. This reduces cross-XCD fences compared to per-worker global signaling. (A minimal editorial sketch of this two-level counting follows the figure list.)
Figure 7. Roofline analysis of GEMM operations on MI350 (bf16 MFMA). Standard scheduling operates at nominal arithmetic intensity AI = B (batch size in FLOP/byte). Fleet's L2 reuse increases effective AI by reducing HBM traffic: AI_eff = B/(1 − L2 hit rate). At bs=32 with 51% L2 hit rate, Fleet shifts the effective AI from 32 to 65, a 2.0× rightward shift toward the ridge point (245).
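As a companion to Figure 5, here is a minimal sketch of the two-level completion counting it describes, in the same HIP/CUDA-style C++ as above. The counter layout and names are editorial assumptions, one counter is assumed per task instance, and __threadfence() stands in for the single buffer_wbl2 write-back the caption mentions.

```cpp
// Editorial sketch of Figure 5's hierarchical synchronization; names, counter
// layout, and the use of __threadfence() in place of buffer_wbl2 are assumptions.
__device__ void signal_task_done(unsigned int* xcd_counter,    // allocated per task, resident in this XCD's L2
                                 unsigned int* global_counter,  // in HBM, polled by schedulers for downstream tasks
                                 unsigned int  workers_on_xcd)
{
    // Every worker bumps the chiplet-local counter; no device-wide fence is needed
    // here because all readers of this counter sit behind the same L2.
    unsigned int arrived = atomicAdd(xcd_counter, 1u) + 1u;

    // Only the last worker on the XCD pays the cross-chiplet cost: one write-back
    // fence and one global counter update per XCD, instead of one per worker.
    if (arrived == workers_on_xcd) {
        __threadfence();                 // make this XCD's writes visible device-wide
        atomicAdd(global_counter, 1u);   // schedulers dispatch downstream work once all XCDs report
    }
}
```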
Original abstract

Modern GPUs adopt chiplet-based designs with multiple private cache hierarchies, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads to redundant memory traffic and poor cache utilization in memory-bound workloads such as LLM inference. We present Fleet, a multi-level task model that maps computation to memory scopes. Fleet introduces Chiplet-tasks, a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse. On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 through persistent kernel execution and per-chiplet scheduling. At larger batch sizes, cooperative weight tiling increases L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering 1.27-1.30x speedup over a chiplet-unaware megakernel baseline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces Fleet, a hierarchical task-based abstraction for megakernels on multi-die GPUs. It defines Chiplet-tasks to bind work and data to chiplets for coordinated L2 cache reuse, alongside wavefront-, CU-, and device-level tasks. Implemented as a persistent kernel runtime with per-chiplet scheduling, Fleet is evaluated on AMD Instinct MI350 using Qwen3-8B, claiming 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 and 1.27-1.30x speedup at larger batches via improved L2 hit rates (12% to 54% at batch 32; 39% to 61% at batch 64) and up to 37% reduced HBM traffic.

Significance. If substantiated with complete methodology, the work is significant for computer architecture: it directly addresses the mismatch between flat CUDA/HIP models and chiplet-based GPU memory hierarchies in memory-bound LLM inference. The use of real hardware measurements on AMD MI350 and the introduction of a new chiplet-level abstraction provide a practical step toward better locality exploitation, with potential to inform future programming models and runtime systems.

major comments (3)
  1. [Abstract] The central claims of 1.3-1.5x lower decode latency at batch sizes 1-8 and 1.27-1.30x speedup at larger batches are presented without error bars, full experimental methodology, baseline implementation details (e.g., how the chiplet-unaware megakernel baseline was constructed), or variability data. This leaves the performance improvements only partially supported and makes it difficult to isolate the contribution of per-chiplet scheduling from other factors.
  2. [Abstract] No breakdown or measurement of the runtime overhead of the persistent kernel and per-chiplet coordination is provided. This is load-bearing for the small-batch latency claim, as any added costs from task mapping or chiplet-level synchronization could offset the reported gains in latency-sensitive decode workloads.
  3. [Abstract] The L2 hit-rate improvements (12% to 54% at batch size 32) and HBM traffic reduction (up to 37%) are stated without describing the measurement methodology or confirming that the cooperative weight tiling is the sole cause, rather than other unmentioned optimizations.
minor comments (1)
  1. [Abstract] The term 'megakernels' is used without a brief definition or reference, though it is central to the contribution and may not be universally understood.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying existing content in the full paper and committing to revisions that improve clarity and completeness of the experimental presentation.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 1.3-1.5x lower decode latency at batch sizes 1-8 and 1.27-1.30x speedup at larger batches are presented without error bars, full experimental methodology, baseline implementation details (e.g., how the chiplet-unaware megakernel baseline was constructed), or variability data. This leaves the performance improvements only partially supported and makes it difficult to isolate the contribution of per-chiplet scheduling from other factors.

    Authors: The abstract summarizes key results at a high level, while the full experimental methodology, baseline construction (a persistent megakernel without Chiplet-task binding or per-chiplet scheduling), and variability data appear in Section 5. We will revise the abstract to briefly describe the baseline and add error bars from repeated runs to the reported speedups. This will more clearly isolate the contribution of per-chiplet scheduling. revision: yes

  2. Referee: [Abstract] No breakdown or measurement of the runtime overhead of the persistent kernel and per-chiplet coordination is provided. This is load-bearing for the small-batch latency claim, as any added costs from task mapping or chiplet-level synchronization could offset the reported gains in latency-sensitive decode workloads.

    Authors: We agree an explicit overhead breakdown strengthens the small-batch claims. The revised manuscript will add profiling measurements (using AMD tools) of persistent kernel launch, task mapping, and chiplet coordination overheads in the evaluation section, showing these costs are small relative to the locality-driven latency reductions. revision: yes

  3. Referee: [Abstract] The L2 hit-rate improvements (12% to 54% at batch size 32) and HBM traffic reduction (up to 37%) are stated without describing the measurement methodology or confirming that the cooperative weight tiling is the sole cause, rather than other unmentioned optimizations.

    Authors: L2 hit rates were obtained via hardware performance counters and HBM traffic via memory controller statistics, as described in the evaluation section. We will update the abstract to reference this methodology and note that ablation studies isolate cooperative weight tiling (via Chiplet-tasks) as the primary cause, with no other unmentioned optimizations applied beyond the stated megakernel baseline. revision: partial

Circularity Check

0 steps flagged

No circularity: performance claims are direct hardware measurements with no derivation chain

Full rationale

The paper introduces Fleet as a hierarchical task abstraction and persistent kernel runtime for chiplet-based GPUs, then reports empirical speedups (1.3-1.5x lower latency at small batches, 1.27-1.30x at larger batches) measured on AMD MI350 hardware against vLLM and a chiplet-unaware baseline. No equations, first-principles predictions, fitted parameters, or self-citations are invoked to derive results; the central claims rest on observed L2 hit rates, HBM traffic reduction, and latency numbers obtained from execution. The derivation chain is therefore self-contained and non-circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the domain assumption that each chiplet provides a shared L2 cache usable for coordinated reuse, and on the newly introduced entity of Chiplet-tasks, whose benefits are demonstrated only through the reported benchmarks.

axioms (1)
  • domain assumption Chiplet-based GPUs expose private cache hierarchies with a shared L2 cache within each chiplet.
    Invoked in the opening problem statement to motivate the need for chiplet-level tasks.
invented entities (1)
  • Chiplet-tasks no independent evidence
    purpose: Bind work and data to a single chiplet to enable coordination through its shared L2 cache.
    New abstraction introduced to fill the gap between CU-level and device-level tasks.

pith-pipeline@v0.9.0 · 5599 in / 1541 out tokens · 123966 ms · 2026-05-10T11:32:03.669908+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

26 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    AMD. 2024. AMD Instinct MI300X Accelerator. Data sheet. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html

  2. [2]

    AMD. 2025. CDNA4 Instruction Set Architecture Reference Guide. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-cdna4-instruction-set-architecture.pdf

  3. [3]

    Xinhao Cheng, Zhuohan Zhang, Yucheng Zhou, Jianan Ji, Jiarui Jiang, Zihao Zhao, Zhengxiao Xiao, Zhuoran Ye, Yuhui Huang, Rui Lai, Hongyi Jin, Bangcheng Hou, Mengdi Wu, Yinmin Dong, Alex Yip, Shuai Wang, Wei Yang, Xupeng Miao, Tianqi Chen, and Zhihao Jia. 2025. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs. CoRR abs/25...

  4. [4]

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems (NeurIPS 2022). https://proceedings.neurips.cc/paper_files/paper/2022/hash/67d57c32e20fd0a7a302cb81d36e40d5-Abstract-Conference.html

  5. [5]

    Tri Dao, Jay Haziza, Francisco Massa, and Grigory Sizov

  6. [6]

    2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision. In Advances in Neural Information Processing Systems (NeurIPS 2024). arXiv:2407.08608. https://proceedings.neurips.cc/paper_files/paper/2024/hash/7ede97c3e082c6df10a8d6103a2eebd2-Abstract-Conference.html

  7. [7]

    William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, and Simran Arora. 2025. HipKittens: Fast and Furious AMD Kernels. CoRR abs/2511.08083 (Nov. 2025). arXiv:2511.08083

  8. [8]

    Aaron Jarmusch and Sunita Chandrasekaran. 2025. Microbenchmarking NVIDIA's Blackwell Architecture: An In-depth Architectural Analysis. CoRR abs/2512.02189 (Dec. 2025). arXiv:2512.02189

  9. [9]

    Yiwei Jiang, Sangeeta Chowdhary, Nathaniel Morris, Rutwik Jain, Srilatha Manne, and Sam Bayliss. 2026. Power Aware Dynamic Reallocation For Inference. CoRR abs/2601.12241 (Jan. 2026). arXiv:2601.12241

  10. [10]

    Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran. 2017. CUTLASS: Fast Linear Algebra in CUDA C++. NVIDIA Developer Blog. https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/

  11. [11]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP 2023). 611–626. doi:10.1145/3600006.3613165

  12. [12]

    Aniruddha Nrusimha, William Brandon, Mayank Mishra, Yikang Shen, Rameswar Panda, Jonathan Ragan-Kelley, and Yoon Kim. 2025. FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference. CoRR abs/2505.22758 (May 2025). arXiv:2505.22758

  13. [13]

    NVIDIA. 2024. CUTLASS: CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass

  14. [14]

    NVIDIA. 2024. NVIDIA Blackwell Architecture Technical Brief. https://resources.nvidia.com/en-us-blackwell-architecture/blackwell-architecture-technical-brief

  15. [15]

    NVIDIA. 2024. Programming Guide: Thread Block Clusters. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-block-clusters

  16. [16]

    Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, and John D. Owens. 2023. Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU. CoRR abs/2301.03598 (Jan. 2023). arXiv:2301.03598 [cs.DS]

  17. [17]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA 2024). doi:10.1109/ISCA59077.2024.00055

  18. [18]

    Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2023. Efficiently Scaling Transformer Inference. In Proceedings of Machine Learning and Systems (MLSys 2023). https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780...

  19. [19]

    Qwen Team. 2025. Qwen3 Technical Report. CoRR abs/2505.09388 (May 2025). arXiv:2505.09388

  20. [20]

    Benjamin F. Spector, Simran Arora, Aaryan Singhal, Daniel Y. Fu, and Christopher Ré. 2025. ThunderKittens: Simple, Fast, and Adorable AI Kernels. In International Conference on Learning Representations (ICLR 2025). Spotlight. https://proceedings.iclr.cc/paper_files/paper/2025/hash/05dc08730e32441edff52b0fa6caab5f-Abstract-Conference.html

  21. [21]

    Benjamin F. Spector, Jared Juravsky, Sohil Sul, Owen Dugan, Daniel Lim, Daniel Y. Fu, Simran Arora, and Christopher Ré. 2025. Look Ma, No Bubbles! Fusing LLM Decoder Layers into One Megakernel. Hazy Research Blog, Stanford University. https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

  22. [22]

    Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL 2019). 10–19. doi:10.1145/3315508.3329973

  23. [23]

    Mengdi Wu, Xinhao Cheng, Shengyu Liu, Chunan Shi, Jianan Ji, Keren Ao, Praneeth Velliengiri, Xupeng Miao, Oded Padon, and Zhihao Jia

  24. [24]

    Mirage: A Multi-Level Superoptimizer for Tensor Programs. In Proceedings of the 19th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2025). 221–238. https://www.usenix.org/conference/osdi25/presentation/wu-mengdi

  25. [25]

    Biao Zhang and Rico Sennrich. 2019. Root Mean Square Layer Normalization. In Advances in Neural Information Processing Systems (NeurIPS 2019). https://proceedings.neurips.cc/paper_files/paper/2019/hash/1e8a19426224ca89e83cef47f1e7f53b-Abstract.html

  26. [26]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue H. Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. 2023. SGLang: Efficient Execution of Structured Language Model Programs. CoRR abs/2312.07104 (Dec. 2023). arXiv:2312.07104