Recognition: unknown
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs
Pith reviewed 2026-05-10 11:32 UTC · model grok-4.3
The pith
Fleet adds Chiplet-tasks to bind megakernel work to GPU chiplets and exploit their shared L2 caches for lower memory traffic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fleet is a multi-level task model that maps computation to memory scopes by introducing Chiplet-tasks, which bind work and data to a chiplet and enable coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse.
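A minimal sketch of what such a hierarchy could look like as host-side data structures. The names (TaskScope, WeightTile, Task) and fields are illustrative assumptions, not the paper's API; the point is that a Chiplet-task carries both its work range and the data it expects to keep resident in one chiplet's L2.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical task levels mirroring the hierarchy described in the paper:
// wavefront, CU, chiplet, and device scope. Names are illustrative only.
enum class TaskScope : uint8_t { Wavefront, CU, Chiplet, Device };

// A tile of model weights that a Chiplet-task wants to keep resident in the
// chiplet's shared L2 while its workers consume it.
struct WeightTile {
    const void* base;   // device pointer to the tile
    size_t      bytes;  // tile size, ideally <= per-chiplet L2 capacity
};

// Illustrative task descriptor. A Chiplet-task binds work (a range of output
// rows, say) and data (weight tiles) to one chiplet so that all persistent
// workers on that chiplet cooperate on the same cached working set.
struct Task {
    TaskScope scope;
    int       chiplet_id;          // meaningful only when scope == Chiplet
    int       row_begin, row_end;  // work assigned to this task
    std::vector<WeightTile> tiles; // data the task expects to reuse from L2
};
```

Expressing the binding at task-creation time is what would let a per-chiplet scheduler keep a task's tiles and its consumers on the same die.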
What carries the argument
Chiplet-tasks, an abstraction that binds computation and data to a single chiplet so that persistent workers can coordinate through the chiplet's private L2 cache.
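One way the persistent-worker coordination could be structured, written as a HIP-flavored sketch. Everything here is assumed for illustration rather than taken from the paper: the ChipletQueue layout, the stub run_task, and in particular the use of blockIdx.x modulo the chiplet count to identify the local chiplet, which presumes round-robin workgroup dispatch across dies.

```cpp
#include <hip/hip_runtime.h>

// Hypothetical per-chiplet work queue: a shared head counter plus a task
// array, one queue per chiplet, so claims never cross a die boundary.
struct ChipletQueue {
    int* head;       // next unclaimed task index (device memory)
    int  num_tasks;  // tasks enqueued for this chiplet
    int* task_ids;   // indices into a global task table
};

// Workload-specific body, omitted here.
__device__ void run_task(int task_id) {}

// Persistent worker: each workgroup stays resident and repeatedly claims
// tasks from the queue of the chiplet it is running on, so all workgroups
// on one chiplet stream the same weight tiles through the shared L2.
__global__ void persistent_worker(ChipletQueue* queues, int num_chiplets) {
    // Assumption: workgroups land round-robin across chiplets, so
    // blockIdx.x % num_chiplets identifies the local die. A real runtime
    // would query the hardware instead.
    ChipletQueue& q = queues[blockIdx.x % num_chiplets];

    while (true) {
        __shared__ int claimed;
        if (threadIdx.x == 0) claimed = atomicAdd(q.head, 1);  // one claim per workgroup
        __syncthreads();
        if (claimed >= q.num_tasks) return;  // queue drained: worker exits
        run_task(q.task_ids[claimed]);
        __syncthreads();
    }
}
```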
If this is right
- Persistent kernel execution plus per-chiplet scheduling produces 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8.
- Cooperative weight tiling inside Chiplet-tasks raises L2 hit rate from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64.
- The same tiling cuts HBM traffic by as much as 37% (a back-of-envelope consistency check follows this list).
- Overall, Fleet delivers 1.27-1.30x speedup over a chiplet-unaware megakernel baseline at larger batch sizes.
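A back-of-envelope consistency check on the hit-rate and traffic bullets, under a hypothetical model (not the paper's accounting) in which all tiled weight reads go through L2 and every miss becomes HBM traffic:

```cpp
#include <cstdio>

// Back-of-envelope check relating the reported L2 hit rates to HBM traffic.
// Simplifying assumptions (not from the paper): all tiled weight reads go
// through L2, every miss goes to HBM, and only a fraction of total HBM
// traffic comes from those weight reads.
int main() {
    const double miss_before = 1.0 - 0.12;  // 12% hit rate at batch 32, baseline
    const double miss_after  = 1.0 - 0.54;  // 54% hit rate with cooperative tiling

    // If all HBM traffic were L2 misses from the weight stream:
    const double ideal_cut = 1.0 - miss_after / miss_before;   // ~0.48

    // The paper reports "up to 37%"; back out what coverage would explain it.
    const double reported_cut    = 0.37;
    const double implied_coverage = reported_cut / ideal_cut;  // ~0.77

    std::printf("ideal traffic cut      : %.0f%%\n", ideal_cut * 100);
    std::printf("implied weight coverage: %.0f%% of HBM traffic\n",
                implied_coverage * 100);
}
```

The ideal cut of roughly 48% exceeds the reported "up to 37%", which is what one would expect if some HBM traffic (activations, KV cache) does not flow through the tiled weight path.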
Where Pith is reading between the lines
- The same Chiplet-task binding could be applied to other memory-bound workloads such as graph analytics or scientific stencil codes that already use megakernels.
- Future GPU programming interfaces may need to expose chiplet boundaries explicitly rather than leaving locality to opaque hardware schedulers.
- Implementations on other vendors' multi-die GPUs would test whether the reported L2-hit-rate gains are portable or specific to AMD's cache geometry.
- If the overhead of per-chiplet scheduling stays low, the approach could shift the performance limit for large-batch inference from memory bandwidth to on-chip reuse.
Load-bearing premise
LLM inference remains memory-bound enough that the gains from chiplet-level cache locality exceed the added cost of the persistent runtime and per-chiplet scheduler.
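A rough roofline-style argument for why the premise is plausible, using assumed numbers that are not from the paper: about 8 billion parameters at 2 bytes each, about 2 FLOPs per parameter per generated token, and a machine balance of a few hundred FLOPs per byte.

```cpp
#include <cstdio>

// Rough roofline-style argument for why decode stays memory-bound.
// Assumptions (not from the paper): ~8e9 parameters at 2 bytes each, ~2 FLOPs
// per parameter per generated token, and a machine balance of a few hundred
// FLOPs per byte for a modern datacenter GPU.
int main() {
    const double params          = 8e9;
    const double bytes_per_param = 2.0;                         // bf16 weights
    const double weight_bytes    = params * bytes_per_param;    // ~16 GB per step
    const double flops_per_token = 2.0 * params;                // matmul FLOPs per token

    const double machine_balance = 300.0;                       // FLOPs per byte, assumed

    for (int batch = 1; batch <= 256; batch *= 4) {
        // Weights are read once per step regardless of batch, so intensity
        // grows roughly linearly with batch size.
        const double intensity = (flops_per_token * batch) / weight_bytes;
        std::printf("batch %3d: ~%.0f FLOPs/byte (%s)\n", batch, intensity,
                    intensity < machine_balance ? "memory-bound" : "compute-bound");
    }
}
```

In this model the arithmetic intensity is roughly equal to the batch size, so decode remains bandwidth-limited until batches reach the hundreds, which is what makes chiplet-level cache reuse a plausible lever.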
What would settle it
A run of the same Qwen3-8B decode workload at batch size 32 on the MI350 that shows no rise in L2 hit rate above the 12% baseline or no drop in HBM traffic when Fleet is enabled would falsify the core performance claim.
Original abstract
Modern GPUs adopt chiplet-based designs with multiple private cache hierarchies, but current programming models (CUDA/HIP) expose a flat execution hierarchy that cannot express chiplet-level locality or synchronization. This mismatch leads to redundant memory traffic and poor cache utilization in memory-bound workloads such as LLM inference. We present Fleet, a multi-level task model that maps computation to memory scopes. Fleet introduces Chiplet-tasks, a new abstraction that binds work and data to a chiplet and enables coordination through its shared L2 cache. Wavefront-level, CU-level, and device-level tasks align with existing abstractions, while Chiplet-tasks expose a previously unaddressed level of the hierarchy. Fleet is implemented as a persistent kernel runtime with per-chiplet scheduling, allowing workers within a chiplet to cooperatively execute tasks with coordinated cache reuse. On AMD Instinct MI350 with Qwen3-8B, Fleet achieves 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 through persistent kernel execution and per-chiplet scheduling. At larger batch sizes, cooperative weight tiling increases L2 hit rate (from 12% to 54% at batch size 32 and from 39% to 61% at batch size 64), reducing HBM traffic by up to 37% and delivering 1.27-1.30x speedup over a chiplet-unaware megakernel baseline.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Fleet, a hierarchical task-based abstraction for megakernels on multi-die GPUs. It defines Chiplet-tasks to bind work and data to chiplets for coordinated L2 cache reuse, alongside wavefront-, CU-, and device-level tasks. Implemented as a persistent kernel runtime with per-chiplet scheduling, Fleet is evaluated on AMD Instinct MI350 using Qwen3-8B, claiming 1.3-1.5x lower decode latency than vLLM at batch sizes 1-8 and 1.27-1.30x speedup at larger batches via improved L2 hit rates (12% to 54% at batch 32; 39% to 61% at batch 64) and up to 37% reduced HBM traffic.
Significance. If substantiated with complete methodology, the work is significant for computer architecture: it directly addresses the mismatch between flat CUDA/HIP models and chiplet-based GPU memory hierarchies in memory-bound LLM inference. The use of real hardware measurements on AMD MI350 and the introduction of a new chiplet-level abstraction provide a practical step toward better locality exploitation, with potential to inform future programming models and runtime systems.
major comments (3)
- [Abstract] The central claims of 1.3-1.5x lower decode latency at batch sizes 1-8 and 1.27-1.30x speedup at larger batches are presented without error bars, full experimental methodology, baseline implementation details (e.g., how the chiplet-unaware megakernel baseline was constructed), or variability data. This leaves the performance improvements only partially supported and makes it difficult to isolate the contribution of per-chiplet scheduling from other factors.
- [Abstract] No breakdown or measurement of the runtime overhead of the persistent kernel and per-chiplet coordination is provided. This is load-bearing for the small-batch latency claim, as any added costs from task mapping or chiplet-level synchronization could offset the reported gains in latency-sensitive decode workloads.
- [Abstract] The L2 hit-rate improvements (12% to 54% at batch size 32) and HBM traffic reduction (up to 37%) are stated without describing the measurement methodology or confirming that the cooperative weight tiling is the sole cause, rather than other unmentioned optimizations.
minor comments (1)
- [Abstract] The term 'megakernels' is used without a brief definition or reference, though it is central to the contribution and may not be universally understood.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying existing content in the full paper and committing to revisions that improve clarity and completeness of the experimental presentation.
Point-by-point responses
- Referee: [Abstract] The central claims of 1.3-1.5x lower decode latency at batch sizes 1-8 and 1.27-1.30x speedup at larger batches are presented without error bars, full experimental methodology, baseline implementation details (e.g., how the chiplet-unaware megakernel baseline was constructed), or variability data. This leaves the performance improvements only partially supported and makes it difficult to isolate the contribution of per-chiplet scheduling from other factors.
  Authors: The abstract summarizes key results at a high level, while the full experimental methodology, baseline construction (a persistent megakernel without Chiplet-task binding or per-chiplet scheduling), and variability data appear in Section 5. We will revise the abstract to briefly describe the baseline and add error bars from repeated runs to the reported speedups. This will more clearly isolate the contribution of per-chiplet scheduling. revision: yes
- Referee: [Abstract] No breakdown or measurement of the runtime overhead of the persistent kernel and per-chiplet coordination is provided. This is load-bearing for the small-batch latency claim, as any added costs from task mapping or chiplet-level synchronization could offset the reported gains in latency-sensitive decode workloads.
  Authors: We agree an explicit overhead breakdown strengthens the small-batch claims. The revised manuscript will add profiling measurements (using AMD tools) of persistent kernel launch, task mapping, and chiplet coordination overheads in the evaluation section, showing these costs are small relative to the locality-driven latency reductions (a launch-overhead timing sketch follows these responses). revision: yes
- Referee: [Abstract] The L2 hit-rate improvements (12% to 54% at batch size 32) and HBM traffic reduction (up to 37%) are stated without describing the measurement methodology or confirming that the cooperative weight tiling is the sole cause, rather than other unmentioned optimizations.
  Authors: L2 hit rates were obtained via hardware performance counters and HBM traffic via memory controller statistics, as described in the evaluation section. We will update the abstract to reference this methodology and note that ablation studies isolate cooperative weight tiling (via Chiplet-tasks) as the primary cause, with no other unmentioned optimizations applied beyond the stated megakernel baseline. revision: partial
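The second response commits to profiling launch, task-mapping, and coordination costs with AMD tools. As a point of reference only, a minimal HIP-event harness like the one below (an assumed approach, not the authors' methodology) can bound the per-launch cost that a persistent kernel amortizes; the task-mapping and chiplet-coordination components would still need hardware counters or tracing.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Minimal sketch of how per-launch overhead (the cost a persistent kernel
// amortizes away) could be bounded with HIP events. This is not the paper's
// methodology; the authors' overhead breakdown would come from AMD tools.
__global__ void empty_kernel() {}

int main() {
    const int iters = 1000;
    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    // Warm up so the first-launch cost does not skew the average.
    empty_kernel<<<1, 64>>>();
    hipDeviceSynchronize();

    hipEventRecord(start, 0);
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 64>>>();   // one launch per "step"
    }
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);
    std::printf("avg launch + dispatch cost: %.1f us per kernel\n",
                1000.0f * ms / iters);

    hipEventDestroy(start);
    hipEventDestroy(stop);
}
```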
Circularity Check
No circularity: performance claims are direct hardware measurements with no derivation chain
full rationale
The paper introduces Fleet as a hierarchical task abstraction and persistent kernel runtime for chiplet-based GPUs, then reports empirical speedups (1.3-1.5x lower latency at small batches, 1.27-1.30x at larger batches) measured on AMD MI350 hardware against vLLM and a chiplet-unaware baseline. No equations, first-principles predictions, fitted parameters, or self-citations are invoked to derive results; the central claims rest on observed L2 hit rates, HBM traffic reduction, and latency numbers obtained from execution. The derivation chain is therefore self-contained and non-circular by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Chiplet-based GPUs expose private cache hierarchies with a shared L2 cache within each chiplet.
invented entities (1)
- Chiplet-tasks (no independent evidence)