arxiv: 2604.23553 · v1 · submitted 2026-04-26 · 💻 cs.DC

Recognition: unknown

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

ChiHeng Jin , Hongche Yu , Xihui Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-08 05:30 UTC · model grok-4.3

classification 💻 cs.DC

keywords LLM decodingoperator fusionCUDA kernelsTransformer blockthroughput optimizationGPU memory trafficCUDA graphsPythia models

0 comments

The pith

ClusterFusion++ fuses every operator in a Transformer decoder block into one CUDA sequence, delivering 1.34 times higher throughput on Pythia models while keeping token output nearly identical.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the repeated kernel launches and off-chip tensor writes that slow down autoregressive LLM generation. It extends thread-block cluster collectives so that LayerNorm, QKV projection, RoPE, attention, output projection, Post-LN, MLP, and residual addition all execute inside a single persistent kernel. Because intermediates stay on-chip and the launch sequence fits inside a CUDA graph, the approach reduces both memory traffic and per-step overhead. If the method holds for larger models and other decoder families, the same hardware could produce more tokens per second without any change to model weights or training procedure.

Core claim

ClusterFusion++ is a CUDA-level implementation that fuses the complete GPT-NeoX/Pythia decoder block (LayerNorm to residual) using on-chip inter-block communication and persistent TMA descriptors for graph compatibility. On RTX 5090-class hardware the fused path yields 1.34 times higher throughput for Pythia-2.8B and comparable gains for Pythia-6.9B, with generation differing from the reference implementation only by the expected non-determinism of FP16 atomics.

What carries the argument

Full-block cluster fusion realized as a single persistent CUDA kernel that performs the entire LayerNorm-QKV-RoPE-attention-output-Post-LN-MLP-residual sequence with on-chip collectives.

If this is right

Intermediate tensors no longer leave the GPU's on-chip memory hierarchy, directly lowering memory bandwidth demand per generated token.
The same fusion ordering can be reused for any decoder-only model that follows the GPT-NeoX operator sequence.
CUDA-graph capture becomes possible for the entire block, eliminating repeated kernel-launch costs on every decoding step.
Numerical drift remains small enough that downstream accuracy metrics are unaffected under standard FP16 evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware with larger shared memory or faster inter-block links would amplify the observed speed-up because the method already saturates on-chip data movement.
The pattern of full-block fusion could be tested on encoder-decoder or mixture-of-experts layers where operator fragmentation is also severe.
If the technique generalizes to multi-GPU inference, the per-GPU token rate increase would translate directly into higher serving capacity.

Load-bearing premise

The complete sequence of fused operators can be coded in CUDA without producing enough numerical drift to alter token selection or preventing the kernel from being recorded inside a CUDA graph.

What would settle it

Execute the fused and reference implementations on identical prompts and random seeds; if the generated token sequences diverge beyond the small variation already known from FP16 atomics, the fidelity claim is false.

Figures

Figures reproduced from arXiv: 2604.23553 by ChiHeng Jin, Hongche Yu, Xihui Chen.

**Figure 1.** Figure 1: Time per output token (TPOT) of Pythia-2.8B (left) view at source ↗

read the original abstract

Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusion scope by leveraging thread-block clusters and on-chip inter-block collectives to fuse attention-side operators such as QKV projection, attention, and output projection. We develop ClusterFusion++, a CUDA-level extension that broadens fusion to the full Transformer decoder block for GPT-NeoX/Pythia models: LayerNorm -> QKV -> RoPE -> decode attention -> output projection -> Post-LN -> MLP -> residual. We additionally engineer a CUDA-Graph-compatible execution mode with persistent Tensor Memory Accelerator (TMA) descriptors to reduce per-step overhead. On an NVIDIA RTX 5090-class GPU, ClusterFusion++ improves throughput by 1.34x for Pythia-2.8B and yields similar gains for Pythia-6.9B, while maintaining high output fidelity (near-token-identical generation, with minor non-determinism from FP16 atomics).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ClusterFusion++ shows a real 1.34x throughput gain on consumer GPUs by fusing the full Pythia decoder block, but the writeup is thin on measurement details and numerical verification.

read the letter

The core result here is a practical 34% inference speedup on RTX 5090-class hardware for Pythia-2.8B and similar gains on the 6.9B variant, achieved by fusing the entire decoder block from LayerNorm through the residual add. They keep output close to the unfused version, with only minor non-determinism from FP16 atomics. That is the one thing a colleague should take away first: this is an engineering win on latency and cost for deployment, not a change in model behavior or theory.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces ClusterFusion++, which extends prior cluster-level fusion techniques to encompass the entire Transformer decoder block for GPT-NeoX/Pythia-style models. The fusion sequence includes LayerNorm, QKV projection, RoPE, decode attention, output projection, Post-LN, MLP, and residual connections, implemented at the CUDA level with support for CUDA graphs using persistent TMA descriptors. Empirical evaluation on an RTX 5090-class GPU demonstrates a 1.34x throughput improvement for the Pythia-2.8B model and comparable gains for Pythia-6.9B, while preserving near-token-identical generation fidelity aside from minor non-determinism due to FP16 atomics.

Significance. If the reported speedups and fidelity hold under rigorous verification, this work would represent a meaningful advance in optimizing LLM inference latency by minimizing kernel launch overheads and intermediate tensor materializations. The emphasis on CUDA-graph compatibility and full-block fusion addresses practical deployment concerns in high-throughput decoding scenarios. The empirical nature of the contribution, grounded in hardware measurements, provides concrete evidence of the technique's viability on modern GPU architectures.

major comments (1)

[Abstract] Abstract: The claim of 1.34x throughput improvement for Pythia-2.8B (and similar gains for Pythia-6.9B) lacks details on the baseline implementation, input configurations (batch size, sequence length), throughput measurement methodology, and error analysis. These omissions make it difficult to verify the apples-to-apples nature of the speedup and undermine reproducibility of the central empirical result.

minor comments (2)

The abstract states 'near-token-identical generation' but provides no quantitative fidelity metrics (e.g., token match rate or perplexity delta) across multiple runs; adding such data would clarify the impact of FP16 atomics non-determinism.
Consider including a short limitations paragraph addressing applicability beyond Pythia models or potential CUDA-graph compatibility issues on other hardware.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: The claim of 1.34x throughput improvement for Pythia-2.8B (and similar gains for Pythia-6.9B) lacks details on the baseline implementation, input configurations (batch size, sequence length), throughput measurement methodology, and error analysis. These omissions make it difficult to verify the apples-to-apples nature of the speedup and undermine reproducibility of the central empirical result.

Authors: We agree that the abstract would benefit from additional context to improve reproducibility. The baseline is the standard PyTorch implementation using FlashAttention-2, the input configuration is batch size 1 with prompt length 128 and generation length 2048, throughput is measured as average tokens per second over 500 steps, and fidelity is quantified via token-level match rate (>99.9%) with noted FP16 non-determinism. These details appear in Sections 4.1–4.2. In the revised manuscript we will add a concise summary of this setup directly to the abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claim is an empirical engineering result: a CUDA implementation of full Transformer-block operator fusion (LayerNorm through residual) with CUDA-graph and TMA optimizations, validated by throughput measurements (1.34x on RTX 5090-class hardware for Pythia models) and fidelity checks. No equations, derivations, fitted parameters, or self-citations appear in the provided abstract or description; the reported gains are direct runtime observations rather than any self-referential prediction or definition. The work is self-contained against external benchmarks via reproduction on the stated hardware.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

No new mathematical entities or free parameters are introduced; the work relies on standard NVIDIA CUDA and GPU hardware assumptions.

axioms (1)

domain assumption NVIDIA GPU hardware supports thread-block clusters and persistent TMA descriptors as described.
Implementation depends on these features being available and behaving as expected on RTX 5090-class devices.

pith-pipeline@v0.9.0 · 5484 in / 1155 out tokens · 43040 ms · 2026-05-08T05:30:04.144582+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 7 canonical work pages · 3 internal anchors

[1]

ClusterFusion: Expanding operator fusion scope for LLM inference via cluster-level collective primitive

Xinhao Luo, Zihan Liu, Yangjie Zhou, Shihan Fang, Ziyu Huang, Yu Feng, Chen Zhang, Shixuan Sun, Zhenzhe Zheng, Jingwen Leng, and Minyi Guo. ClusterFusion: Expanding operator fusion scope for LLM inference via cluster-level collective primitive. arXiv preprint arXiv:2508.18850, 2025

work page arXiv 2025
[2]

Gomez, ukasz Kaiser, and Illia Polosukhin

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017

2017
[3]

Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . FlashAttention: Fast and memory-efficient exact attention with IO -awareness. In NeurIPS, 2022

2022
[4]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Jared LeGresley, Patrick Casper, and Bryan Catanzaro. Megatron- LM : Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review arXiv 1909
[5]

A contrastive framework for neural text generation

Yixuan Su, Tian Lan, Yan Wang, Dani Yogatama, Lingpeng Kong, and Nigel Collier. A contrastive framework for neural text generation. In Advances in Neural Information Processing Systems, volume 35, pages 21548--21561, 2022

2022
[6]

arXiv preprint arXiv:2404.00971 , year=

Fang Liu, Yang Liu, Lin Shi, Houkun Huang, Ruifeng Wang, Zhen Yang, Li Zhang, Zhongqi Li, and Yuchi Ma. Exploring and evaluating hallucinations in LLM -powered code generation. arXiv preprint arXiv:2404.00971, 2024

work page arXiv 2024
[7]

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement

An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu, Jingren Zhou, Junyang Lin, Keming Lu, Mingfeng Xue, Runji Lin, Tianyu Liu, Xingzhang Ren, and Zhenru Zhang. Qwen2.5-math technical report: Toward mathematical expert model via self-improvement. arXiv preprint arXiv:2409.12122, 2024

work page internal anchor Pith review arXiv 2024
[8]

Flash-decoding for long-context inference

Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-decoding for long-context inference. https://crfm.stanford.edu/2023/10/12/flashdecoding.html, 2023

2023
[9]

Pyramid- infer: Pyramid kv cache compression for high-throughput llm inference

Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid KV cache compression for high-throughput LLM inference. arXiv preprint arXiv:2405.12532, 2024

work page arXiv 2024
[10]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebr \'o n, and Sumit Sanghai. GQA : Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

work page internal anchor Pith review arXiv 2023
[11]

Layer-condensed KV cache for efficient inference of large language models

Haoyi Wu and Kewei Tu. Layer-condensed KV cache for efficient inference of large language models. arXiv preprint arXiv:2405.10637, 2024

work page arXiv 2024
[12]

Tensorrt- LLM

NVIDIA . Tensorrt- LLM . https://github.com/NVIDIA/TensorRT-LLM, 2024

2024
[13]

Flashdecoding++: Faster large language model inference with asynchronization, flat GEMM optimization, and heuristics

Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, and Yu Wang. Flashdecoding++: Faster large language model inference with asynchronization, flat GEMM optimization, and heuristics. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16, 2024. mlsy...

2024
[14]

Flashmla: Efficient MLA decoding kernels

Shengyu Liu and Jiashi Li. Flashmla: Efficient MLA decoding kernels. https://github.com/deepseek-ai/FlashMLA, 2025

2025
[15]

NVIDIA hopper architecture

NVIDIA . NVIDIA hopper architecture. https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/, 2024

2024
[16]

Pythia: A suite for analyzing large language models across training and scaling

Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O'Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, and others. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397--2430, 2023

2023