arxiv: 2601.20595 · v3 · submitted 2026-01-28 · 💻 cs.DC

Syncopate: Efficient Multi-GPU AI Kernels via Automatic Chunk-Centric Compute-Communication Overlap

Xinwei Qiang, Yue Guan, Zhengding Hu, Keren Zhou, Yufei Ding, Adnan Aziz

Pith reviewed 2026-05-16 10:03 UTC · model grok-4.3

classification 💻 cs.DC

keywords multi-GPU · compute-communication overlap · chunk abstraction · kernel fusion · Triton compiler · distributed AI kernels · performance optimization

The pith

Syncopate overlaps compute and communication at chunk granularity inside a single fused kernel for multi-GPU AI workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the growing communication bottleneck in large-scale GPU workloads where whole-kernel overlap at the stream level wastes time through extra launches, forced synchronizations, and slack from slow tiles. Syncopate instead introduces a chunk abstraction that lets the compiler rearrange a local kernel so computation aligns with the arrival of smaller data pieces. If the transformations succeed, the fused kernel hides more latency without changing the original program semantics. This matters for scaling AI training and inference because it improves hardware utilization on existing multi-GPU systems. The approach is realized as a source-to-source pass on Triton and reports concrete speedups.
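To make the benefit of finer-grained overlap concrete, here is a toy timing model. It is only a sketch under stated assumptions: the function names, the per-chunk times, and the rule that a chunk's compute can start once that chunk has arrived and the previous compute has finished are all illustrative and are not taken from the paper.

```python
# Toy timing model (illustrative assumption, not the paper's cost model).
# comm[i] is the transfer time of chunk i, comp[i] the compute time on it.

def serialized_time(comm, comp):
    """Wait for every chunk to arrive, then compute on all of them."""
    return sum(comm) + sum(comp)


def chunk_overlap_time(comm, comp):
    """Start computing on chunk i as soon as it has arrived.

    Transfers are issued back to back; compute on chunk i begins once the
    chunk is available and the compute on chunk i-1 has finished.
    """
    arrival = 0.0   # time at which the current chunk is fully received
    done = 0.0      # time at which compute on the previous chunk finished
    for c, p in zip(comm, comp):
        arrival += c
        done = max(done, arrival) + p
    return done


# Hypothetical numbers: four chunks, 2 time units to transfer, 3 to compute.
comm = [2.0, 2.0, 2.0, 2.0]
comp = [3.0, 3.0, 3.0, 3.0]

print(serialized_time(comm, comp))     # 20.0
print(chunk_overlap_time(comm, comp))  # 14.0
```

In this model only the first transfer is exposed; every later transfer is hidden behind compute on an earlier chunk, which is the effect the summary above attributes to chunk-granularity overlap.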

Core claim

Given a local Triton kernel and a chunk schedule, Syncopate performs automatic transformations that fuse operations and overlap communication at chunk boundaries inside one kernel, decoupling the communication plan from kernel structure and backend details.

What carries the argument

The communication chunk abstraction that decouples communication granularity from kernel structure, allowing plans to be supplied from compilers, users, or templates and then aligned with computation.
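One way to picture a chunk plan that is independent of the kernel's tiling is a plain list of chunk records plus a lookup from tile ranges to the chunks they touch. The sketch below is hypothetical: `Chunk`, `make_schedule`, and `chunks_needed_by_tile` are names invented for illustration and do not reflect Syncopate's actual schedule format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record: a row range moved between two ranks."""
    chunk_id: int
    src_rank: int
    dst_rank: int
    row_start: int
    row_end: int  # exclusive


def make_schedule(total_rows, chunk_rows, src_rank, dst_rank):
    """Split a tensor's rows into fixed-size chunks, independent of tiling."""
    return [
        Chunk(cid, src_rank, dst_rank, start, min(start + chunk_rows, total_rows))
        for cid, start in enumerate(range(0, total_rows, chunk_rows))
    ]


def chunks_needed_by_tile(schedule, tile_start, tile_end):
    """Return ids of chunks overlapping the tile's rows [tile_start, tile_end)."""
    return [
        c.chunk_id
        for c in schedule
        if c.row_start < tile_end and tile_start < c.row_end
    ]


schedule = make_schedule(total_rows=1024, chunk_rows=256, src_rank=0, dst_rank=1)

# A 128-row tile that straddles a chunk boundary depends on two chunks.
print(chunks_needed_by_tile(schedule, 192, 320))  # [0, 1]
# A tile aligned to a chunk depends on exactly one.
print(chunks_needed_by_tile(schedule, 256, 384))  # [1]
```

Because the tile size appears only in the lookup, the same schedule can be reused with a different tiling, which is the kind of decoupling described above.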

If this is right

Fewer kernel launches and device-wide synchronizations at boundaries.
Chunk plans become portable across kernels without rewriting communication code.
Slack from the slowest tile or kernel is reduced because overlap happens at finer scale.
Average end-to-end speedup of 1.3x and up to 4.7x on multi-GPU workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same chunk abstraction could be adapted to other GPU languages if the transformation engine is reimplemented.
Automatic generation of chunk schedules might further reduce the need for manual input.
Benefits are likely to grow with larger models where the ratio of communication to compute increases.

Load-bearing premise

Chunk schedules can be supplied without introducing correctness errors or excessive transformation overhead while the resulting fused kernel preserves original semantics on real hardware.

What would settle it

Execute the transformed fused kernel alongside the original version on multi-GPU hardware and check whether outputs match exactly while total runtime decreases due to better overlap.
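A minimal harness for this kind of check might look like the following. It is a sketch only: the two row-sum functions are CPU stand-ins for the original and transformed kernels, and `compare` is a hypothetical helper. On real GPUs the kernels would need to be synchronized before each timer read, since kernel launches are asynchronous.

```python
import time


def reference_row_sums(rows):
    """Stand-in for the original kernel: process all rows in one pass."""
    return [sum(r) for r in rows]


def chunked_row_sums(rows, chunk_rows=2):
    """Stand-in for a transformed kernel: same work, done chunk by chunk."""
    out = []
    for start in range(0, len(rows), chunk_rows):
        out.extend(sum(r) for r in rows[start:start + chunk_rows])
    return out


def compare(reference, candidate, args, repeats=5):
    """Check exact output equality and record best-of-N wall-clock times."""
    ref_out = reference(*args)
    cand_out = candidate(*args)

    def best_time(fn):
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - t0)
        return best

    return {
        "outputs_match": ref_out == cand_out,
        "reference_s": best_time(reference),
        "candidate_s": best_time(candidate),
    }


# Integer inputs so that exact equality is a meaningful test here.
rows = [[i * j for j in range(8)] for i in range(6)]
report = compare(reference_row_sums, chunked_row_sums, (rows,))
print(report["outputs_match"])
```

Exact equality is used because the check above asks for exactly matching outputs; with floating-point reductions a reordered computation may differ in the last bits, so a real test would have to state whether bitwise or tolerance-based matching is intended.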

Figures

Figures reproduced from arXiv: 2601.20595 by Adnan Aziz, Keren Zhou, Xinwei Qiang, Yue Guan, Yufei Ding, Zhengding Hu.

**Figure 1.** Motivating example of Syncopate. Red numbers show the direct improvements gained by fine-grained overlap over kernel-level overlap, while orange numbers show the additional improvements from the new design space enabled by Syncopate. …tion is motivated by a key observation: fine-grained overlap requires a communication granularity that flexibly matches how tiles generate data and how communication backends…

**Figure 2.** Motivation experiment results. (a) SM utilization under different GEMM sizes and tile sizes. (b) Performance…

**Figure 3.** System overview of Syncopate. …lightweight scheduling metadata (Listing 1). Programmers write kernels as if they were running on a single device, using standard Triton primitives for indexing, tiling, and tensor descriptors. Optional Syncopate annotations (e.g., axis counts, tile identifiers, and dispatch regions) identify the logical tiles and iteration structure but do not change the kernel’s semantics.…

**Figure 4.** Communication schedule abstraction. (a) and (b) illustrate the same point-to-point exchange expressed as push and pull…

**Figure 5.** Compilation pipeline. In this example, we show communication using specialized SM as an independent kernel…

**Figure 6.** Tile scheduler transformation. (a) Computation and communication naturally follow different tile/chunk layouts,…

**Figure 7.** Communication backend selection. (a) Communica…

**Figure 8.** Performance comparison of GEMM operators optimized by…

**Figure 9.** Performance comparison of operators optimized by…

**Figure 10.** Evaluation lowering partitioned-based IR and loop…

**Figure 11.** Ablation and sensitivity studies of Syncopate’s auto-tuning design space. …few SMs underutilizes the link bandwidth, whereas allocating too many starves the main kernel. The optimal SM count also shifts with model size, which matches the design of our backend code generation: the autotuner treats SM allocation as a first-class knob and automatically selects a near-optimal point for each operator/hardware p…

Original abstract

Communication has become a first-order bottleneck in large-scale GPU workloads, and existing distributed compilers address it mainly by overlapping whole compute and communication kernels at the stream level. This coarse granularity incurs extra kernel launches, forces device-wide synchronizations at kernel boundaries, and leaves substantial slack when the slowest tile or kernel stretches the communication tail. We present Syncopate, a compiler and runtime that enables automatic fine-grained overlap inside a single fused kernel. Syncopate introduces a communication chunk abstraction that decouples communication granularity from kernel structure and backend mechanisms, allowing chunk-level plans to be ported from existing distributed compilers, written directly by users, or instantiated from reusable templates. Given a local Triton kernel and a chunk schedule, Syncopate performs transformations to align computation with chunk availability. Implemented as a source-to-source compiler on Triton, Syncopate delivers an average end-to-end speedup of 1.3× and up to 4.7× on multi-GPU workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Syncopate adds a chunk abstraction for intra-kernel compute-communication overlap in Triton-based multi-GPU kernels and reports 1.3x average speedups, but the abstract gives too little on workloads and measurements to judge how solid the gains are.

Syncopate's main contribution is the communication chunk abstraction that lets a source-to-source compiler align compute tiles with arriving data inside one fused kernel instead of relying on stream-level overlap. This decouples granularity from kernel structure and lets schedules come from compilers, users, or templates, which is a practical engineering step for reducing slack in the communication tail on multi-GPU setups.

Referee Report

2 major / 2 minor

Summary. The paper introduces Syncopate, a source-to-source compiler and runtime built on Triton that enables automatic fine-grained compute-communication overlap inside a single fused kernel for multi-GPU AI workloads. It defines a communication chunk abstraction to decouple granularity from kernel structure, allowing chunk schedules to be supplied from compilers, users, or templates, then performs transformations to align computation with chunk availability. The central claim is an average end-to-end speedup of 1.3× and up to 4.7× on multi-GPU workloads compared to existing stream-level overlap approaches.

Significance. If the speedups are reproducible across diverse workloads with proper baselines, this work could meaningfully advance distributed compiler design by addressing communication slack at finer granularity than current kernel-boundary methods, reducing device-wide synchronizations and extra launches. The practical engineering focus on reusing existing chunk plans and Triton integration is a strength, though significance hinges on whether the chunk abstraction introduces hidden overheads or correctness risks on real hardware.

major comments (2)

[Evaluation section (implied by performance claims)] The abstract and evaluation claims (1.3× average, 4.7× peak) lack any description of workloads, hardware configuration, baselines, number of runs, or error bars, making it impossible to assess whether the reported speedups are load-bearing or sensitive to specific conditions.
[§3 (system design)] The central transformation relies on externally supplied chunk schedules preserving semantics; no formal argument or test is given showing that the fused kernel always matches the original multi-kernel behavior under chunk misalignment or varying tile sizes.

minor comments (2)

[§2] Notation for chunk schedules and the source-to-source passes could be clarified with a small example in pseudocode to show input Triton kernel versus transformed output.
[Evaluation] The paper should include a table comparing kernel launch counts and synchronization points before/after Syncopate to quantify the claimed reduction in overhead.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify clear gaps in the evaluation presentation and the rigor of the semantic claims in the system design. We respond to each major comment below and indicate the planned revisions.

Referee: The abstract and evaluation claims (1.3× average, 4.7× peak) lack any description of workloads, hardware configuration, baselines, number of runs, or error bars, making it impossible to assess whether the reported speedups are load-bearing or sensitive to specific conditions.

Authors: We agree that the current manuscript provides insufficient detail on the experimental methodology. In the revised version we will expand the evaluation section with explicit descriptions of the workloads (specific models, layer sizes, and input dimensions), hardware configurations (GPU models, counts, and interconnect), baselines (including stream-level overlap, NCCL, and other distributed kernels), the number of runs per measurement, and error bars or standard deviations. These additions will make the speedups reproducible and allow assessment of sensitivity to conditions.

revision: yes

Referee: The central transformation relies on externally supplied chunk schedules preserving semantics; no formal argument or test is given showing that the fused kernel always matches the original multi-kernel behavior under chunk misalignment or varying tile sizes.

Authors: The design assumes chunk schedules are supplied by trusted sources (upstream compilers, users, or templates) that derive from the original kernel structure, thereby preserving semantics by construction. However, we acknowledge the manuscript lacks an explicit argument or dedicated tests for misalignment and tile-size variation. We will revise §3 to include a detailed informal explanation of how the alignment transformations maintain equivalence, and add empirical tests in the evaluation demonstrating behavioral matching under different tile sizes and controlled misalignment cases.

revision: partial
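The first response promises run counts and error bars. As a minimal sketch of what that reporting could look like, the snippet below summarizes repeated timings with a mean and sample standard deviation. The timing values are made-up placeholders, not measurements from the paper, and `summarize` is a hypothetical helper.

```python
import statistics


def summarize(times_ms):
    """Return (mean, sample standard deviation) of repeated timings."""
    return statistics.mean(times_ms), statistics.stdev(times_ms)


# Placeholder timings in milliseconds (NOT results from the paper).
baseline_ms = [10.0, 10.2, 9.8]
overlapped_ms = [8.0, 8.1, 7.9]

base_mean, base_std = summarize(baseline_ms)
ovl_mean, ovl_std = summarize(overlapped_ms)
speedup = base_mean / ovl_mean

print(f"baseline   {base_mean:.2f} +/- {base_std:.2f} ms over {len(baseline_ms)} runs")
print(f"overlapped {ovl_mean:.2f} +/- {ovl_std:.2f} ms over {len(overlapped_ms)} runs")
print(f"speedup    {speedup:.2f}x")
```

Reporting the spread next to each mean lets a reader judge whether a gain near the 1.3× average is larger than run-to-run noise.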

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an engineering compiler system (source-to-source transformations on Triton) that fuses compute with externally supplied chunk schedules for fine-grained overlap. No equations, fitted parameters, predictions, or self-citation chains appear in the abstract or described mechanism. The central claims are empirical speedups from implementation, not a derivation that reduces to its own inputs by construction. The system is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that Triton kernels admit semantics-preserving source-to-source rewrites and that chunk schedules can be provided externally without introducing new free parameters or invented hardware entities.

axioms (1)

domain assumption: Triton kernel semantics are preserved under the described chunk-aligned transformations
Required for the compiler to produce correct fused kernels; invoked implicitly when claiming the approach works on existing Triton code.

invented entities (1)

communication chunk (no independent evidence)
purpose: Decouple communication granularity from kernel structure and backend mechanisms
New abstraction introduced to enable fine-grained overlap plans; no independent evidence outside the paper is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Syncopate introduces a communication chunk abstraction that decouples communication granularity from kernel structure... Given a local Triton kernel and a chunk schedule, Syncopate performs transformations to align computation with chunk availability.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
