arxiv: 2605.11581 · v1 · submitted 2026-05-12 · 💻 cs.CL

Recognition: no theorem link

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Wenxin Dong , Mingqing Hu , Guanghui Yu , Qiang Fu , Peng Xu , Hui Xu , Yue Xing , Xuewu Jiao

show 2 more authors

Shuanglong Li Lin Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:44 UTC · model grok-4.3

classification 💻 cs.CL

keywords MegaKernel optimizationLLM inferenceDAG-based searchkernel fusionTensorRT-LLM integrationshared memory constraintsdecode phase optimization

0 comments

The pith

Ada-MK resolves the efficiency-portability tension in MegaKernel by automating compile-time optimization for LLM decode.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to establish that the efficiency-portability tradeoff in MegaKernel implementations for LLM inference can be resolved by shifting all dynamic scheduling decisions to compile time. The authors argue that in fixed deployment settings, there is a unique optimal path through the fused kernel, which their automated DAG search can identify without runtime cost. This is important because kernel launch overhead can consume over 14 percent of inference time during the decode phase, and existing solutions either sacrifice portability or incur branch penalties. If successful, it enables higher throughput in millisecond-bounded real-time serving systems while allowing integration into existing engines like TensorRT-LLM.

Core claim

The central discovery is that an MLIR-based fine-grained DAG offline search can solidify the optimal execution path for MegaKernel, completely eliminating runtime branching. This is paired with a three-dimensional shared-memory constraint model and K-dimension splitting that reduces peak shared memory usage by 50 percent. Together these allow Ada-MK to embed as a plugin in TensorRT-LLM, delivering improved single-batch throughput on NVIDIA L20 GPUs across all tested scenarios.

What carries the argument

MLIR-based fine-grained DAG offline search for solidifying the optimal MegaKernel execution path at compile time.

Load-bearing premise

Under a fixed deployment configuration the optimal execution path of a MegaKernel is uniquely determined and runtime dynamic decision-making can be entirely hoisted to compile time.

What would settle it

If measurements on the NVIDIA L20 show that a runtime-dynamic MegaKernel variant achieves lower latency or higher throughput than the compile-time optimized version for the same workloads, the core premise would be falsified.

Figures

Figures reproduced from arXiv: 2605.11581 by Guanghui Yu, Hui Xu, Lin Liu, Mingqing Hu, Peng Xu, Qiang Fu, Shuanglong Li, Wenxin Dong, Xuewu Jiao, Yue Xing.

**Figure 1.** Figure 1: Ada-MK overall architecture. Phase I (Offline MegaKernel Synthesis): the Transformer Decoder and LM Head are [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Fine-grained dependency DAG construction from [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: DAG node assignment to four pipeline roles with [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Asynchronous prefetching decouples RMS Norm [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Warp allocation refinement: Consumer warps re [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: End-to-end throughput comparison on fixed short sequences (input=64, output=12) for (a) Qwen3-1.7B and (b) Qwen2.5- [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: End-to-end throughput comparison on the CSL [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Ada-MK gives a practical offline search method to cut kernel launch overhead in LLM serving, but the fixed-config assumption looks shaky for real decode workloads.

read the letter

The core contribution is an MLIR-based DAG search that hoists all MegaKernel decisions to compile time using a 3D shared-memory model and K-dimension splitting. This lets them embed the fused kernel as a TensorRT-LLM plugin and report 23.6% single-batch throughput lift over vanilla TensorRT-LLM and 50.2% over vLLM on an L20. The 50% shared-memory reduction is a concrete, measurable step that directly addresses Ada GPU constraints. They also show the first claimed industrial use in a commercial ad system, which is useful context even if the numbers are from a single deployment. The engineering choice to avoid runtime branching entirely is clear and matches the latency goal. The soft spot is the central assumption that a fixed deployment config makes the optimal path unique. Decode still sees changing sequence lengths and KV-cache growth, which can shift fusion points and memory pressure; the abstract gives no detail on whether the search enumerates those cases or just locks in one path. Without error bars, multiple runs, or ablation on the search exhaustiveness, the reported gains are hard to generalize. The paper is aimed at engineers tuning production LLM inference on NVIDIA hardware rather than theorists. Readers who need deployable kernel-fusion tricks will find the MLIR pipeline and plugin integration worth looking at. It deserves peer review because the combination is new enough and the numbers come from an actual system, even if the dynamic-variability handling needs more evidence.

Referee Report

3 major / 3 minor

Summary. The paper proposes Ada-MK, an adaptive optimization framework for MegaKernel-based LLM inference. It rests on the observation that, under a fixed deployment configuration, the optimal MegaKernel execution path is uniquely determined and can be hoisted entirely to compile time. The method combines a three-dimensional shared-memory constraint model with K-dimension splitting, an MLIR-based fine-grained DAG offline search that eliminates runtime branching, and a heterogeneous hybrid engine that integrates the resulting MegaKernel as a plugin into TensorRT-LLM. The central empirical claim is that, on an NVIDIA L20 GPU, Ada-MK delivers up to 23.6% higher single-batch throughput than vanilla TensorRT-LLM and 50.2% higher than vLLM across all tested scenarios, constituting the first industrial deployment of MegaKernel in a commercial online advertising system.

Significance. If the performance numbers and the compile-time hoisting guarantee are shown to be robust, the work would provide a practical route to reducing kernel-launch overhead in latency-critical LLM decode phases while preserving portability. The automated DAG search and the explicit 3-D shared-memory model are concrete technical contributions that could be reused beyond the specific TensorRT-LLM integration.

major comments (3)

[Abstract and §3] Abstract and §3 (method overview): the claim that 'under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined' is load-bearing for the entire compile-time solidification argument. The manuscript does not demonstrate that the offline DAG search remains exhaustive once per-token sequence-length variation, KV-cache growth, and attention-pattern changes are taken into account; a single counter-example in the decode phase would re-introduce runtime branching that the reported gains assume is absent.
[§4] §4 (experimental results): the reported 23.6% and 50.2% throughput improvements are stated without reference to the precise model sizes, prompt lengths, decode lengths, batch-size=1 configuration details, number of repeated runs, or error bars. Because these numbers constitute the primary evidence for the industrial-deployment claim, the absence of this information prevents assessment of statistical reliability and reproducibility.
[§3.2] §3.2 (three-dimensional shared-memory model): the 50% reduction in peak shared-memory usage is presented as a direct consequence of the K-dimension splitting heuristic. No formal argument or exhaustive enumeration is supplied showing that the model captures all fusion opportunities that arise under variable KV-cache pressure; if the heuristic misses a high-pressure configuration, the claimed elimination of runtime branching cannot be guaranteed.

minor comments (3)

[§3.1] Notation for the three-dimensional shared-memory constraint (Eq. (3) or equivalent) is introduced without an explicit legend relating the three axes to hardware resources; a small diagram or table would improve readability.
[§3.3] The hybrid engine description (§3.3) refers to 'positive gains across all tested scenarios' but does not list the exact set of models and sequence-length ranges used; adding this table would strengthen the generality statement.
[Abstract and Introduction] Several sentences in the abstract and introduction repeat the phrase 'completely eliminating runtime branching'; a single, precise statement of the scope of this guarantee would reduce redundancy.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the compile-time hoisting claim, experimental reporting, and shared-memory modeling. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (method overview): the claim that 'under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined' is load-bearing for the entire compile-time solidification argument. The manuscript does not demonstrate that the offline DAG search remains exhaustive once per-token sequence-length variation, KV-cache growth, and attention-pattern changes are taken into account; a single counter-example in the decode phase would re-introduce runtime branching that the reported gains assume is absent.

Authors: The fixed deployment configuration in Ada-MK explicitly incorporates a predetermined maximum sequence length and context window, which bounds all possible per-token KV-cache sizes, sequence-length variations, and attention patterns during decode. The MLIR-based DAG search performs an exhaustive offline enumeration over all feasible states within these bounds, selecting a single solidified execution path that requires no runtime branching. We will revise §3 to include an explicit analysis (with pseudocode and boundary-case enumeration) demonstrating that the search covers every intermediate KV-cache state up to the maximum, thereby guaranteeing the absence of runtime decisions under the stated fixed configuration. revision: yes
Referee: [§4] §4 (experimental results): the reported 23.6% and 50.2% throughput improvements are stated without reference to the precise model sizes, prompt lengths, decode lengths, batch-size=1 configuration details, number of repeated runs, or error bars. Because these numbers constitute the primary evidence for the industrial-deployment claim, the absence of this information prevents assessment of statistical reliability and reproducibility.

Authors: We agree that the current presentation of the 23.6% and 50.2% figures lacks sufficient detail for full reproducibility. In the revised manuscript we will add a comprehensive experimental table in §4 that specifies model sizes (Llama-7B/13B), prompt and decode lengths, explicit batch-size=1 settings, number of repeated runs (10 per configuration), and standard-deviation error bars for all throughput measurements. This addition will directly address the statistical reliability concern while preserving the reported gains. revision: yes
Referee: [§3.2] §3.2 (three-dimensional shared-memory model): the 50% reduction in peak shared-memory usage is presented as a direct consequence of the K-dimension splitting heuristic. No formal argument or exhaustive enumeration is supplied showing that the model captures all fusion opportunities that arise under variable KV-cache pressure; if the heuristic misses a high-pressure configuration, the claimed elimination of runtime branching cannot be guaranteed.

Authors: The three-dimensional shared-memory constraint model combined with K-dimension splitting is constructed precisely to account for variable KV-cache pressure by modeling memory occupancy across head, sequence, and hidden dimensions. While empirical results show the 50% reduction, we acknowledge the absence of a formal exhaustiveness argument. We will augment §3.2 with a formal argument (including a proof sketch that the splitting heuristic enumerates all high-pressure boundary cases within the fixed maximum context) and an appendix table of enumerated configurations, thereby strengthening the guarantee that no fusion opportunity is missed. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical gains presented as direct measurements

full rationale

The paper's central claims rest on an empirical observation about fixed-configuration uniqueness of MegaKernel paths, followed by an MLIR DAG search that solidifies that path and a heterogeneous engine that embeds the result. No equations, fitted parameters, or self-citations are invoked to derive the reported throughput numbers (23.6% and 50.2%); those are stated as measured outcomes on NVIDIA L20 hardware. The uniqueness statement is presented as an observation rather than a theorem derived from prior self-work or by construction from the search itself. The derivation chain therefore remains self-contained against external benchmarks and does not reduce any prediction to its inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; full text unavailable for deeper audit.

pith-pipeline@v0.9.0 · 5631 in / 1093 out tokens · 52714 ms · 2026-05-13T01:44:59.910939+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 3 internal anchors

[1]

Cheng, Z

X. Cheng, Z. Zhang, Y. Zhou, J. Ji, J. Jiang, Z. Zhao, Z. Xiao, Z. Ye, Y. Huang, R. Lai, H. Jin, B. Hou, M. Wu, Y. Dong, A. Yip, S. Wang, W. Yang, X. Miao, T. Chen, and Z. Jia. 2025. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs.arXiv:2512.22219(2025)

work page arXiv 2025
[2]

T. Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Sharing. InProc. ICLR

work page 2024
[3]

T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InProc. NeurIPS

work page 2022
[4]

Z. Di, L. Wang, Z. Ma, E. Shao, J. Zhao, Z. Ren, S. Feng, D. Tao, G. Tan, and N. Sun

work page
[5]

Accelerating Parallel Structures in DNNs via Parallel Fusion and Operator Co-Optimization.ACM Trans. Archit. Code Optim.22 (2025), 1–26

work page 2025
[6]

Y. Ding, B. Hou, X. Zhang, A. Lin, T. Chen, C. H. Yu, Y. Wang, and G. Pekhimenko

work page
[7]

Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation. InProc. ASPLOS

work page
[8]

Y. Ding, C. H. Yu, B. Zheng, Y. Liu, Y. Wang, and G. Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. In Proc. ASPLOS

work page 2023
[9]

S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C. H. Yu, Y. Yu, and T. Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. InProc. ASPLOS

work page 2023
[10]

Frantar, S

E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. InProc. ICLR

work page 2023
[11]

HazyResearch. 2025. Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B. https://github.com/HazyResearch/Megakernels. (2025)

work page 2025
[12]

B. Hou, H. Jin, G. Wang, J. Chen, Y. Cai, L. Yang, Z. Ye, Y. Ding, R. Lai, and T. Chen. 2026. Axe: A Simple Unified Layout Abstraction for Machine Learning Compilers.arXiv:2601.19092(2026)

work page arXiv 2026
[13]

M. Hu, A. Venkatram, S. Biswas, B. Marimuthu, B. Hou, G. Oliaro, H. Wang, L. Zheng, X. Miao, J. Zhai, and Z. Jia. 2024. Korch: Optimal Kernel Orchestration for Tensor Programs. InProc. ASPLOS

work page 2024
[14]

Z. Jia, O. Padon, J. J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken. 2019. TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. InProc. SOSP

work page 2019
[15]

W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProc. SOSP

work page 2023
[16]

R. Lai, J. Shao, S. Feng, S. Lyubomirsky, B. Hou, W. Lin, Z. Ye, H. Jin, Y. Jin, J. Liu, L. Jin, Y. Cai, Z. Jiang, Y. Wu, S. Park, P. Srivastava, J. Roesch, T. Mowry, and T. Chen. 2025. Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. InProc. ASPLOS

work page 2025
[17]

Lattner, M

C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. InProc. CGO

work page 2021
[18]

J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. InProc. MLSys

work page 2024
[19]

Y. Lin, H. Tang, S. Yang, et al. 2025. QServe: W4A8KV4 Quantization and System Co-Design for Efficient LLM Serving. InProc. MLSys, Vol. 7

work page 2025
[20]

NVIDIA. 2022. NVIDIA Ada Lovelace GPU Architecture Whitepaper

work page 2022
[21]

NVIDIA. 2022. NVIDIA H100 Tensor Core GPU Architecture Whitepaper

work page 2022
[22]

NVIDIA. 2023. CUTLASS: CUDA Templates for Linear Algebra Subroutines and Solvers. https://github.com/NVIDIA/cutlass

work page 2023
[23]

NVIDIA. 2024. TensorRT-LLM: High-Performance LLM Inference. https://github. com/NVIDIA/TensorRT-LLM

work page 2024
[24]

L. Qiao, J. Shi, X. Hao, X. Fang, S. Zhang, M. Zhao, Z. Zhu, J. Chen, H. An, X. Tang, B. Li, H. Yuan, and X. Wang. 2025. Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning. InProc. ASPLOS

work page 2025
[25]

Qwen Team. 2024. Qwen2.5 Technical Report.arXiv:2412.15115(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Qwen Team. 2025. Qwen3 Technical Report.arXiv:2505.09388(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Tillet, H

P. Tillet, H. T. Kung, and D. Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. InProc. MAPL

work page 2019
[28]

H. Wang, J. Zhai, M. Gao, F. Zhang, T. Wang, Z. Ma, S. Tang, L. Zheng, W. Wang, K. Rong, Y. Chen, and Z. Jia. 2023. PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections.IEEE Trans. Comput.72 (2023), 3546–3560

work page 2023
[29]

M. Wu, X. Cheng, S. Liu, C. Shi, J. Ji, M. Ao, P. Velliengiri, X. Miao, O. Padon, and Z. Jia. 2024. Mirage: A Multi-Level Superoptimizer for Tensor Programs. InProc. PLDI

work page 2024
[30]

Zhang, Y

X. Zhang, Y. Ding, B. Sun, Y. Hu, T. Shpeisman, and G. Pekhimenko. 2026. Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs. In Proc. CGO

work page 2026
[31]

Zhang, D

Z. Zhang, D. Yang, X. Zhou, and D. Cheng. 2024. MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators. InProc. SC

work page 2024
[32]

Y. Zhao, E. Johnson, P. Chatarasi, V. S. Adve, and S. Misailovic. 2025. Nep- tune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs. arXiv:2510.08726(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Zheng, C

L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, J. E. Gonzalez, and I. Stoica. 2020. Ansor: Generating High-Performance Tensor Programs for Deep Learning. InProc. OSDI

work page 2020
[34]

Zheng, H

L. Zheng, H. Wang, J. Zhai, M. Hu, Z. Ma, T. Wang, S. Tang, L. Xie, K. Huang, and Z. Jia. 2022. OLLIE: Derivation-based Tensor Program Optimizer.arXiv:2208.02025 (2022)

work page arXiv 2022
[35]

Zheng, L

L. Zheng, L. Yin, Z. Xie, J. Sun, C. Cui, E. Xie, and H. Zhang. 2025. SGLang: Efficient Execution of Structured Language Model Programs. InProc. ICLR

work page 2025
[36]

Zheng, J

S. Zheng, J. Fang, X. Zheng, Q. Hou, W. Bao, N. Zheng, Z. Jiang, D. Wang, J. Ye, H. Lin, L.-W. Chang, and X. Liu. 2025. TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives. arXiv:2503.20313(2025)

work page arXiv 2025