Recognition: no theorem link
Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
Pith reviewed 2026-05-13 01:44 UTC · model grok-4.3
The pith
Ada-MK resolves the efficiency-portability tension in MegaKernel by automating compile-time optimization for LLM decode.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that an MLIR-based fine-grained DAG offline search can solidify the optimal execution path for MegaKernel, completely eliminating runtime branching. This is paired with a three-dimensional shared-memory constraint model and K-dimension splitting that reduces peak shared memory usage by 50 percent. Together these allow Ada-MK to embed as a plugin in TensorRT-LLM, delivering improved single-batch throughput on NVIDIA L20 GPUs across all tested scenarios.
What carries the argument
MLIR-based fine-grained DAG offline search for solidifying the optimal MegaKernel execution path at compile time.
Load-bearing premise
Under a fixed deployment configuration the optimal execution path of a MegaKernel is uniquely determined and runtime dynamic decision-making can be entirely hoisted to compile time.
What would settle it
If measurements on the NVIDIA L20 show that a runtime-dynamic MegaKernel variant achieves lower latency or higher throughput than the compile-time optimized version for the same workloads, the core premise would be falsified.
Figures
read the original abstract
When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios--the first industrial deployment of MegaKernel in a commercial online advertising system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Ada-MK, an adaptive optimization framework for MegaKernel-based LLM inference. It rests on the observation that, under a fixed deployment configuration, the optimal MegaKernel execution path is uniquely determined and can be hoisted entirely to compile time. The method combines a three-dimensional shared-memory constraint model with K-dimension splitting, an MLIR-based fine-grained DAG offline search that eliminates runtime branching, and a heterogeneous hybrid engine that integrates the resulting MegaKernel as a plugin into TensorRT-LLM. The central empirical claim is that, on an NVIDIA L20 GPU, Ada-MK delivers up to 23.6% higher single-batch throughput than vanilla TensorRT-LLM and 50.2% higher than vLLM across all tested scenarios, constituting the first industrial deployment of MegaKernel in a commercial online advertising system.
Significance. If the performance numbers and the compile-time hoisting guarantee are shown to be robust, the work would provide a practical route to reducing kernel-launch overhead in latency-critical LLM decode phases while preserving portability. The automated DAG search and the explicit 3-D shared-memory model are concrete technical contributions that could be reused beyond the specific TensorRT-LLM integration.
major comments (3)
- [Abstract and §3] Abstract and §3 (method overview): the claim that 'under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined' is load-bearing for the entire compile-time solidification argument. The manuscript does not demonstrate that the offline DAG search remains exhaustive once per-token sequence-length variation, KV-cache growth, and attention-pattern changes are taken into account; a single counter-example in the decode phase would re-introduce runtime branching that the reported gains assume is absent.
- [§4] §4 (experimental results): the reported 23.6% and 50.2% throughput improvements are stated without reference to the precise model sizes, prompt lengths, decode lengths, batch-size=1 configuration details, number of repeated runs, or error bars. Because these numbers constitute the primary evidence for the industrial-deployment claim, the absence of this information prevents assessment of statistical reliability and reproducibility.
- [§3.2] §3.2 (three-dimensional shared-memory model): the 50% reduction in peak shared-memory usage is presented as a direct consequence of the K-dimension splitting heuristic. No formal argument or exhaustive enumeration is supplied showing that the model captures all fusion opportunities that arise under variable KV-cache pressure; if the heuristic misses a high-pressure configuration, the claimed elimination of runtime branching cannot be guaranteed.
minor comments (3)
- [§3.1] Notation for the three-dimensional shared-memory constraint (Eq. (3) or equivalent) is introduced without an explicit legend relating the three axes to hardware resources; a small diagram or table would improve readability.
- [§3.3] The hybrid engine description (§3.3) refers to 'positive gains across all tested scenarios' but does not list the exact set of models and sequence-length ranges used; adding this table would strengthen the generality statement.
- [Abstract and Introduction] Several sentences in the abstract and introduction repeat the phrase 'completely eliminating runtime branching'; a single, precise statement of the scope of this guarantee would reduce redundancy.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the compile-time hoisting claim, experimental reporting, and shared-memory modeling. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method overview): the claim that 'under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined' is load-bearing for the entire compile-time solidification argument. The manuscript does not demonstrate that the offline DAG search remains exhaustive once per-token sequence-length variation, KV-cache growth, and attention-pattern changes are taken into account; a single counter-example in the decode phase would re-introduce runtime branching that the reported gains assume is absent.
Authors: The fixed deployment configuration in Ada-MK explicitly incorporates a predetermined maximum sequence length and context window, which bounds all possible per-token KV-cache sizes, sequence-length variations, and attention patterns during decode. The MLIR-based DAG search performs an exhaustive offline enumeration over all feasible states within these bounds, selecting a single solidified execution path that requires no runtime branching. We will revise §3 to include an explicit analysis (with pseudocode and boundary-case enumeration) demonstrating that the search covers every intermediate KV-cache state up to the maximum, thereby guaranteeing the absence of runtime decisions under the stated fixed configuration. revision: yes
-
Referee: [§4] §4 (experimental results): the reported 23.6% and 50.2% throughput improvements are stated without reference to the precise model sizes, prompt lengths, decode lengths, batch-size=1 configuration details, number of repeated runs, or error bars. Because these numbers constitute the primary evidence for the industrial-deployment claim, the absence of this information prevents assessment of statistical reliability and reproducibility.
Authors: We agree that the current presentation of the 23.6% and 50.2% figures lacks sufficient detail for full reproducibility. In the revised manuscript we will add a comprehensive experimental table in §4 that specifies model sizes (Llama-7B/13B), prompt and decode lengths, explicit batch-size=1 settings, number of repeated runs (10 per configuration), and standard-deviation error bars for all throughput measurements. This addition will directly address the statistical reliability concern while preserving the reported gains. revision: yes
-
Referee: [§3.2] §3.2 (three-dimensional shared-memory model): the 50% reduction in peak shared-memory usage is presented as a direct consequence of the K-dimension splitting heuristic. No formal argument or exhaustive enumeration is supplied showing that the model captures all fusion opportunities that arise under variable KV-cache pressure; if the heuristic misses a high-pressure configuration, the claimed elimination of runtime branching cannot be guaranteed.
Authors: The three-dimensional shared-memory constraint model combined with K-dimension splitting is constructed precisely to account for variable KV-cache pressure by modeling memory occupancy across head, sequence, and hidden dimensions. While empirical results show the 50% reduction, we acknowledge the absence of a formal exhaustiveness argument. We will augment §3.2 with a formal argument (including a proof sketch that the splitting heuristic enumerates all high-pressure boundary cases within the fixed maximum context) and an appendix table of enumerated configurations, thereby strengthening the guarantee that no fusion opportunity is missed. revision: partial
Circularity Check
No significant circularity; empirical gains presented as direct measurements
full rationale
The paper's central claims rest on an empirical observation about fixed-configuration uniqueness of MegaKernel paths, followed by an MLIR DAG search that solidifies that path and a heterogeneous engine that embeds the result. No equations, fitted parameters, or self-citations are invoked to derive the reported throughput numbers (23.6% and 50.2%); those are stated as measured outcomes on NVIDIA L20 hardware. The uniqueness statement is presented as an observation rather than a theorem derived from prior self-work or by construction from the search itself. The derivation chain therefore remains self-contained against external benchmarks and does not reduce any prediction to its inputs by definition.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
X. Cheng, Z. Zhang, Y. Zhou, J. Ji, J. Jiang, Z. Zhao, Z. Xiao, Z. Ye, Y. Huang, R. Lai, H. Jin, B. Hou, M. Wu, Y. Dong, A. Yip, S. Wang, W. Yang, X. Miao, T. Chen, and Z. Jia. 2025. Mirage Persistent Kernel: A Compiler and Runtime for Mega-Kernelizing Tensor Programs.arXiv:2512.22219(2025)
-
[2]
T. Dao. 2024. FlashAttention-2: Faster Attention with Better Parallelism and Work Sharing. InProc. ICLR
work page 2024
-
[3]
T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InProc. NeurIPS
work page 2022
-
[4]
Z. Di, L. Wang, Z. Ma, E. Shao, J. Zhao, Z. Ren, S. Feng, D. Tao, G. Tan, and N. Sun
-
[5]
Accelerating Parallel Structures in DNNs via Parallel Fusion and Operator Co-Optimization.ACM Trans. Archit. Code Optim.22 (2025), 1–26
work page 2025
-
[6]
Y. Ding, B. Hou, X. Zhang, A. Lin, T. Chen, C. H. Yu, Y. Wang, and G. Pekhimenko
-
[7]
Tilus: A Tile-Level GPGPU Programming Language for Low-Precision Computation. InProc. ASPLOS
-
[8]
Y. Ding, C. H. Yu, B. Zheng, Y. Liu, Y. Wang, and G. Pekhimenko. 2023. Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs. In Proc. ASPLOS
work page 2023
-
[9]
S. Feng, B. Hou, H. Jin, W. Lin, J. Shao, R. Lai, Z. Ye, L. Zheng, C. H. Yu, Y. Yu, and T. Chen. 2023. TensorIR: An Abstraction for Automatic Tensorized Program Optimization. InProc. ASPLOS
work page 2023
-
[10]
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. 2023. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. InProc. ICLR
work page 2023
-
[11]
HazyResearch. 2025. Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B. https://github.com/HazyResearch/Megakernels. (2025)
work page 2025
- [12]
-
[13]
M. Hu, A. Venkatram, S. Biswas, B. Marimuthu, B. Hou, G. Oliaro, H. Wang, L. Zheng, X. Miao, J. Zhai, and Z. Jia. 2024. Korch: Optimal Kernel Orchestration for Tensor Programs. InProc. ASPLOS
work page 2024
-
[14]
Z. Jia, O. Padon, J. J. Thomas, T. Warszawski, M. Zaharia, and A. Aiken. 2019. TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. InProc. SOSP
work page 2019
-
[15]
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProc. SOSP
work page 2023
-
[16]
R. Lai, J. Shao, S. Feng, S. Lyubomirsky, B. Hou, W. Lin, Z. Ye, H. Jin, Y. Jin, J. Liu, L. Jin, Y. Cai, Z. Jiang, Y. Wu, S. Park, P. Srivastava, J. Roesch, T. Mowry, and T. Chen. 2025. Relax: Composable Abstractions for End-to-End Dynamic Machine Learning. InProc. ASPLOS
work page 2025
-
[17]
C. Lattner, M. Amini, U. Bondhugula, A. Cohen, A. Davis, J. Pienaar, R. Riddle, T. Shpeisman, N. Vasilache, and O. Zinenko. 2021. MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. InProc. CGO
work page 2021
-
[18]
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. 2024. AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. InProc. MLSys
work page 2024
-
[19]
Y. Lin, H. Tang, S. Yang, et al. 2025. QServe: W4A8KV4 Quantization and System Co-Design for Efficient LLM Serving. InProc. MLSys, Vol. 7
work page 2025
-
[20]
NVIDIA. 2022. NVIDIA Ada Lovelace GPU Architecture Whitepaper
work page 2022
-
[21]
NVIDIA. 2022. NVIDIA H100 Tensor Core GPU Architecture Whitepaper
work page 2022
-
[22]
NVIDIA. 2023. CUTLASS: CUDA Templates for Linear Algebra Subroutines and Solvers. https://github.com/NVIDIA/cutlass
work page 2023
-
[23]
NVIDIA. 2024. TensorRT-LLM: High-Performance LLM Inference. https://github. com/NVIDIA/TensorRT-LLM
work page 2024
-
[24]
L. Qiao, J. Shi, X. Hao, X. Fang, S. Zhang, M. Zhao, Z. Zhu, J. Chen, H. An, X. Tang, B. Li, H. Yuan, and X. Wang. 2025. Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning. InProc. ASPLOS
work page 2025
-
[25]
Qwen Team. 2024. Qwen2.5 Technical Report.arXiv:2412.15115(2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
Qwen Team. 2025. Qwen3 Technical Report.arXiv:2505.09388(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [27]
-
[28]
H. Wang, J. Zhai, M. Gao, F. Zhang, T. Wang, Z. Ma, S. Tang, L. Zheng, W. Wang, K. Rong, Y. Chen, and Z. Jia. 2023. PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections.IEEE Trans. Comput.72 (2023), 3546–3560
work page 2023
-
[29]
M. Wu, X. Cheng, S. Liu, C. Shi, J. Ji, M. Ao, P. Velliengiri, X. Miao, O. Padon, and Z. Jia. 2024. Mirage: A Multi-Level Superoptimizer for Tensor Programs. InProc. PLDI
work page 2024
- [30]
- [31]
-
[32]
Y. Zhao, E. Johnson, P. Chatarasi, V. S. Adve, and S. Misailovic. 2025. Nep- tune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs. arXiv:2510.08726(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [33]
- [34]
- [35]
- [36]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.