RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
Pith reviewed 2026-05-07 16:32 UTC · model grok-4.3
The pith
A four-parameter wave cost model selects near-optimal kernel configurations for Mixture-of-Experts inference from the runtime expert histogram.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that a performance-region analysis derived solely from hardware constants correctly predicts when each optimization helps on all eight tested architectures, including three previously unseen. From this foundation, a four-parameter wave cost model selects the fastest polymorphic configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search after brief one-time profiling. When combined with a CuTe DSL megakernel that exposes 134-268 configurations, the method produces 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving.
What carries the argument
The four-parameter wave cost model, which estimates kernel execution time from CTA grid geometry and the runtime expert routing histogram, and uses that estimate to choose among polymorphic configurations.
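The paper's functional form is not reproduced on this page, but the shape of such a selector is easy to sketch in Python. Everything below is an illustrative assumption (the tile-based grid construction, the constants, the cost decomposition); only the four parameter roles are taken from the authors' rebuttal further down the page.

```python
import math

def predict_time(histogram, config, params, num_sms=132, n=14336, k=4096):
    """Toy wave cost model -- a sketch, not the paper's actual form.

    params maps loosely onto the four roles named in the rebuttal below:
    (launch, compute, memory, sync) = wave launch overhead, per-CTA compute
    time, per-CTA memory time, synchronization cost. config, num_sms, n, k,
    and the grid construction are all illustrative assumptions.
    """
    tile_m, tile_n = config
    # CTA grid implied by the histogram: one tiled GEMM per active expert,
    # gridded over its token count (rows) and the expert's output width n.
    ctas = sum(math.ceil(t / tile_m) * math.ceil(n / tile_n)
               for t in histogram if t > 0)
    waves = math.ceil(ctas / num_sms)        # scheduling rounds on the GPU
    launch, compute, memory, sync = params
    flops_cta = 2 * tile_m * tile_n * k      # MAC work per CTA
    bytes_cta = 2 * (tile_m + tile_n) * k    # fp16 operand traffic per CTA
    return launch * waves + (compute * flops_cta + memory * bytes_cta) * ctas + sync

def select_config(histogram, configs, params):
    """Runtime dispatch: choose the configuration with the lowest prediction."""
    return min(configs, key=lambda c: predict_time(histogram, c, params))
```

For example, `select_config([512, 3, 77], [(64, 64), (128, 64), (128, 128)], fitted_params)` would weigh the heavy expert's 512 tokens against the two sparsely routed experts when ranking tile shapes, which is exactly the effect static batch-size dispatch cannot see.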
If this is right
- Static batch-size-only dispatch leaves 10-70% of attainable kernel throughput unrealized in MoE serving.
- RaMP delivers a 1.22x kernel speedup over static dispatch, and end-to-end speedups in vLLM serving of 1.30x over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
- The same selection logic transfers to unmodified kernels such as Alpha-MoE, producing a 1.14x improvement with no source modification.
- Hardware-constant predictions hold for all eight evaluated architectures without per-architecture retuning.
Where Pith is reading between the lines
- Histogram-driven selection could extend to other sparse workloads whose optimal kernels also vary with activation patterns.
- Compiler integration might reduce the one-time profiling step to near-zero for new model variants.
- Online histogram collection could support per-request adaptation when serving mixes of models on shared hardware.
Load-bearing premise
That a four-parameter model based only on CTA grid geometry can accurately rank kernel configurations across different expert routing distributions.
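A direct probe of this premise, independent of end-to-end speedups, is rank correlation between predicted and measured kernel times over the configuration set. A minimal sketch using SciPy (the function name is ours):

```python
from scipy import stats

def ranking_fidelity(predicted_times, measured_times):
    """Spearman rank correlation between the cost model's predictions and
    measured times across the same configurations; 1.0 means the model
    orders every configuration correctly, which is all a selector needs."""
    rho, _ = stats.spearmanr(predicted_times, measured_times)
    return rho
```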
What would settle it
Recording a mean regret substantially above 0.93% when the fitted wave cost model is applied to a new MoE architecture or GPU not used during the initial 10-24 minute profiling.
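For concreteness, the regret metric such a test would report, under its usual definition as relative slowdown versus the per-workload exhaustive-search optimum (names hypothetical):

```python
def mean_regret(times_per_workload, chosen):
    """Mean relative regret of a selector against exhaustive search.

    times_per_workload: one dict per evaluated histogram, mapping each
                        configuration to its measured kernel time.
    chosen:             the configuration the selector picked per workload.
    Returns a fraction, e.g. 0.0093 for the 0.93% figure quoted above.
    """
    regrets = []
    for times, pick in zip(times_per_workload, chosen):
        best = min(times.values())           # exhaustive-search optimum
        regrets.append((times[pick] - best) / best)
    return sum(regrets) / len(regrets)
```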
Original abstract
The optimal kernel configuration for Mixture-of-Experts (MoE) inference depends on both batch size and the expert routing distribution, yet production systems dispatch from batch size alone, leaving 10-70% of kernel throughput unrealized. We present RaMP, a routing-aware dispatch framework. A performance-region analysis derives, from hardware constants alone, when each optimization helps, correctly predicting all 8 tested architectures, including 3 unseen. A four-parameter wave cost model selects the fastest configuration from the runtime expert histogram, achieving 0.93% mean regret versus exhaustive search, fitted from just 10-24 minutes of one-time profiling per model. Because the model depends only on CTA grid geometry, it is kernel-agnostic: applied to Alpha-MoE, it delivers 1.14x with no source modification. Paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP delivers 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving over Triton, 1.41x over DeepGEMM, and 1.13x over FlashInfer CUTLASS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RaMP, a routing-aware dispatch framework for Mixture-of-Experts inference. It features a performance-region analysis derived from hardware constants alone that predicts when each optimization helps and correctly forecasts behavior across all 8 tested architectures (including 3 unseen). A four-parameter wave cost model, fitted via 10-24 minutes of one-time profiling per model, selects the fastest configuration from the runtime expert histogram and achieves 0.93% mean regret versus exhaustive search. The model depends only on CTA grid geometry, making it kernel-agnostic; when paired with a co-designed CuTe DSL kernel exposing 134-268 polymorphic configurations, RaMP yields 1.22x kernel speedup over static dispatch and 1.30x end-to-end speedup in vLLM serving (1.41x over DeepGEMM, 1.13x over FlashInfer CUTLASS).
Significance. If the performance-region analysis and low-regret selection hold, the work could meaningfully advance efficient MoE serving by closing the 10-70% throughput gap left by batch-size-only dispatch. The kernel-agnostic property, low profiling overhead, and demonstrated speedups on multiple architectures (including application to Alpha-MoE with no source changes) are strengths that would support practical adoption in production inference systems.
major comments (3)
- [Performance-region analysis] The central claim that the performance-region analysis derives purely from hardware constants and correctly predicts optimization benefits on all 8 architectures (including 3 unseen) is load-bearing for both the kernel-agnostic property and the reported speedups, yet no derivation steps, explicit equations, or list of constants appear in the manuscript. This leaves the generalization risk unaddressed.
- [Wave cost model] Four-parameter wave cost model: the model is fitted directly to profiling data collected on the target hardware, raising a circularity concern for the 0.93% mean regret claim; the fitting/validation procedure (including how the four parameters were chosen and whether cross-hardware testing was performed) must be detailed to confirm it is not post-hoc tuning.
- [Experimental results] Empirical evaluation: the abstract and results report concrete speedups (1.22x kernel, 1.30x end-to-end) and low regret without error bars, number of runs, or statistical significance tests; the tables or figures presenting these numbers should include variance to allow assessment of robustness.
minor comments (2)
- The abstract would be clearer if it briefly defined 'megakernel polymorphism' and 'CTA grid geometry' on first use.
- [Wave cost model] Notation for the wave cost model parameters is introduced without an accompanying equation or table listing their values across architectures.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and commit to revisions that strengthen the presentation of our methods and results.
Point-by-point responses
- Referee: [Performance-region analysis] The central claim that the performance-region analysis derives purely from hardware constants and correctly predicts optimization benefits on all 8 architectures (including 3 unseen) is load-bearing for both the kernel-agnostic property and the reported speedups, yet no derivation steps, explicit equations, or list of constants appear in the manuscript. This leaves the generalization risk unaddressed.
Authors: We agree that the manuscript lacks sufficient detail on the derivation. The performance-region analysis is constructed from a roofline comparison of each configuration's arithmetic intensity against hardware constants (peak FP16 throughput, memory bandwidth, L2 cache size, and CTA occupancy limits) obtained from vendor specifications. Regions are delineated by the balance point where memory-bound vs. compute-bound behavior changes. We will add a new subsection (or appendix) containing the explicit equations, the full list of constants for all eight architectures, and the step-by-step prediction procedure that was validated on the three unseen architectures. revision: yes
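The derivation itself is not in the provided text; what follows is a minimal sketch of the roofline-style test the response describes, with the decision rule and the ballpark constants as our assumptions:

```python
def machine_balance(peak_flops, mem_bw):
    """Roofline balance point (FLOP/byte): the arithmetic intensity above
    which a kernel becomes compute-bound on this part. Both inputs are
    vendor spec constants, e.g. roughly 989e12 FLOP/s dense FP16 and
    3.35e12 B/s HBM bandwidth for an H100 SXM (ballpark figures)."""
    return peak_flops / mem_bw

def memory_optimization_helps(arith_intensity, peak_flops, mem_bw):
    """Toy region test in the spirit described above: an optimization that
    reduces memory traffic is predicted to pay off only while the kernel
    sits in the memory-bound region, i.e. below the balance point."""
    return arith_intensity < machine_balance(peak_flops, mem_bw)
```

On those H100 numbers the balance point is about 295 FLOP/byte, so a configuration whose tiles achieve lower intensity lands in the region where, under this toy rule, memory-side optimizations are predicted to help.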
- Referee: [Wave cost model] Four-parameter wave cost model: the model is fitted directly to profiling data collected on the target hardware, raising a circularity concern for the 0.93% mean regret claim; the fitting/validation procedure (including how the four parameters were chosen and whether cross-hardware testing was performed) must be detailed to confirm it is not post-hoc tuning.
Authors: The four parameters map directly to observable quantities (wave launch overhead, per-CTA compute time, per-CTA memory time, and synchronization cost) and are fitted once via least-squares on a modest set of representative histograms collected in 10-24 minutes. The 0.93% regret is computed against exhaustive search on the identical hardware and workload distribution, which is the correct baseline for a runtime selector. We will expand the manuscript with the exact fitting procedure, rationale for the four-parameter form, cross-validation protocol (hold-out histograms), and results of applying the model structure across architectures (with per-hardware refitting of coefficients). revision: yes
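Under the toy linear form sketched near the top of this page, that fit is a single least-squares solve. A sketch assuming NumPy and regressors matching the four named roles:

```python
import numpy as np

def fit_wave_params(features, times):
    """Least-squares fit of the four wave cost coefficients (a sketch).

    features: (N, 4) array, one row per profiled (histogram, config) run,
              holding the regressors the linear model multiplies by
              (launch, compute, memory, sync) -- e.g. [waves, total_flops,
              total_bytes, 1] under the toy form sketched earlier.
    times:    (N,) measured kernel times for those runs.
    """
    params, *_ = np.linalg.lstsq(features, times, rcond=None)
    return params
```

The cross-validation the authors commit to would then score the held-out histograms with a regret metric like the one sketched earlier on this page.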
- Referee: [Experimental results] Empirical evaluation: the abstract and results report concrete speedups (1.22x kernel, 1.30x end-to-end) and low regret without error bars, number of runs, or statistical significance tests; the tables or figures presenting these numbers should include variance to allow assessment of robustness.
Authors: We concur that variance information improves assessment of robustness. All reported speedups and regret figures are means over at least five independent runs per configuration. We will update the relevant tables and figures to display standard deviations as error bars, state the number of repetitions explicitly, and add a brief discussion of statistical significance (e.g., paired t-tests for key comparisons). revision: yes
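A minimal sketch of the committed analysis, assuming SciPy and paired per-run latencies (all names hypothetical):

```python
import numpy as np
from scipy import stats

def summarize(baseline_ms, ramp_ms):
    """Per-run speedups with spread, plus a paired t-test.

    baseline_ms, ramp_ms: latencies from repeated runs of the same
    workloads, in matching order, so the t-test pairs correctly.
    """
    base, ramp = np.asarray(baseline_ms), np.asarray(ramp_ms)
    speedups = base / ramp
    _, p_value = stats.ttest_rel(base, ramp)   # paired t-test across runs
    return speedups.mean(), speedups.std(ddof=1), p_value
```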
Circularity Check
No significant circularity in the derivation chain
Full rationale
The abstract explicitly describes the four-parameter wave cost model as fitted from one-time profiling data and the performance-region analysis as derived from hardware constants alone, with empirical results (0.93% regret, correct predictions on 8 architectures including 3 unseen) presented as outcomes rather than definitions. The provided text contains no self-referential definitions and no constructions that reduce predictions to their own inputs. The kernel-agnostic claim and the speedups rest on these stated derivations, with no sign of load-bearing self-citation or ansatz smuggling. The derivation chain is therefore validated against external benchmarks rather than closing on itself.
Axiom & Free-Parameter Ledger
free parameters (1)
- the four wave cost model parameters (per the rebuttal: wave launch overhead, per-CTA compute time, per-CTA memory time, and synchronization cost)
axioms (1)
- domain assumption: performance regions for kernel optimizations can be derived from hardware constants alone and correctly predict behavior on unseen architectures
Reference graph
Works this paper leans on
- [1] DeepSeek-AI, "DeepSeek-V3 technical report," arXiv preprint arXiv:2412.19437, 2024. https://doi.org/10.48550/arXiv.2412.19437
- [2] N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert, Y. Gu, S. Arora, A. Bhagia, D. Schwenk, D. Wadden, A. Wettig, B. Hui, T. Dettmers, D. Kiela, A. Farhadi, N. A. Smith, P. W. Koh, A. Singh, and H. Hajishirzi, "OLMoE: Open mixture-of-experts language models," arXiv preprint arXiv:2409.02060, 2024. https://doi.org/10.48550/arXiv.2409.02060
- [3] Qwen Team, "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025. https://doi.org/10.48550/arXiv.2505.09388
- [4] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, "Efficient memory management for large language model serving with PagedAttention," in Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pp. 611–626, 2023. https://doi.org/10.1145/3600006.3613165
- [5] Aleph Alpha, "Alpha-MoE: Fused mixture-of-experts kernel." https://github.com/Aleph-Alpha/Alpha-MoE, 2025.
- [6] DeepSeek-AI, "DeepGEMM: Clean and efficient FP8 GEMM kernels." https://github.com/deepseek-ai/DeepGEMM, 2025.
- [7] Z. Ye, L. Chen, R. Lai, W. Lin, Y. Zhang, S. Wang, T. Chen, B. Kasikci, V. Grover, A. Krishnamurthy, and L. Ceze, "FlashInfer: Efficient and customizable attention engine for LLM inference serving," in Proceedings of Machine Learning and Systems (MLSys), 2025. https://doi.org/10.48550/arXiv.2501.01005
- [8] W. Guo, M. Mishra, X. Cheng, I. Stoica, and T. Dao, "SonicMoE: Accelerating MoE with IO and tile-aware optimizations," arXiv preprint arXiv:2512.14080, 2025. https://doi.org/10.48550/arXiv.2512.14080
- [9] NVIDIA, "CUTLASS: CUDA templates for linear algebra subroutines." https://github.com/NVIDIA/cutlass, 2024.
- [10] T. Gale, D. Narayanan, C. Young, and M. Zaharia, "MegaBlocks: Efficient sparse training with mixture-of-experts," in Proceedings of Machine Learning and Systems (MLSys), 2023. https://doi.org/10.48550/arXiv.2211.15841
- [11] C. Hwang, W. Cui, Y. Xiong, Z. Yang, Z. Liu, H. Hu, Z. Wang, R. Salas, J. Jose, P. Ram, H. Chau, P. Cheng, F. Yang, M. Yang, and Y. Xiong, "Tutel: Adaptive mixture-of-experts at scale," in Proceedings of Machine Learning and Systems (MLSys), 2023. https://doi.org/10.48550/arXiv.2206.03382
- [12] J. He, J. Zhai, T. Antunes, H. Wang, F. Luo, S. Shi, and Q. Li, "FasterMoE: Modeling and optimizing training of large-scale dynamic pre-trained models," in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 120–134, 2022. https://doi.org/10.1145/3503221.3508418
- [13] S. Tan, Y. Shen, R. Panda, and A. Courville, "Scattered mixture-of-experts implementation," arXiv preprint arXiv:2403.08245, 2024. https://doi.org/10.48550/arXiv.2403.08245
- [14] L. Zheng, L. Yin, Z. Xie, C. Sun, J. Huang, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng, "SGLang: Efficient execution of structured language model programs," in Advances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024. https://doi.org/10.48550/arXiv.2312.07104
- [15] P. Tillet, H. T. Kung, and D. Cox, "Triton: An intermediate language and compiler for tiled neural network computations," in Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL), pp. 10–19, 2019. https://doi.org/10.1145/3315508.3329973
- [16] L. Zheng, C. Jia, M. Sun, Z. Wu, C. H. Yu, A. Haj-Ali, Y. Wang, J. Yang, D. Zhuo, K. Sen, J. E. Gonzalez, R. Bodik, and I. Stoica, "Ansor: Generating high-performance tensor programs for deep learning," in 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 863–879, 2020. https://doi.org/10.48550/arXiv.2006.06762
- [17] S. Williams, A. Waterman, and D. Patterson, "Roofline: An insightful visual performance model for multicore architectures," Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009. https://doi.org/10.1145/1498765.1498785
- [18] H. Stengel, J. Treibig, G. Hager, and G. Wellein, "Quantifying performance bottlenecks of stencil computations using the Execution-Cache-Memory model," in Proceedings of the 29th ACM International Conference on Supercomputing (ICS), pp. 207–216, 2015. https://doi.org/10.1145/2751205.2751240