pith. sign in

arxiv: 2606.28565 · v1 · pith:CVBUA33Lnew · submitted 2026-06-26 · 💻 cs.PF · cs.AI· cs.AR

KernelSight-LM: A Kernel-Level LLM Inference Simulator

Pith reviewed 2026-06-30 00:44 UTC · model grok-4.3

classification 💻 cs.PF cs.AIcs.AR
keywords LLM inference simulationkernel latency predictionGPU performance modelingroofline analysisdiscrete-event schedulingserving systemshardware co-designcontinuous batching
0
0 comments X

The pith

KernelSight-LM predicts per-kernel LLM inference latency on unseen GPU generations to 12.1% error with no target measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a simulator that decomposes LLM serving steps into roofline kernel execution, communication, and host overhead components scheduled by a discrete-event system that also handles prefix caching and continuous batching. It offers a cross-generation prediction mode that relies solely on hardware specifications and microbenchmarks from earlier GPUs rather than any measurements on the target device. A sympathetic reader would care because current practice requires slow, deployment-specific on-device benchmarking that does not generalize across hardware generations or serving configurations. The work shows this decomposition yields end-to-end median errors of 15.4% for TTFT, 12.8% for TPOT, and 3.0% for throughput across six model families in the cross-generation tier.

Core claim

KernelSight-LM decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that captures prefix caching and continuous batching. The cross-generation tier uses only hardware specifications and kernel microbenchmarks from previously profiled GPUs to predict per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline of 22.0%. The target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU to reach 3.8% per-kernel error.

What carries the argument

Roofline kernel model with a learned efficiency term fitted from microbenchmarks, composed with communication and host-overhead models via a discrete-event scheduler.

If this is right

  • End-to-end median errors reach 15.4% TTFT, 12.8% TPOT, and 3.0% throughput in cross-generation mode across six model families.
  • Target-measured mode sharpens per-kernel error to 3.8% with a single model-agnostic microbenchmark sweep.
  • Kernel-level bottleneck breakdowns directly support hardware and software co-design decisions.
  • Both tiers require far less target-GPU data collection than prior profiling systems they extend.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition could be tested on non-GPU accelerators if the roofline-plus-efficiency structure holds.
  • Capacity planning tools could ingest the kernel breakdowns to estimate cluster sizing before hardware purchase.
  • Serving policy designers could iterate on batching and caching rules using simulated traces instead of repeated deployments.

Load-bearing premise

The efficiency term learned from microbenchmarks on previously profiled GPUs generalizes to new GPU generations and diverse serving workloads without target-specific measurements or retraining.

What would settle it

Running the cross-generation tier on a new GPU generation and observing per-kernel prediction error that exceeds 12.1% or fails to beat the 22.0% roofline baseline by a substantial margin would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.28565 by Ashish Khetan, George Karypis, Hengzhi Pei, Kyle Ulrich, Leonard Lausen, Martin Herbordt, Taeho Kim, Xiang Song, Xinle Liu, Xiteng Yao.

Figure 1
Figure 1. Figure 1: A GPU is an array of streaming multiprocessors [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Wave quantization. A kernel’s 𝐵 thread blocks are scheduled onto the 𝑁SM SMs in waves; when 𝐵 is not a multiple of 𝑁SM the final wave is partial and leaves SMs idle (here 9 blocks on 4 SMs span 3 waves, leaving 3 of 4 SMs idle in the last), inflating execution time by 𝑢 = ⌈𝐵/𝑁SM⌉ 𝑁SM/𝐵 ≥ 1 in the compute roofline (Eq. (2)) [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: An example of a roofline plot of A100 GPU. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: An overview of KernelSight-LM’s architecture. The left-hand input boxes are the data the user provides (hardware [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Our per-kernel latency prediction workflow. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The bounded analytical head adapted from [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: NVLink and PCIe differ in absolute bandwidth, but [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Kernel-time composition over an example single request’s progress (Qwen3-8B, GB200). During prefill, GEMM [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: GB200 real-vs-sim accuracy across select serving runs (models [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: The cross-generation test. Per-kernel 𝜂 = measured/roofline versus each family’s dominant shape parameter, across six GPUs. The five analysis devices are gray; GB200 (blue) is the held-out target. Where GB200’s curve lies within the analysis envelope (GEMM), 𝜂 interpolates; where it lies apart (attention, KV-cache, RoPE), 𝜂 is device-specific and cannot be recovered from the other devices. Dashed: GB200’s… view at source ↗
Figure 11
Figure 11. Figure 11: When 𝜂 extrapolates. Each family placed by 𝜂 magnitude (geomean over devices; 1 = roofline exact) and cross-device spread (max/min; 1 = portable). The safe region (blue) holds families where the roofline is accurate and 𝜂 ports; the danger region (orange) holds those where 𝜂 both carries the prediction and varies across devices [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prediction error vs. offered load (One run of Qwen3, TP1, GB200). [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
read the original abstract

As large language models (LLMs) move into production serving, practitioners must rapidly evaluate inference performance across diverse hardware, models, and serving parameters to meet cost and latency targets. However, the end-to-end behavior of LLMs couples serving-layer policies with low-level GPU kernel execution and rapidly evolving architectures, forcing slow, deployment-specific benchmarking that is hard to generalize. We present KernelSight-LM, a fine-grained inference simulator that models token-level execution and produces kernel-level latency breakdowns. It decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that also captures mechanisms like prefix caching and continuous batching. KernelSight-LM offers two prediction tiers that trade target-GPU data for accuracy. The cross-generation tier uses no target-GPU measurements, only hardware specifications and kernel microbenchmarks from previously profiled GPUs, and predicts per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline (22.0%). A second target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, sharpening per-kernel error to 3.8%, a 7.3x improvement over a comparable baseline (27.7%). Both tiers require far less target-GPU data than the prior systems they extend. In our simulator, these predictions yield end-to-end median (p50) errors across six model families of 15.4%, 12.8%, and 3.0% (TTFT, TPOT, throughput) in the cross-generation tier and 14.3%, 6.2%, and 2.7% in the target-measured tier, matching dedicated profiling tools while collecting far less on-device data. Beyond prediction, its kernel-level bottleneck breakdowns support hardware/software co-design and capacity planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents KernelSight-LM, a simulator for LLM inference performance that models token-level execution by decomposing each serving step into a roofline-based kernel model augmented with a learned efficiency term, a communication model, and a host-overhead model. These are composed using a discrete-event scheduler that incorporates serving mechanisms such as prefix caching and continuous batching. The system provides two prediction tiers: a cross-generation tier that uses only hardware specifications and prior GPU microbenchmarks to predict per-kernel latencies on unseen GPU generations with 12.1% error (1.8× improvement over a 22.0% roofline baseline), and a target-measured tier that adds one model-agnostic microbenchmark sweep to achieve 3.8% error. End-to-end median errors across six model families are reported as 15.4%, 12.8%, and 3.0% for TTFT, TPOT, and throughput in the cross-generation tier, and 14.3%, 6.2%, and 2.7% in the target-measured tier.

Significance. If the generalization of the learned efficiency term holds, KernelSight-LM would provide a valuable tool for evaluating LLM inference across diverse hardware with minimal target-specific data collection, supporting hardware/software co-design and capacity planning in production serving environments. The kernel-level bottleneck breakdowns are a particular strength for identifying optimization opportunities. The paper is credited for extending roofline models with a learned term and discrete-event simulation to capture serving dynamics, and for reporting concrete quantitative improvements over baselines.

major comments (3)
  1. [Abstract] The abstract claims a 12.1% per-kernel error for the cross-generation tier using 'no target-GPU measurements' and 'kernel microbenchmarks from previously profiled GPUs'. No information is given on the number of source GPU generations used to derive the learned efficiency term, the exact fitting procedure, or validation across multiple target generations. This is load-bearing for the central claim of 1.8× improvement over the roofline baseline, as the error may reflect tuning rather than portable generalization.
  2. [§3 (Model Description)] The decomposition into roofline kernel, communication, and host-overhead models with the learned efficiency term is described, but the manuscript does not clarify if the efficiency term is a single scalar or incorporates additional parameters, nor how it is fitted without target data. This directly affects whether the cross-generation predictions are independent of the target architecture.
  3. [§5 (Evaluation)] The end-to-end results (15.4% TTFT, 12.8% TPOT, 3.0% throughput for cross-generation) are presented across six model families, but the section supplies no experimental setup details, baseline implementations, validation splits, or confirmation that the learned term was not fitted in a manner that inflates the cross-generation performance. This undermines assessment of the reported errors.
minor comments (2)
  1. [Abstract] The specific model families used in the evaluation are not named; providing their identities would aid reproducibility and context.
  2. [Throughout] The notation and definitions for the learned efficiency term and its integration into the roofline model could be more explicitly introduced early in the text to improve accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight areas where additional clarity on the cross-generation methodology and experimental setup would strengthen the manuscript. We address each major comment below and will incorporate revisions to provide the requested details without altering the core claims or results.

read point-by-point responses
  1. Referee: [Abstract] The abstract claims a 12.1% per-kernel error for the cross-generation tier using 'no target-GPU measurements' and 'kernel microbenchmarks from previously profiled GPUs'. No information is given on the number of source GPU generations used to derive the learned efficiency term, the exact fitting procedure, or validation across multiple target generations. This is load-bearing for the central claim of 1.8× improvement over the roofline baseline, as the error may reflect tuning rather than portable generalization.

    Authors: We agree the abstract would benefit from explicit mention of the source data scope to support the generalization claim. Section 3 of the manuscript details the use of microbenchmarks from multiple prior GPU generations to fit the efficiency term via regression, with validation on unseen target generations. We will revise the abstract to briefly note the source GPU generations and that fitting uses only prior data, preserving the reported 12.1% error and 1.8× improvement as measured on held-out targets. revision: yes

  2. Referee: [§3 (Model Description)] The decomposition into roofline kernel, communication, and host-overhead models with the learned efficiency term is described, but the manuscript does not clarify if the efficiency term is a single scalar or incorporates additional parameters, nor how it is fitted without target data. This directly affects whether the cross-generation predictions are independent of the target architecture.

    Authors: The efficiency term is implemented as a single scalar per kernel type, fitted exclusively on source GPU microbenchmarks by regressing the ratio of measured to roofline-predicted latency. No target-GPU data enters the fit, ensuring independence. We will add an explicit sentence in §3.2 stating the scalar nature and confirming the fitting procedure uses only prior-generation data. revision: yes

  3. Referee: [§5 (Evaluation)] The end-to-end results (15.4% TTFT, 12.8% TPOT, 3.0% throughput for cross-generation) are presented across six model families, but the section supplies no experimental setup details, baseline implementations, validation splits, or confirmation that the learned term was not fitted in a manner that inflates the cross-generation performance. This undermines assessment of the reported errors.

    Authors: We acknowledge that §5 would be strengthened by expanded setup details. The evaluation uses source-only fitting of the efficiency term with a validation split on unseen targets and generations; baselines are pure roofline models without the learned term. We will revise §5 to include the validation splits, baseline implementations, and explicit confirmation that fitting excludes all target data, ensuring the reported errors reflect true cross-generation performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; cross-generation prediction tests generalization of fitted efficiency term on unseen GPUs.

full rationale

The paper's core claim decomposes inference into a roofline model plus a learned efficiency term fitted exclusively on microbenchmarks from previously profiled GPUs, then applies the term (with only target hardware specifications) to predict latencies on entirely unseen GPU generations. This is a standard held-out generalization test rather than a self-definitional loop or a fitted parameter renamed as a prediction on the same data. No equations or self-citations are shown reducing the reported 12.1% error to the input microbenchmarks by construction; the baseline comparison (pure roofline) further isolates the contribution of the fitted term as an independent modeling choice. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on a learned efficiency term whose fitting procedure is not detailed and on the assumption that the three-component decomposition plus discrete-event scheduler captures all relevant LLM serving dynamics.

free parameters (1)
  • learned efficiency term
    Adjustment factor in the roofline kernel model; described as learned and required for the reported accuracy gains.
axioms (1)
  • domain assumption LLM inference execution can be decomposed into independent roofline kernel, communication, and host-overhead models whose composition via discrete-event scheduling reproduces end-to-end behavior.
    Invoked to justify the simulator architecture and the two prediction tiers.

pith-pipeline@v0.9.1-grok · 5926 in / 1426 out tokens · 60223 ms · 2026-06-30T00:44:48.400177+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 29 canonical work pages · 13 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, et al . 2024. Phi-4 Technical Report. arXiv:2412.08905 [cs.CL] doi:10.48550/arXiv.2412.08905

  2. [2]

    Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Vidur: A Large-Scale Simulation Framework for Llm Inference.Proceedings of Machine Learning and Systems6 (2024), 351–366

  3. [3]

    Gulavani, Alexey Tumanov, and Ramachandran Ramjee

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwa- tra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310 [cs.LG] doi:10.48550/arXiv.2403.02310

  4. [4]

    Gulavani, and Ramachandran Ramjee

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG] doi:10. 48550/arXiv.2308.16369

  5. [5]

    Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training Generalized Multi-Query Trans- former Models from Multi-Head Checkpoints.arXiv preprint arXiv:2305.13245 (2023). arXiv:2305.13245

  6. [6]

    Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, et al. 2024. Deepseek Llm: Scaling Open-Source Language Models with Longtermism.arXiv preprint arXiv:2401.02954(2024). arXiv:2401.02954

  7. [7]

    Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. 2020. Language Models Are Few-Shot Learners. InProceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, 1877–1901

  8. [8]

    Jaehong Cho, Hyunmin Choi, and Jongse Park. 2025. LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure.IEEE Computer Architecture Letters24, 2 (July 2025), 361–364. arXiv:2511.07229 [cs] doi:10.1109/LCA.2025.3628325

  9. [9]

    Tri Dao. 2023. Flashattention-2: Faster Attention with Better Parallelism and Work Partitioning.arXiv preprint arXiv:2307.08691(2023). arXiv:2307.08691

  10. [10]

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG] doi:10.48550/arXiv.2205.14135

  11. [11]

    Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, et al. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. arXiv:2403.19708 [cs.CL] doi:10.48550/arXiv.2403.19708

  12. [12]

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. arXiv:2311.04934 [cs.CL] doi:10.48550/arXiv.2311.04934

  13. [13]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, et al. 2024. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783(2024). arXiv:2407.21783

  14. [14]

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, et al. 2023. Text- books Are All You Need. arXiv:2306.11644 [cs.CL] doi:10.48550/arXiv.2306.11644

  15. [15]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guant- ing Chen, Xiao Bi, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] doi:10.48550/arXiv.2401.14196

  16. [16]

    Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai, Xuefei Ning, Shengen Yan, Yun Liang, et al. 2025. SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling. InEighth Conference on Machine Learning and Systems

  17. [17]

    Hugging Face. 2023. Text Generation Inference

  18. [18]

    Saki Imai, Rina Nakazawa, Marcelo Amaral, Sunyanan Choochotkaew, and Tat- suhiro Chiba. 2024. Predicting LLM Inference Latency: A Roofline-Driven ML Method. InAnnual Conference on Neural Information Processing Systems

  19. [19]

    Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

    Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826 [cs.DC] doi:10.48550/arXiv.1804.06826

  20. [20]

    Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al

  21. [21]

    Mistral 7B

    Mistral 7B. arXiv:2310.06825 [cs.CL] doi:10.48550/arXiv.2310.06825

  22. [22]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, et al. 2023. Efficient Memory Man- agement for Large Language Model Serving with Pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626

  23. [23]

    Seonho Lee, Amar Phanishayee, and Divya Mahajan. 2025. Forecasting GPU Performance for Deep Learning Training and Inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 493–508. arXiv:2407.13853 [cs] doi:10.1145/3669940.3707265

  24. [24]

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient Generative Inference of Large Language Models with Dynamic {KV} Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

  25. [25]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive Nlp Tasks.Advances in neural information processing systems33 (2020), 9459–9474

  26. [26]

    Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, and Fanny Nina Paravecino. 2025. APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving. arXiv:2411.17651 [cs] doi:10.48550/arXiv.2411.17651

  27. [27]

    Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien- Ming Huang. 2023. Llm-Powered Conversational Voice Assistants: Interaction Patterns, Opportunities, Challenges, and Design Guidelines.arXiv preprint arXiv:2309.13879(2023). arXiv:2309.13879

  28. [28]

    Microsoft, Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, et al. 2025. Phi-4-Mini Tech- nical Report: Compact yet Powerful Multimodal Language Models via Mixture- of-LoRAs. arXiv:2503.01743 [cs.CL] doi:10.48550/arXiv.2503.01743

  29. [29]

    NVIDIA Corporation. 2023. Matrix Multiplication Background User’s Guide. NVIDIA Deep Learning Performance Documentation, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix- multiplication/index.html. Accessed: 2026-06-25

  30. [30]

    NVIDIA Corporation. 2023. TensorRT-LLM: NVIDIA’s Inference Optimization Library

  31. [31]

    NVIDIA Corporation. 2024. NCCL: Optimized Primitives for Collective Multi- GPU Communication. https://github.com/NVIDIA/nccl. Accessed: 2026-06-25

  32. [32]

    NVIDIA Corporation. 2024. NVIDIA CUDA Toolkit, Version 12.x

  33. [33]

    Pitch Patarasuk and Xin Yuan. 2009. Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.J. Parallel and Distrib. Comput.69, 2 (Feb. 2009), 117–124. doi:10.1016/j.jpdc.2008.09.002

  34. [34]

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Infer- ence Using Phase Splitting. In2024 ACM/IEEE 51st Annual International Sympo- sium on Computer Architecture (ISCA). IEEE, Buenos Aires, Argentina, 118–132. doi:10.1109/ISCA59077.2024.00019 arXiv Prepr...

  35. [35]

    Sparks, and Ameet Talwalkar

    Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance Model for Deep Neural Networks. InInternational Conference on Learning Repre- sentations

  36. [36]

    Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC] doi:10.48550/arXiv.2407. 00079

  37. [37]

    Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, et al. 2025. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] doi:10.48550/arXiv.2412.15115

  38. [38]

    Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna

  39. [39]

    In2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

    ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. In2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 81–92. doi:10.1109/ISPASS48437.2020.00018

  40. [40]

    James Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel. 2022. Torch. Fx: Practical Program Capture and Transformation for Deep Learning in Python.Proceedings of Machine Learning and Systems4 (2022), 638–651

  41. [41]

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.Advances in Neural Information Processing Systems37 (2024), 68658–68685

  42. [42]

    Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head Is All You Need. arXiv:1911.02150 [cs.NE] doi:10.48550/arXiv.1911.02150

  43. [43]

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] doi:10. 48550/arXiv.1909.08053

  44. [44]

    Wei Sun, Ang Li, Tong Geng, Sander Stuijk, and Henk Corporaal. 2023. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. IEEE Transactions on Parallel and Distributed Systems34, 1 (Jan. 2023), 246–261. arXiv:2206.02874 [cs.AR] doi:10.1109/TPDS.2022.3217824

  45. [45]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cas- sidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, et al

  46. [46]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 [cs.CL] doi:10.48550/arXiv.2408.00118

  47. [47]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs] doi:10.48550/arXiv.1706.03762

  48. [48]

    Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures.Commun. ACM 52, 4 (April 2009), 65–76. doi:10.1145/1498765.1498785

  49. [49]

    William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. In2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 283–294. arXiv:2303.14006 [cs] doi:10.1109/ISPA...

  50. [50]

    Feiyang Wu, Zhuohang Bian, Guoyang Duan, Tianle Xu, Junchi Wu, Teng Ma, Yongqiang Yao, Ruihao Gong, et al . 2025. TokenSim: Enabling Hard- ware and Software Exploration for Large Language Model Inference Systems. arXiv:2503.08415 [cs] doi:10.48550/arXiv.2503.08415

  51. [51]

    Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, Yi Shen, et al. 2026. AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving. arXiv:2601.06288 [cs.LG] doi:10. 48550/arXiv.2601.06288

  52. [52]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, et al . 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] doi:10.48550/arXiv.2505.09388

  53. [53]

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung- Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538. https://www.usenix.org/conference/osdi22/ presentation/yu

  54. [54]

    Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko

    Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habi- tat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. In2021 USENIX Annual Technical Conference (USENIX ATC 21). 503–521

  55. [55]

    Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. Nn-Meter: Towards Accurate Latency Prediction of Deep- Learning Model Inference on Diverse Edge Devices. InProceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. ACM, Virtual Event Wisconsin, 81–93. doi:10.1145...

  56. [56]

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, et al . 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] doi:10.48550/arXiv.2312.07104

  57. [57]

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv:2401.09670 [cs.DC] doi:10.48550/arXiv.2401.09670 KernelSight-LM: A Kernel-Level LLM Inference Simulator arXiv Preprint, June, 2026, A Experimental Se...