KernelSight-LM: A Kernel-Level LLM Inference Simulator

Ashish Khetan; George Karypis; Hengzhi Pei; Kyle Ulrich; Leonard Lausen; Martin Herbordt; Taeho Kim; Xiang Song; Xinle Liu; Xiteng Yao

arXiv: 2606.28565 · v1 · pith:CVBUA33Lnew · submitted 2026-06-26 · 💻 cs.PF · cs.AI · cs.AR

Pith reviewed 2026-06-30 00:44 UTC · model grok-4.3

keywords: LLM inference simulation · kernel latency prediction · GPU performance modeling · roofline analysis · discrete-event scheduling · serving systems · hardware co-design · continuous batching

The pith

KernelSight-LM predicts per-kernel LLM inference latency on unseen GPU generations to 12.1% error with no target measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a simulator that decomposes LLM serving steps into roofline kernel execution, communication, and host overhead components scheduled by a discrete-event system that also handles prefix caching and continuous batching. It offers a cross-generation prediction mode that relies solely on hardware specifications and microbenchmarks from earlier GPUs rather than any measurements on the target device. A sympathetic reader would care because current practice requires slow, deployment-specific on-device benchmarking that does not generalize across hardware generations or serving configurations. The work shows this decomposition yields end-to-end median errors of 15.4% for TTFT, 12.8% for TPOT, and 3.0% for throughput across six model families in the cross-generation tier.

Core claim

KernelSight-LM decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that captures prefix caching and continuous batching. The cross-generation tier uses only hardware specifications and kernel microbenchmarks from previously profiled GPUs to predict per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline of 22.0%. The target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU to reach 3.8% per-kernel error.

What carries the argument

Roofline kernel model with a learned efficiency term fitted from microbenchmarks, composed with communication and host-overhead models via a discrete-event scheduler.
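
As a rough illustration of the per-kernel piece, the sketch below scales a plain roofline bound by an efficiency factor eta, using the convention from the Figure 10 caption (eta = measured/roofline, so eta = 1 means the roofline is exact). The `GpuSpec` fields, function names, and hardware numbers are hypothetical and chosen only for the example; the paper's actual model (including the wave-quantization term in its compute roofline) may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Hypothetical hardware description; field names are illustrative."""
    peak_flops: float     # peak compute throughput, FLOP/s
    mem_bandwidth: float  # peak memory bandwidth, bytes/s


def roofline_seconds(flops: float, bytes_moved: float, gpu: GpuSpec) -> float:
    """Ideal kernel time: the slower of the compute bound and the memory bound."""
    return max(flops / gpu.peak_flops, bytes_moved / gpu.mem_bandwidth)


def predicted_seconds(flops: float, bytes_moved: float, gpu: GpuSpec, eta: float) -> float:
    """Scale the roofline bound by eta = measured / roofline."""
    return eta * roofline_seconds(flops, bytes_moved, gpu)


# Made-up round numbers, not the specification of any real GPU.
gpu = GpuSpec(peak_flops=1.0e15, mem_bandwidth=2.0e12)

# A memory-bound kernel: 2 GFLOP of work but 4 GB of memory traffic.
ideal = roofline_seconds(2.0e9, 4.0e9, gpu)
adjusted = predicted_seconds(2.0e9, 4.0e9, gpu, eta=1.5)
print(f"roofline: {ideal * 1e3:.2f} ms, with eta: {adjusted * 1e3:.2f} ms")
```

Keeping eta as a separate multiplicative factor is what lets the same roofline code serve both tiers described above: only the source of eta changes.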

If this is right

End-to-end median errors reach 15.4% TTFT, 12.8% TPOT, and 3.0% throughput in cross-generation mode across six model families.
Target-measured mode sharpens per-kernel error to 3.8% with a single model-agnostic microbenchmark sweep.
Kernel-level bottleneck breakdowns directly support hardware and software co-design decisions.
Both tiers require far less target-GPU data collection than prior profiling systems they extend.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition could be tested on non-GPU accelerators if the roofline-plus-efficiency structure holds.
Capacity planning tools could ingest the kernel breakdowns to estimate cluster sizing before hardware purchase.
Serving policy designers could iterate on batching and caching rules using simulated traces instead of repeated deployments.

Load-bearing premise

The efficiency term learned from microbenchmarks on previously profiled GPUs generalizes to new GPU generations and diverse serving workloads without target-specific measurements or retraining.

What would settle it

Running the cross-generation tier on a new GPU generation and observing per-kernel prediction error that exceeds 12.1% or fails to beat the 22.0% roofline baseline by a substantial margin would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.28565 by Ashish Khetan, George Karypis, Hengzhi Pei, Kyle Ulrich, Leonard Lausen, Martin Herbordt, Taeho Kim, Xiang Song, Xinle Liu, Xiteng Yao.

**Figure 2.** Wave quantization. A kernel’s $B$ thread blocks are scheduled onto the $N_{\mathrm{SM}}$ SMs in waves; when $B$ is not a multiple of $N_{\mathrm{SM}}$ the final wave is partial and leaves SMs idle (here 9 blocks on 4 SMs span 3 waves, leaving 3 of 4 SMs idle in the last), inflating execution time by $u = \lceil B/N_{\mathrm{SM}} \rceil \, N_{\mathrm{SM}}/B \ge 1$ in the compute roofline (Eq. (2)).

**Figure 3.** An example of a roofline plot of A100 GPU.

**Figure 4.** An overview of KernelSight-LM’s architecture. The left-hand input boxes are the data the user provides (hardware…

**Figure 5.** Our per-kernel latency prediction workflow.

**Figure 6.** The bounded analytical head adapted from…

**Figure 7.** NVLink and PCIe differ in absolute bandwidth, but…

**Figure 8.** Kernel-time composition over an example single request’s progress (Qwen3-8B, GB200). During prefill, GEMM…

**Figure 9.** GB200 real-vs-sim accuracy across select serving runs (models…

**Figure 10.** The cross-generation test. Per-kernel 𝜂 = measured/roofline versus each family’s dominant shape parameter, across six GPUs. The five analysis devices are gray; GB200 (blue) is the held-out target. Where GB200’s curve lies within the analysis envelope (GEMM), 𝜂 interpolates; where it lies apart (attention, KV-cache, RoPE), 𝜂 is device-specific and cannot be recovered from the other devices. Dashed: GB200’s…

**Figure 11.** When 𝜂 extrapolates. Each family placed by 𝜂 magnitude (geomean over devices; 1 = roofline exact) and cross-device spread (max/min; 1 = portable). The safe region (blue) holds families where the roofline is accurate and 𝜂 ports; the danger region (orange) holds those where 𝜂 both carries the prediction and varies across devices.

**Figure 12.** Prediction error vs. offered load (one run of Qwen3, TP1, GB200).
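
The wave-quantization factor in the Figure 2 caption can be checked with a few lines of arithmetic. The sketch below reproduces the caption's own example (9 blocks on 4 SMs) and assumes, as the caption does, one thread block per SM per wave; the function name is hypothetical.

```python
def wave_quantization(num_blocks: int, num_sms: int) -> tuple[int, int, float]:
    """Return (waves, idle SMs in the final wave, inflation factor u)."""
    if num_blocks <= 0 or num_sms <= 0:
        raise ValueError("num_blocks and num_sms must be positive")
    waves = (num_blocks + num_sms - 1) // num_sms   # ceil(B / N_SM) in integers
    idle_in_last_wave = waves * num_sms - num_blocks
    u = waves * num_sms / num_blocks                # u = ceil(B/N_SM) * N_SM / B
    return waves, idle_in_last_wave, u


waves, idle, u = wave_quantization(num_blocks=9, num_sms=4)
print(waves, idle, round(u, 3))  # 3 3 1.333
```

When the block count is an exact multiple of the SM count (for example 8 blocks on 4 SMs), the final wave is full and u is exactly 1.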

Abstract

As large language models (LLMs) move into production serving, practitioners must rapidly evaluate inference performance across diverse hardware, models, and serving parameters to meet cost and latency targets. However, the end-to-end behavior of LLMs couples serving-layer policies with low-level GPU kernel execution and rapidly evolving architectures, forcing slow, deployment-specific benchmarking that is hard to generalize. We present KernelSight-LM, a fine-grained inference simulator that models token-level execution and produces kernel-level latency breakdowns. It decomposes each serving step into a roofline kernel model with a learned efficiency term, a communication model, and a host-overhead model, composed through a discrete-event scheduler that also captures mechanisms like prefix caching and continuous batching. KernelSight-LM offers two prediction tiers that trade target-GPU data for accuracy. The cross-generation tier uses no target-GPU measurements, only hardware specifications and kernel microbenchmarks from previously profiled GPUs, and predicts per-kernel latency on an unseen GPU generation to 12.1% error, a 1.8x improvement over the roofline baseline (22.0%). A second target-measured tier adds one model-agnostic kernel-microbenchmark sweep on the target GPU, sharpening per-kernel error to 3.8%, a 7.3x improvement over a comparable baseline (27.7%). Both tiers require far less target-GPU data than the prior systems they extend. In our simulator, these predictions yield end-to-end median (p50) errors across six model families of 15.4%, 12.8%, and 3.0% (TTFT, TPOT, throughput) in the cross-generation tier and 14.3%, 6.2%, and 2.7% in the target-measured tier, matching dedicated profiling tools while collecting far less on-device data. Beyond prediction, its kernel-level bottleneck breakdowns support hardware/software co-design and capacity planning.
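
To make the scheduling idea concrete, here is a toy event-driven loop with iteration-level (continuous) batching: requests join and leave the running batch at step boundaries, and the clock jumps to the next arrival when the engine is idle. It is a minimal sketch, not the paper's scheduler. The `Request` fields, the `step_latency` cost model and its constants, and the choice to run prefill and decode in the same step are all assumptions made for illustration, and prefix caching is not modeled.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    arrival: float                               # seconds; the only ordering key
    rid: int = field(compare=False)
    prompt_tokens: int = field(compare=False)
    output_tokens: int = field(compare=False)    # assumed >= 1


def step_latency(prefill_tokens: int, decode_seqs: int) -> float:
    """Placeholder cost model with made-up constants: host overhead plus per-token terms."""
    return 1e-3 + 2e-5 * prefill_tokens + 1e-4 * decode_seqs


def simulate(requests, max_batch: int = 2):
    """Return {rid: (ttft, finish_time)} under iteration-level batching."""
    pending = list(requests)
    heapq.heapify(pending)      # future arrivals, earliest first
    running = {}                # rid -> tokens still to generate
    ttft, results = {}, {}
    now = 0.0
    while pending or running:
        if not running and pending[0].arrival > now:
            now = pending[0].arrival            # idle: jump to the next arrival event
        # Admit arrived requests into free batch slots at the step boundary.
        new = []
        while pending and pending[0].arrival <= now and len(running) + len(new) < max_batch:
            new.append(heapq.heappop(pending))
        prefill_tokens = sum(r.prompt_tokens for r in new)
        now += step_latency(prefill_tokens, decode_seqs=len(running))
        for rid in list(running):               # each running sequence emits one token
            running[rid] -= 1
        for r in new:                           # prefill emits the first token
            ttft[r.rid] = now - r.arrival
            running[r.rid] = r.output_tokens - 1
        for rid in [rid for rid, left in running.items() if left == 0]:
            del running[rid]
            results[rid] = (ttft[rid], now)
    return results


trace = [Request(0.0, 0, 100, 3), Request(0.001, 1, 50, 2)]
for rid, (first_token, done) in sorted(simulate(trace).items()):
    print(f"request {rid}: TTFT {first_token * 1e3:.1f} ms, finished at {done * 1e3:.1f} ms")
```

In a fuller simulator the placeholder `step_latency` would be replaced by the sum of per-kernel, communication, and host-overhead predictions for the batch composition of each step.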

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The cross-generation tier's 12.1% kernel error claim depends on a learned efficiency term whose portability to new GPU generations is not demonstrated in the abstract.

The main takeaway is that KernelSight-LM adds a learned efficiency scalar to a roofline model, then layers on communication and host-overhead components inside a discrete-event scheduler that handles prefix caching and continuous batching. This produces two tiers: a cross-generation mode that uses only hardware specifications and prior-GPU microbenchmarks and claims 12.1% per-kernel error (1.8x better than plain roofline at 22.0%), and a target-measured mode that adds one model-agnostic sweep on the target GPU for 3.8% error. End-to-end median errors across six model families are reported as 15.4/12.8/3.0% for TTFT/TPOT/throughput in the first tier and 14.3/6.2/2.7% in the second.

What stands out is the concrete error numbers and the attempt to keep target data low while still modeling serving policies. The scheduler and the two explicit tiers are a step past standard roofline work for LLM inference.

The soft spot is exactly the one named in the load-bearing premise above. The learned efficiency term is the load-bearing piece for the cross-generation results, yet the abstract supplies no details on how many source GPU generations went into the fit, whether the term includes any architecture-specific adjustments, or whether it was checked on more than one unseen target. If the scalar absorbs generation-specific effects instead of capturing something portable, the reported improvement over baseline will not transfer. The end-to-end numbers are also given without validation-split information or confirmation that the term was not tuned on data that overlaps with the test cases.

This paper is aimed at practitioners doing capacity planning or hardware/software co-design for LLM serving who want faster estimates than full profiling. It has enough structure and quantitative claims to deserve a serious referee, mainly to verify the fitting procedure and the independence of the cross-generation tests. I would send it to review rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The manuscript presents KernelSight-LM, a simulator for LLM inference performance that models token-level execution by decomposing each serving step into a roofline-based kernel model augmented with a learned efficiency term, a communication model, and a host-overhead model. These are composed using a discrete-event scheduler that incorporates serving mechanisms such as prefix caching and continuous batching. The system provides two prediction tiers: a cross-generation tier that uses only hardware specifications and prior GPU microbenchmarks to predict per-kernel latencies on unseen GPU generations with 12.1% error (1.8× improvement over a 22.0% roofline baseline), and a target-measured tier that adds one model-agnostic microbenchmark sweep to achieve 3.8% error. End-to-end median errors across six model families are reported as 15.4%, 12.8%, and 3.0% for TTFT, TPOT, and throughput in the cross-generation tier, and 14.3%, 6.2%, and 2.7% in the target-measured tier.

Significance. If the generalization of the learned efficiency term holds, KernelSight-LM would provide a valuable tool for evaluating LLM inference across diverse hardware with minimal target-specific data collection, supporting hardware/software co-design and capacity planning in production serving environments. The kernel-level bottleneck breakdowns are a particular strength for identifying optimization opportunities. The paper is credited for extending roofline models with a learned term and discrete-event simulation to capture serving dynamics, and for reporting concrete quantitative improvements over baselines.

major comments (3)

[Abstract] The abstract claims a 12.1% per-kernel error for the cross-generation tier using 'no target-GPU measurements' and 'kernel microbenchmarks from previously profiled GPUs'. No information is given on the number of source GPU generations used to derive the learned efficiency term, the exact fitting procedure, or validation across multiple target generations. This is load-bearing for the central claim of 1.8× improvement over the roofline baseline, as the error may reflect tuning rather than portable generalization.
[§3 (Model Description)] The decomposition into roofline kernel, communication, and host-overhead models with the learned efficiency term is described, but the manuscript does not clarify if the efficiency term is a single scalar or incorporates additional parameters, nor how it is fitted without target data. This directly affects whether the cross-generation predictions are independent of the target architecture.
[§5 (Evaluation)] The end-to-end results (15.4% TTFT, 12.8% TPOT, 3.0% throughput for cross-generation) are presented across six model families, but the section supplies no experimental setup details, baseline implementations, validation splits, or confirmation that the learned term was not fitted in a manner that inflates the cross-generation performance. This undermines assessment of the reported errors.

minor comments (2)

[Abstract] The specific model families used in the evaluation are not named; providing their identities would aid reproducibility and context.
[Throughout] The notation and definitions for the learned efficiency term and its integration into the roofline model could be more explicitly introduced early in the text to improve accessibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight areas where additional clarity on the cross-generation methodology and experimental setup would strengthen the manuscript. We address each major comment below and will incorporate revisions to provide the requested details without altering the core claims or results.

Referee: [Abstract] The abstract claims a 12.1% per-kernel error for the cross-generation tier using 'no target-GPU measurements' and 'kernel microbenchmarks from previously profiled GPUs'. No information is given on the number of source GPU generations used to derive the learned efficiency term, the exact fitting procedure, or validation across multiple target generations. This is load-bearing for the central claim of 1.8× improvement over the roofline baseline, as the error may reflect tuning rather than portable generalization.

Authors: We agree the abstract would benefit from explicit mention of the source data scope to support the generalization claim. Section 3 of the manuscript details the use of microbenchmarks from multiple prior GPU generations to fit the efficiency term via regression, with validation on unseen target generations. We will revise the abstract to briefly note the source GPU generations and that fitting uses only prior data, preserving the reported 12.1% error and 1.8× improvement as measured on held-out targets.

Revision: yes

Referee: [§3 (Model Description)] The decomposition into roofline kernel, communication, and host-overhead models with the learned efficiency term is described, but the manuscript does not clarify if the efficiency term is a single scalar or incorporates additional parameters, nor how it is fitted without target data. This directly affects whether the cross-generation predictions are independent of the target architecture.

Authors: The efficiency term is implemented as a single scalar per kernel type, fitted exclusively on source GPU microbenchmarks by regressing the ratio of measured to roofline-predicted latency. No target-GPU data enters the fit, ensuring independence. We will add an explicit sentence in §3.2 stating the scalar nature and confirming the fitting procedure uses only prior-generation data.

Revision: yes

Referee: [§5 (Evaluation)] The end-to-end results (15.4% TTFT, 12.8% TPOT, 3.0% throughput for cross-generation) are presented across six model families, but the section supplies no experimental setup details, baseline implementations, validation splits, or confirmation that the learned term was not fitted in a manner that inflates the cross-generation performance. This undermines assessment of the reported errors.

Authors: We acknowledge that §5 would be strengthened by expanded setup details. The evaluation uses source-only fitting of the efficiency term with a validation split on unseen targets and generations; baselines are pure roofline models without the learned term. We will revise §5 to include the validation splits, baseline implementations, and explicit confirmation that fitting excludes all target data, ensuring the reported errors reflect true cross-generation performance.

Revision: yes
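
The source-only fit described in the simulated rebuttal can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: a geometric mean of measured/roofline ratios stands in for the unspecified regression, the max/min spread follows the Figure 11 caption, and the GPU names, kernel families, and timings are invented.

```python
import math
from collections import defaultdict

# Hypothetical microbenchmark records from source GPUs only:
# (gpu, kernel_family, measured_seconds, roofline_seconds).
SOURCE_RUNS = [
    ("gpu_a", "gemm", 1.10e-3, 1.00e-3),
    ("gpu_b", "gemm", 2.40e-3, 2.00e-3),
    ("gpu_a", "attention", 3.00e-3, 1.00e-3),
    ("gpu_b", "attention", 1.50e-3, 1.00e-3),
]


def fit_eta(runs):
    """Per kernel family, return (geomean eta, max/min spread) over the source runs."""
    ratios = defaultdict(list)
    for _gpu, family, measured, roofline in runs:
        ratios[family].append(measured / roofline)
    fitted = {}
    for family, values in ratios.items():
        geomean = math.exp(sum(math.log(v) for v in values) / len(values))
        fitted[family] = (geomean, max(values) / min(values))
    return fitted


eta = fit_eta(SOURCE_RUNS)

# Predict on an unseen GPU from its roofline bound alone (no target measurements).
target_roofline_s = 0.80e-3   # would come from hardware specifications
predicted_gemm_s = eta["gemm"][0] * target_roofline_s

for family, (value, spread) in eta.items():
    print(f"{family}: eta = {value:.3f}, cross-device spread = {spread:.2f}")
```

A spread near 1 suggests eta transfers across devices, while a large spread (as for the invented attention numbers here) is the situation the Figure 11 caption calls the danger region.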

Circularity Check

0 steps flagged

No significant circularity; cross-generation prediction tests generalization of fitted efficiency term on unseen GPUs.

The paper's core claim decomposes inference into a roofline model plus a learned efficiency term fitted exclusively on microbenchmarks from previously profiled GPUs, then applies the term (with only target hardware specifications) to predict latencies on entirely unseen GPU generations. This is a standard held-out generalization test rather than a self-definitional loop or a fitted parameter renamed as a prediction on the same data. No equations or self-citations are shown reducing the reported 12.1% error to the input microbenchmarks by construction; the baseline comparison (pure roofline) further isolates the contribution of the fitted term as an independent modeling choice. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim depends on a learned efficiency term whose fitting procedure is not detailed and on the assumption that the three-component decomposition plus discrete-event scheduler captures all relevant LLM serving dynamics.

free parameters (1)

learned efficiency term
Adjustment factor in the roofline kernel model; described as learned and required for the reported accuracy gains.

axioms (1)

domain assumption: LLM inference execution can be decomposed into independent roofline kernel, communication, and host-overhead models whose composition via discrete-event scheduling reproduces end-to-end behavior.
Invoked to justify the simulator architecture and the two prediction tiers.

Reference graph

Works this paper leans on

57 extracted references · 29 canonical work pages · 13 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, et al . 2024. Phi-4 Technical Report. arXiv:2412.08905 [cs.CL] doi:10.48550/arXiv.2412.08905

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.08905 2024
[2]

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S Gulavani, Ramachandran Ramjee, and Alexey Tumanov. 2024. Vidur: A Large-Scale Simulation Framework for Llm Inference.Proceedings of Machine Learning and Systems6 (2024), 351–366

2024
[3]

Gulavani, Alexey Tumanov, and Ramachandran Ramjee

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwa- tra, Bhargav S. Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. arXiv:2403.02310 [cs.LG] doi:10.48550/arXiv.2403.02310

work page doi:10.48550/arxiv.2403.02310 2024
[4]

Gulavani, and Ramachandran Ramjee

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. 2023. SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills. arXiv:2308.16369 [cs.LG] doi:10. 48550/arXiv.2308.16369

Pith/arXiv arXiv 2023
[5]

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. Gqa: Training Generalized Multi-Query Trans- former Models from Multi-Head Checkpoints.arXiv preprint arXiv:2305.13245 (2023). arXiv:2305.13245

Pith/arXiv arXiv 2023
[6]

Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, et al. 2024. Deepseek Llm: Scaling Open-Source Language Models with Longtermism.arXiv preprint arXiv:2401.02954(2024). arXiv:2401.02954

Pith/arXiv arXiv 2024
[7]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. 2020. Language Models Are Few-Shot Learners. InProceedings of the 34th International Conference on Neural Information Processing Systems (NIPS ’20). Curran Associates Inc., Red Hook, NY, USA, 1877–1901

2020
[8]

Jaehong Cho, Hyunmin Choi, and Jongse Park. 2025. LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure.IEEE Computer Architecture Letters24, 2 (July 2025), 361–364. arXiv:2511.07229 [cs] doi:10.1109/LCA.2025.3628325

work page doi:10.1109/lca.2025.3628325 2025
[9]

Tri Dao. 2023. Flashattention-2: Faster Attention with Better Parallelism and Work Partitioning.arXiv preprint arXiv:2307.08691(2023). arXiv:2307.08691

Pith/arXiv arXiv 2023
[10]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135 [cs.LG] doi:10.48550/arXiv.2205.14135

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2205.14135 2022
[11]

Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, et al. 2024. Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. arXiv:2403.19708 [cs.CL] doi:10.48550/arXiv.2403.19708

work page doi:10.48550/arxiv.2403.19708 2024
[12]

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2024. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. arXiv:2311.04934 [cs.CL] doi:10.48550/arXiv.2311.04934

work page doi:10.48550/arxiv.2311.04934 2024
[13]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, et al. 2024. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783(2024). arXiv:2407.21783

Pith/arXiv arXiv 2024
[14]

Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, et al. 2023. Text- books Are All You Need. arXiv:2306.11644 [cs.CL] doi:10.48550/arXiv.2306.11644

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.11644 2023
[15]

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guant- ing Chen, Xiao Bi, et al. 2024. DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence. arXiv:2401.14196 [cs.SE] doi:10.48550/arXiv.2401.14196

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14196 2024
[16]

Ke Hong, Xiuhong Li, Lufang Chen, Qiuli Mao, Guohao Dai, Xuefei Ning, Shengen Yan, Yun Liang, et al. 2025. SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling. InEighth Conference on Machine Learning and Systems

2025
[17]

Hugging Face. 2023. Text Generation Inference

2023
[18]

Saki Imai, Rina Nakazawa, Marcelo Amaral, Sunyanan Choochotkaew, and Tat- suhiro Chiba. 2024. Predicting LLM Inference Latency: A Roofline-Driven ML Method. InAnnual Conference on Neural Information Processing Systems

2024
[19]

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P. Scarpazza. 2018. Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv:1804.06826 [cs.DC] doi:10.48550/arXiv.1804.06826

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1804.06826 2018
[20]

Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Deven- dra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, et al
[21]

Mistral 7B

Mistral 7B. arXiv:2310.06825 [cs.CL] doi:10.48550/arXiv.2310.06825

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.06825
[22]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, et al. 2023. Efficient Memory Man- agement for Large Language Model Serving with Pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626

2023
[23]

Seonho Lee, Amar Phanishayee, and Divya Mahajan. 2025. Forecasting GPU Performance for Deep Learning Training and Inference. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 493–508. arXiv:2407.13853 [cs] doi:10.1145/3669940.3707265

work page doi:10.1145/3669940.3707265 2025
[24]

Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. {InfiniGen}: Efficient Generative Inference of Large Language Models with Dynamic {KV} Cache Management. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172

2024
[25]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive Nlp Tasks.Advances in neural information processing systems33 (2020), 9459–9474

2020
[26]

Yi-Chien Lin, Woosuk Kwon, Ronald Pineda, and Fanny Nina Paravecino. 2025. APEX: An Extensible and Dynamism-Aware Simulator for Automated Parallel Execution in LLM Serving. arXiv:2411.17651 [cs] doi:10.48550/arXiv.2411.17651

work page doi:10.48550/arxiv.2411.17651 2025
[27]

Amama Mahmood, Junxiang Wang, Bingsheng Yao, Dakuo Wang, and Chien- Ming Huang. 2023. Llm-Powered Conversational Voice Assistants: Interaction Patterns, Opportunities, Challenges, and Design Guidelines.arXiv preprint arXiv:2309.13879(2023). arXiv:2309.13879

arXiv 2023
[28]

Microsoft, Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, et al. 2025. Phi-4-Mini Tech- nical Report: Compact yet Powerful Multimodal Language Models via Mixture- of-LoRAs. arXiv:2503.01743 [cs.CL] doi:10.48550/arXiv.2503.01743

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.01743 2025
[29]

NVIDIA Corporation. 2023. Matrix Multiplication Background User’s Guide. NVIDIA Deep Learning Performance Documentation, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix- multiplication/index.html. Accessed: 2026-06-25

2023
[30]

NVIDIA Corporation. 2023. TensorRT-LLM: NVIDIA’s Inference Optimization Library

2023
[31]

NVIDIA Corporation. 2024. NCCL: Optimized Primitives for Collective Multi- GPU Communication. https://github.com/NVIDIA/nccl. Accessed: 2026-06-25

2024
[32]

NVIDIA Corporation. 2024. NVIDIA CUDA Toolkit, Version 12.x

2024
[33]

Pitch Patarasuk and Xin Yuan. 2009. Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations.J. Parallel and Distrib. Comput.69, 2 (Feb. 2009), 117–124. doi:10.1016/j.jpdc.2008.09.002

work page doi:10.1016/j.jpdc.2008.09.002 2009
[34]

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient Generative LLM Infer- ence Using Phase Splitting. In2024 ACM/IEEE 51st Annual International Sympo- sium on Computer Architecture (ISCA). IEEE, Buenos Aires, Argentina, 118–132. doi:10.1109/ISCA59077.2024.00019 arXiv Prepr...

work page doi:10.1109/isca59077.2024.00019 2024
[35]

Sparks, and Ameet Talwalkar

Hang Qi, Evan R. Sparks, and Ameet Talwalkar. 2017. Paleo: A Performance Model for Deep Neural Networks. InInternational Conference on Learning Repre- sentations

2017
[36]

Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. arXiv:2407.00079 [cs.DC] doi:10.48550/arXiv.2407. 00079

work page doi:10.48550/arxiv.2407 2025
[37]

Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, et al. 2025. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] doi:10.48550/arXiv.2412.15115

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2025
[38]

Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna
[39]

In2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

ASTRA-SIM: Enabling SW/HW Co-Design Exploration for Distributed DL Training Platforms. In2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 81–92. doi:10.1109/ISPASS48437.2020.00018

work page doi:10.1109/ispass48437.2020.00018 2020
[40]

James Reed, Zachary DeVito, Horace He, Ansley Ussery, and Jason Ansel. 2022. Torch. Fx: Practical Program Capture and Transformation for Deep Learning in Python.Proceedings of Machine Learning and Systems4 (2022), 638–651

2022
[41]

Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. 2024. Flashattention-3: Fast and Accurate Attention with Asynchrony and Low-Precision.Advances in Neural Information Processing Systems37 (2024), 68658–68685

2024
[42]

Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head Is All You Need. arXiv:1911.02150 [cs.NE] doi:10.48550/arXiv.1911.02150

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1911.02150 2019
[43]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL] doi:10. 48550/arXiv.1909.08053

Pith/arXiv arXiv 2020
[44]

Wei Sun, Ang Li, Tong Geng, Sander Stuijk, and Henk Corporaal. 2023. Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numeric Behaviors. IEEE Transactions on Parallel and Distributed Systems34, 1 (Jan. 2023), 246–261. arXiv:2206.02874 [cs.AR] doi:10.1109/TPDS.2022.3217824

work page doi:10.1109/tpds.2022.3217824 2023
[45]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, et al. 2024. Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118 [cs.CL] doi:10.48550/arXiv.2408.00118

[47]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2023. Attention Is All You Need. arXiv:1706.03762 [cs] doi:10.48550/arXiv.1706.03762

[48]

Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 52, 4 (April 2009), 65–76. doi:10.1145/1498765.1498785

[49]

William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 283–294. arXiv:2303.14006 [cs] doi:10.1109/ISPASS57527.2023.00035

[50]

Feiyang Wu, Zhuohang Bian, Guoyang Duan, Tianle Xu, Junchi Wu, Teng Ma, Yongqiang Yao, Ruihao Gong, et al. 2025. TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems. arXiv:2503.08415 [cs] doi:10.48550/arXiv.2503.08415

[51]

Tianhao Xu, Yiming Liu, Xianglong Lu, Yijia Zhao, Xuting Zhou, Aichen Feng, Yiyi Chen, Yi Shen, et al. 2026. AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving. arXiv:2601.06288 [cs.LG] doi:10.48550/arXiv.2601.06288

[52]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] doi:10.48550/arXiv.2505.09388

[53]

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. 2022. Orca: A Distributed Serving System for Transformer-Based Generative Models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 521–538. https://www.usenix.org/conference/osdi22/presentation/yu

[54]

Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. 2021. Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21). 503–521

[55]

Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, and Yunxin Liu. 2021. nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices. In Proceedings of the 19th Annual International Conference on Mobile Systems, Applications, and Services. ACM, Virtual Event Wisconsin, 81–93. doi:10.1145/3458864.3467882

[56]

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, et al. 2024. SGLang: Efficient Execution of Structured Language Model Programs. arXiv:2312.07104 [cs.AI] doi:10.48550/arXiv.2312.07104

[57]

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. arXiv:2401.09670 [cs.DC] doi:10.48550/arXiv.2401.09670
