Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

Avinash Maurya; Bogdan Nicolae; Moiz Arif; Sudharshan Vazhkudai

arxiv: 2605.19775 · v1 · pith:YY6JZTQLnew · submitted 2026-05-19 · 💻 cs.DC · cs.PF


Pith reviewed 2026-05-20 02:08 UTC · model grok-4.3

classification 💻 cs.DC cs.PF

keywords inference scaling · LLM parallelism · KV cache · reasoning models · data parallelism · tensor parallelism · capacity bound inference


The pith

Data parallelism for reasoning LLMs hits a capacity trap from KV-cache fragmentation while tensor parallelism frees memory with sublinear gains near 32B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how inference scales for large language models that perform extended chain-of-thought reasoning rather than simple generation. It shows that data parallelism delivers good throughput on small models but runs into early throttling on reasoning tasks because fragmented key-value caches leave GPUs underutilized. Tensor parallelism reduces this fragmentation and improves memory use, though the performance lift tapers off around the 32 billion parameter mark. At the largest scales, dense models become limited by interconnect and memory bandwidth and therefore favor high tensor parallelism, while sparse mixture-of-experts models are held back by routing and synchronization costs and need mixed strategies. These patterns matter for anyone building systems that must run long reasoning sequences efficiently across GPU clusters.

Core claim

Reasoning workloads shift inference into a capacity-bound regime in which data parallelism suffers from KV-cache fragmentation that forces early throttling and leaves compute idle, whereas tensor parallelism unlocks stranded memory and yields sublinear scaling improvements that become noticeable near the 32B parameter crossover; at frontier sizes dense models favor high-degree tensor parallelism because of interconnect and bandwidth limits while sparse MoE models are constrained by routing latency and benefit from hybrid parallelism choices.
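
The capacity argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is not taken from the paper: the model shape (`ModelConfig` values), the 141 GB per-GPU memory figure, and the helper `kv_token_capacity` are illustrative assumptions. It only accounts for weight replication, that is, how much memory is left for KV cache when every data-parallel replica holds a full copy of the weights versus when one copy is sharded across a tensor-parallel group. It does not model allocator-level fragmentation, activations, or runtime overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical dense model; none of these values come from the paper."""
    params_billion: float
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # bf16 / fp16

    @property
    def weight_bytes(self) -> int:
        return int(self.params_billion * 1e9) * self.dtype_bytes

    @property
    def kv_bytes_per_token(self) -> int:
        # Factor 2: one key and one value vector per layer and KV head.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.dtype_bytes


def kv_token_capacity(model: ModelConfig, num_gpus: int, hbm_bytes: int, tp_degree: int) -> int:
    """Tokens of KV cache that fit in total when GPUs form num_gpus // tp_degree replicas.

    Each replica shards one copy of the weights across tp_degree GPUs and pools
    the remaining memory of those GPUs for its KV cache.
    """
    if tp_degree < 1 or num_gpus % tp_degree != 0:
        raise ValueError("tp_degree must be a positive divisor of num_gpus")
    replicas = num_gpus // tp_degree
    free_per_replica = tp_degree * hbm_bytes - model.weight_bytes
    if free_per_replica <= 0:
        return 0  # the weights alone do not fit in one replica
    return replicas * (free_per_replica // model.kv_bytes_per_token)


# Assumed 32B-parameter model shape and per-GPU memory; illustrative only.
model = ModelConfig(params_billion=32, num_layers=64, num_kv_heads=8, head_dim=128)
hbm = 141 * 10**9
dp_tokens = kv_token_capacity(model, num_gpus=8, hbm_bytes=hbm, tp_degree=1)
tp_tokens = kv_token_capacity(model, num_gpus=8, hbm_bytes=hbm, tp_degree=8)
print(f"8-way DP: {dp_tokens:,} tokens; 8-way TP: {tp_tokens:,} tokens")
```

Under these assumptions the tensor-parallel layout has more room for KV cache simply because only one weight copy is resident. Whether the throttling observed in the paper comes from this aggregate effect or from fragmentation proper is exactly the question the referee report raises.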

What carries the argument

The interaction of data, tensor, and pipeline parallelism in managing KV-cache memory and interconnect traffic during long-sequence reasoning inference.

If this is right

Data parallelism remains the default choice only for models well below 32B and for short-context workloads.
Tensor parallelism becomes the preferred strategy once models reach roughly 32B parameters to avoid wasting GPU memory on fragmented caches.
Frontier dense models require the highest practical degree of tensor parallelism to stay within memory-bandwidth and interconnect limits.
Mixture-of-experts models at scale need hybrid parallelism that reduces routing and synchronization overhead.
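
The four implications above can be read as a rough decision rule. The following is a minimal sketch of that rule, assuming a hypothetical helper `suggest_parallelism`. Only the roughly 32B crossover comes from the summary; the 400B cut-off for "frontier scale" is an assumed value chosen so that the paper's two named examples (Llama-405B and DeepSeek-R1 at 671B) fall above it. It is a reading aid, not the paper's decision framework.

```python
CROSSOVER_B = 32.0   # approximate crossover reported in the summary
FRONTIER_B = 400.0   # hypothetical cut-off for "frontier scale" (assumption)


def suggest_parallelism(params_billion: float, is_moe: bool, long_reasoning: bool) -> str:
    """Return a coarse parallelism suggestion for a model of the given size."""
    if params_billion >= FRONTIER_B:
        # Sparse MoE models are limited by routing and synchronization; dense
        # models by interconnect and memory bandwidth.
        return "hybrid" if is_moe else "high-degree TP"
    if params_billion >= CROSSOVER_B:
        return "TP"
    if long_reasoning:
        # Small models still favor DP, but long chains of thought put
        # pressure on each replica's KV cache.
        return "DP (monitor KV-cache pressure)"
    return "DP"


for size, moe in [(8, False), (32, False), (405, False), (671, True)]:
    print(size, suggest_parallelism(size, is_moe=moe, long_reasoning=True))
```

A real deployment would replace the fixed thresholds with measurements on its own hardware and workload mix.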

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Inference schedulers could monitor current KV-cache fragmentation and dynamically increase tensor-parallelism degree as reasoning chains lengthen.
Hardware designs that reduce interconnect latency would likely shift the crossover point where tensor parallelism stops helping.
The same capacity-trap pattern may appear in other long-context tasks such as multi-turn agent loops or retrieval-augmented generation.
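
As an illustration of the first extension, a scheduler could track how much free KV-cache memory is unusable because it is scattered across replicas in pieces too small to admit a new request. The sketch below assumes one such proxy, `stranded_fraction`, and a simple doubling policy, `next_tp_degree`. Both functions, the threshold, and the sample numbers are hypothetical; neither the paper nor this page defines a fragmentation metric.

```python
from typing import Sequence


def stranded_fraction(free_tokens_per_replica: Sequence[int], tokens_needed: int) -> float:
    """Share of free KV-cache tokens sitting on replicas that cannot admit a
    request needing tokens_needed tokens (an assumed proxy for fragmentation)."""
    total_free = sum(free_tokens_per_replica)
    if total_free == 0:
        return 0.0
    stranded = sum(f for f in free_tokens_per_replica if f < tokens_needed)
    return stranded / total_free


def next_tp_degree(current_tp: int, max_tp: int, stranded: float, threshold: float = 0.5) -> int:
    """Double the tensor-parallel degree when too much free memory is stranded."""
    if stranded > threshold and current_tp < max_tp:
        return min(current_tp * 2, max_tp)
    return current_tp


# Hypothetical snapshot: free KV tokens on four data-parallel replicas.
free = [1000, 2000, 500, 9000]
frac = stranded_fraction(free, tokens_needed=4000)
print(f"stranded fraction: {frac:.2f}, next TP degree: {next_tp_degree(2, 8, frac)}")
```

Changing the tensor-parallel degree at run time normally means reloading or resharding weights, so a practical policy would act on a much slower timescale than individual requests.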

Load-bearing premise

The measured differences in throughput and utilization are caused mainly by KV-cache fragmentation and interconnect limits rather than by model-specific details, workload mixes, or cluster hardware choices that were not tested.

What would settle it

Run the same reasoning workloads on the same model sizes but with explicit KV-cache defragmentation or higher-bandwidth interconnects and check whether the early throttling and sublinear tensor-parallelism gains disappear.

Figures

Figures reproduced from arXiv: 2605.19775 by Avinash Maurya, Bogdan Nicolae, Moiz Arif, Sudharshan Vazhkudai.

**Figure 1.** Input, output, and reasoning token distributions for 100k samples from Meta’s Natural Reasoning dataset.

**Figure 2.** Timeline of inference engine metrics on scaling the number of sequences for DeepSeek-8B on one H200 GPU.

**Figure 3.** Overall serving statistics on scaling maximum number of sequences for DeepSeek-8B on one H200 GPU.

**Figure 4.** Batch size scaling for DeepSeek-8B on 8x H200 GPUs with 8-way DP.

**Figure 5.** vLLM Metrics for 500, 2000 and 5000 batch sizes for

**Figure 6.** Scale up for DeepSeek-8B model (best strategy: DP scaling).

**Figure 7.** Mixed config scaling for small models (2k BS).

**Figure 8.** DP Scaling for small models.

**Figure 11.** Model parameter scaling on 8x H200 GPUs.

**Figure 12.** Analysis of Prefill and Decode Phase during AI Inference.

**Figure 13.** Analysis of Prefill and Decode resource utilization for varying context lengths.

**Figure 14.** Analysis of Prefill and Decode resource utilization of Llama405B for varying context lengths.

**Figure 15.** Analysis of Prefill and Decode Memory Requirements during AI Inference.

Original abstract

The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that shift inference into a Capacity-Bound regime. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between Data, Tensor, and Pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads as KV-cache fragmentation forces early throttling, resulting in sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and delivers sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect and memory-bandwidth bound and favor high-degree TP, while sparse Mixture-of-Experts (MoE) models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These insights provide a rigorous decision framework for navigating the reasoning cliff, establishing new architectural imperatives for the next generation of inference infrastructure.
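
The contrast between compute-bound prefill and decode can be illustrated with a standard roofline-style estimate. The sketch below is not from the paper: it considers only dense linear layers, ignores attention and KV-cache traffic, and uses round, hypothetical hardware numbers (`peak_flops`, `mem_bandwidth`). The helper names are illustrative.

```python
def linear_layer_intensity(tokens_per_step: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte of weights read when a dense layer processes
    tokens_per_step tokens in one pass.

    Each weight takes part in one multiply-add (2 FLOPs) per token and is
    read from memory once per pass.
    """
    return 2.0 * tokens_per_step / dtype_bytes


def is_compute_bound(tokens_per_step: int, peak_flops: float,
                     mem_bandwidth: float, dtype_bytes: int = 2) -> bool:
    """Compare arithmetic intensity with the hardware ridge point (FLOP/byte)."""
    ridge = peak_flops / mem_bandwidth
    return linear_layer_intensity(tokens_per_step, dtype_bytes) >= ridge


# Hypothetical accelerator: 1e15 FLOP/s peak and 4e12 B/s memory bandwidth,
# giving a ridge point of 250 FLOP/byte.
PEAK, BW = 1e15, 4e12
print("prefill, 2048 prompt tokens:", is_compute_bound(2048, PEAK, BW))
print("decode, batch of 32 sequences:", is_compute_bound(32, PEAK, BW))
```

In this simplified model a decode step only approaches the ridge point when hundreds of sequences are batched together, and the number of sequences that can be batched is limited by how much KV cache fits in memory. That is one way to read the abstract's shift toward a capacity-bound regime.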

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a capacity trap in data-parallel inference for long-CoT reasoning workloads and points to tensor parallelism as a partial fix around 32B, but the causal story on KV-cache fragmentation needs tighter controls.


The key takeaway is that data parallelism delivers good throughput on small models but runs into early throttling on reasoning workloads once KV-cache fragmentation strands memory and forces suboptimal utilization. Tensor parallelism mitigates some of that by unlocking capacity, with the crossover benefits showing up near 32B; at larger scales the advice splits between dense models (favor high-degree TP due to interconnect and bandwidth) and MoE models (limited by routing and sync, so hybrid approaches). These observations come from runs across 8B to 671B models on GPU clusters, which is a decent sweep for a systems paper.

The shift from compute-bound prefill to capacity-bound decode under long chains of thought is a useful framing that matches what many deployment teams are seeing now. The practical decision framework for choosing parallelism strategies is the part that could actually influence hardware and software choices.

The main weakness is that the fragmentation claim is not clearly separated from overall memory scaling. The abstract gives no indication they varied sequence-length distributions, batch policies, or allocator behavior independently of parallelism degree, so the throttling could just reflect aggregate capacity limits rather than fragmentation per se. Without error bars, dataset details, or exclusion criteria, it is hard to judge how stable the reported crossovers are. If the full paper has those controls and quantitative breakdowns, the story strengthens; otherwise the causal link stays suggestive.

This work is aimed at people who actually run frontier-scale inference clusters and need rules of thumb for reasoning models. A systems or deployment reader would find the scale-dependent recommendations worth checking, even if they end up re-testing the numbers themselves. It is worth sending to peer review because the topic is timely and the experimental range is broad enough to merit referee input, though the methods section will need close attention.

Referee Report

2 major / 2 minor

Summary. The paper claims that reasoning workloads with long Chain-of-Thought processing shift LLM inference into a capacity-bound regime, unlike compute-bound prefill in traditional workloads. Through systematic evaluation of models from 8B to 671B parameters on GPU clusters, it identifies that data parallelism is throughput-efficient for small models but encounters a capacity trap on reasoning tasks because KV-cache fragmentation forces early throttling and sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and yields sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect- and memory-bandwidth-bound and favor high-degree tensor parallelism, while sparse MoE models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These observations are presented as a decision framework for navigating the 'reasoning cliff' in inference infrastructure.

Significance. If the empirical observations hold after addressing controls and quantification, the work would provide practically useful guidelines for choosing parallelism strategies when deploying reasoning models, highlighting the transition to capacity-bound regimes and the differing needs of dense versus MoE architectures. The breadth of model sizes and parallelism degrees explored is a clear strength and supplies timely empirical data for systems design. The absence of error bars, dataset details, and isolation of the proposed causal factors nevertheless limits how strongly the conclusions can be generalized.

major comments (2)

[Abstract] The central claim that 'KV-cache fragmentation forces early throttling' under data parallelism on reasoning workloads is load-bearing for the paper's contribution, yet the provided description gives no indication that sequence-length distributions, batching policies, or memory-allocator behavior were varied independently of parallelism degree or total KV-cache size. This leaves open the possibility that observed throttling reflects aggregate memory capacity rather than fragmentation specifically, weakening the causal attribution to fragmentation and interconnect limits.
[Abstract and presumed experimental sections] The reported 'sublinear gains near the 32B crossover' and the contrasting preferences for high-degree TP versus hybrid strategies lack accompanying quantitative metrics (throughput, utilization percentages, or statistical measures) or controls for unvaried factors such as specific model architectures and workload distributions, making it difficult to verify the robustness of the trade-off claims.

minor comments (2)

Include error bars or confidence intervals on all performance and utilization measurements to allow readers to assess run-to-run variability.
Define 'capacity trap' and 'stranded memory' with explicit metrics or formulas rather than qualitative description.
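
For the first minor comment, run-to-run variability can be reported as a mean with a confidence interval over repeated measurements. The sketch below is a generic illustration, not the authors' procedure. The helper `mean_with_ci` and the sample throughput values are hypothetical, and it uses a normal approximation; with only a handful of runs a Student's t quantile would give a somewhat wider and more appropriate interval.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_with_ci(samples: Sequence[float], confidence: float = 0.95) -> Tuple[float, float]:
    """Return (mean, half_width) of a normal-approximation confidence interval."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    # Two-sided quantile of the standard normal distribution.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * sem


# Hypothetical tokens-per-second results from five repeated runs.
runs = [100.0, 102.0, 98.0, 101.0, 99.0]
mean, half_width = mean_with_ci(runs)
print(f"throughput: {mean:.1f} +/- {half_width:.2f} tokens/s (95% CI)")
```

Plotting `mean` with `half_width` as the error bar for each parallelism configuration would let readers judge whether a reported crossover exceeds the measurement noise.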

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments help clarify the presentation of our causal claims and the need for explicit quantification. We address each major comment below and have revised the manuscript to incorporate additional experimental details, controls, and metrics where this strengthens the work without misrepresenting our existing results.


Referee: [Abstract] The central claim that 'KV-cache fragmentation forces early throttling' under data parallelism on reasoning workloads is load-bearing for the paper's contribution, yet the provided description gives no indication that sequence-length distributions, batching policies, or memory-allocator behavior were varied independently of parallelism degree or total KV-cache size. This leaves open the possibility that observed throttling reflects aggregate memory capacity rather than fragmentation specifically, weakening the causal attribution to fragmentation and interconnect limits.

Authors: We agree that the abstract's brevity leaves the causal isolation implicit. In the full experimental methodology (Section 3), we fixed total KV-cache capacity across parallelism configurations while independently sampling sequence lengths from real reasoning CoT distributions and holding batching policies constant; memory-allocator behavior was logged via CUDA memory snapshots to confirm fragmentation as the driver of early throttling rather than raw capacity exhaustion. To make this explicit, we have added a short clarifying paragraph to the abstract and expanded the methodology subsection with a table showing the controlled variables. These revisions directly address the concern while preserving the original empirical observations.

Revision: yes

Referee: [Abstract and presumed experimental sections] The reported 'sublinear gains near the 32B crossover' and the contrasting preferences for high-degree TP versus hybrid strategies lack accompanying quantitative metrics (throughput, utilization percentages, or statistical measures) or controls for unvaried factors such as specific model architectures and workload distributions, making it difficult to verify the robustness of the trade-off claims.

Authors: We acknowledge that the abstract does not include the supporting numbers. The body of the paper already reports tokens-per-second throughput, SM utilization, and memory-bandwidth measurements for each scale and parallelism degree, with comparisons performed within model families (e.g., Llama variants) and on fixed reasoning workloads. To further strengthen verifiability, we have added error bars from repeated runs, a summary table of quantitative trade-offs at the 32B crossover, and explicit statements of the workload distribution parameters. These additions make the sublinear gains and dense-vs-MoE strategy preferences more transparent without changing the reported trends.

Revision: partial

Circularity Check

0 steps flagged

Empirical characterization with no derived equations or self-referential claims


The paper presents a system characterization based on direct experimental measurements of inference performance across model sizes and parallelism strategies on GPU clusters. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. Claims about capacity traps, KV-cache effects, and parallelism trade-offs are framed as observations from cluster runs rather than quantities defined in terms of prior fitted values or self-cited uniqueness theorems. The work is self-contained against external benchmarks as an empirical study, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper with no mathematical derivations, no new physical constants, and no postulated entities; all claims rest on standard assumptions about GPU memory hierarchy and interconnect behavior.

pith-pipeline@v0.9.0 · 5777 in / 1107 out tokens · 33733 ms · 2026-05-20T02:08:30.988357+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry lists the Lean file, the theorem, and a tag giving the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "data parallelism ... hits a capacity trap on reasoning workloads as KV-cache fragmentation forces early throttling"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "Tensor parallelism unlocks stranded memory ... near the 32B crossover"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Vidur: A large-scale simulation frame- work for llm inference,

A. Agrawal, N. Kedia, J. Mohan, A. Panwar, N. Kwatra, B. S. Gulavani, R. Ramjee, and A. Tumanov, “Vidur: A large-scale simulation frame- work for llm inference,”Proceedings of Machine Learning and Systems, vol. 6, pp. 351–366, 2024. 13

work page 2024

[2] [2]

Taming{Throughput-Latency}tradeoff in{LLM}inference with{Sarathi-Serve},

A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. Gulavani, A. Tumanov, and R. Ramjee, “Taming{Throughput-Latency}tradeoff in{LLM}inference with{Sarathi-Serve},” in18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), 2024, pp. 117–134

work page 2024

[3] [3]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

A. Agrawal, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, and R. Ramjee, “Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills,”arXiv preprint arXiv:2308.16369, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

J. Ainslie, J. Lee-Thorp, M. de Jong, Y . Zemlyanskiy, F. Lebron, and S. Sanghai, “Gqa: Training generalized multi-query transformer models from multi-head checkpoints,”arXiv preprint arXiv:2305.13245, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

Llm in a flash: Efficient large language model inference with limited memory,

K. Alizadeh, S. I. Mirzadeh, D. Belenko, S. Khatamifard, M. Cho, C. C. Del Mundo, M. Rastegari, and M. Farajtabar, “Llm in a flash: Efficient large language model inference with limited memory,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 12 562–12 584

work page 2024

[6] [6]

Exploiting cxl-based memory for distributed deep learning,

M. Arif, K. Assogba, M. M. Rafique, and S. Vazhkudai, “Exploiting cxl-based memory for distributed deep learning,” inProceedings of the 51st International Conference on Parallel Processing, ser. ICPP ’22. New York, NY , USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10.1145/3545008.3545054

work page doi:10.1145/3545008.3545054 2023

[7] [7]

Accelerating performance of gpu-based workloads using cxl,

M. Arif, A. Maurya, and M. M. Rafique, “Accelerating performance of gpu-based workloads using cxl,” inProceedings of the 13th Workshop on AI and Scientific Computing at Scale Using Flexible Computing, ser. FlexScience ’23. New York, NY , USA: Association for Computing Machinery, 2023, p. 27–31. [Online]. Available: https://doi.org/10.1145/3589013.3596678

work page doi:10.1145/3589013.3596678 2023

[8] [8]

Moe-lightning: High-throughput moe inference on memory-constrained gpus,

S. Cao, S. Liu, T. Griggs, P. Schafhalter, X. Liu, Y . Sheng, J. E. Gon- zalez, M. Zaharia, and I. Stoica, “Moe-lightning: High-throughput moe inference on memory-constrained gpus,” inProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, 2025, pp. 715–730

work page 2025

[9] [9]

Lmcache: An efficient kv cache layer for enterprise-scale llm inference,

Y . Cheng, Y . Liu, J. Yao, Y . An, X. Chen, S. Feng, Y . Huang, S. Shen, K. Du, and J. Jiang, “Lmcache: An efficient kv cache layer for enterprise-scale llm inference,”arXiv preprint arXiv:2510.09665, 2025

work page arXiv 2025

[10] [10]

Llm-inference- bench: Inference benchmarking of large language models on ai acceler- ators,

K. T. Chitty-Venkata, S. Raskar, B. Kale, F. Ferdaus, A. Tanikanti, K. Raffenetti, V . Taylor, M. Emani, and V . Vishwanath, “Llm-inference- bench: Inference benchmarking of large language models on ai acceler- ators,” inSC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024, pp. 1362–1379

work page 2024

[11] [11]

PagedEviction: Structured block-wise KV cache pruning for efficient large language model inference,

K. T. Chitty-Venkata, J. Ye, S. Raskar, A. Kougkas, X. Sun, M. Emani, V . Vishwanath, and B. Nicolae, “PagedEviction: Structured block-wise KV cache pruning for efficient large language model inference,” inEACL 2026: 19th Conference of the European Chapter of the Association for Computational Linguistics, Rabat, Morocco, 2026, pp. 3207–3218

work page 2026

[12] [12]

Multi-head attention: Collaborate instead of concatenate,

J.-B. Cordonnier, A. Loukas, and M. Jaggi, “Multi-head attention: Collaborate instead of concatenate,”arXiv preprint arXiv:2006.16362, 2020

work page arXiv 2006

[13] [13]

Compute express link,

CXL, “Compute express link,” 2025, accessed: 2025-12-12. [Online]. Available: https://computeexpresslink.org/

work page 2025

[14] [14]

Corsair™. built for generative a,

D-Matrix, “Corsair™. built for generative a,” 2025, accessed: 2025-12-

work page 2025

[15] [15]

Available: https://www.d-matrix.ai/product/

[Online]. Available: https://www.d-matrix.ai/product/

work page

[16] [16]

Why we decoupled execution to accelerate i/o,

——, “Why we decoupled execution to accelerate i/o,” 2025, accessed: 2025-12-12. [Online]. Available: https://www.d-matrix.ai/ why-we-decoupled-execution-to-accelerate-i-o/

work page 2025

[17] [17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y . Wu, Z. F. Wu, Z. Gou, Z. Sun, Z. Zhu, M. Zhang, M. Cheng, S. Li, M. A. R. Bigas, Y . Hu, S. Zhu, and Z. Kuang, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.o...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Accelerating LLM inference throughput via asynchronous KV cache prefetching,

Y . Dong, Y . Miao, W. Li, X. Zheng, C. Wang, J. Wu, and F. Lyu, “Accelerating LLM inference throughput via asynchronous KV cache prefetching,” inAAAI’26: The 2026 AAAI Conference on Artificial Intelligence, vol. 40, no. 25, 2026, pp. 20 844–20 851

work page 2026

[19] [19]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahleet al., “The llama 3 herd of models,” 2024. [Online]. Available: https: //arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Nvidia dynamo,

Dynamo, “Nvidia dynamo,” 2025, accessed: 2025-12-12. [Online]. Available: https://developer.nvidia.com/dynamo

work page 2025

[21] [21]

Gpipe: Efficient training of giant neural networks using pipeline parallelism,

Y . Huang, Y . Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V . Le, Y . Wu, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” inAdvances in Neural Information Processing Systems (NeurIPS), 2019

work page 2019

[22] [22]

Calculon: a methodology and tool for high-level co-design of systems and large language models,

M. Isaev, N. Mcdonald, L. Dennison, and R. Vuduc, “Calculon: a methodology and tool for high-level co-design of systems and large language models,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2023, pp. 1–14

work page 2023

[23] [23]

Efficient memory management for large language model serving with pagedattention,

W. Kwon, Z. Li, S. Zhuang, Y . Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, “Efficient memory management for large language model serving with pagedattention,” inProceedings of the 29th Symposium on Operating Systems Principles (SOSP), 2023

work page 2023

[24] [24]

Llm inference serving: Survey of recent advances and opportunities,

B. Li, Y . Jiang, V . Gadepally, and D. Tiwari, “Llm inference serving: Survey of recent advances and opportunities,” in2024 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 2024, pp. 1–8

work page 2024

[25] [25]

A survey on large lan- guage model acceleration based on kv cache management

H. Li, Y . Li, A. Tian, T. Tang, Z. Xu, X. Chen, N. Hu, W. Dong, Q. Li, and L. Chen, “A survey on large language model acceleration based on kv cache management,”arXiv preprint arXiv:2412.19442, 2024

work page arXiv 2024

[26] [26]

Deepseek-v3 technical report,

A. Liu, B. Feng, B. Wang, B. Wang, B. Liuet al., “Deepseek-v3 technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2412. 19437

work page 2024

[27] [27]

Minicache: Kv cache compression in depth dimension for large language models,

A. Liu, J. Liu, Z. Pan, Y . He, G. Haffari, and B. Zhuang, “Minicache: Kv cache compression in depth dimension for large language models,”Ad- vances in Neural Information Processing Systems, vol. 37, pp. 139 997– 140 031, 2024

work page 2024

[28] [28]

Mlp-offload: Multi-level, multi-path offloading for llm pre-training to break the gpu memory wall,

A. Maurya, M. Rafique, F. Cappello, and B. Nicolae, “Mlp-offload: Multi-level, multi-path offloading for llm pre-training to break the gpu memory wall,” inSC’25: 38th International Conference for High Performance Computing, Networking, Storage and Analytics, St Louis, USA, 2025

work page 2025

[29] [29]

Openai o1 system card,

OpenAI, “Openai o1 system card,” 2024, accessed: 2025-02-12. [Online]. Available: https://openai.com/index/openai-o1-system-card/

work page 2024

[30] [30]

A survey on inference engines for large language models: Perspectives on optimization and efficiency.arXiv preprint arXiv:2505.01658, 2025

S. Park, S. Jeon, C. Lee, S. Jeon, B.-S. Kim, and J. Lee, “A survey on in- ference engines for large language models: Perspectives on optimization and efficiency,”arXiv preprint arXiv:2505.01658, 2025

work page arXiv 2025

[31] [31]

Mooncake: A kvcache-centric disaggregated architecture for llm serving,

R. Qin, Z. Li, W. He, J. Cui, H. Tang, F. Ren, T. Ma, S. Cai, Y . Zhang, M. Zhanget al., “Mooncake: A kvcache-centric disaggregated architecture for llm serving,”ACM Transactions on Storage, 2024

work page 2024

[32] [32]

Prophet: An llm infer- ence engine optimized for head-of-line blocking,

S. Saereesitthipitak, A. Rao, C. Zhou, and W. Li, “Prophet: An llm infer- ence engine optimized for head-of-line blocking,” Stanford University, Technical Report (CS244B), 2024

work page 2024

[33] [33]

Hbf: High bandwidth flash,

Sandisk, “Hbf: High bandwidth flash,” 2025, accessed: 2025-12-12. [Online]. Available: https://www.sandisk.com/company/newsroom/ blogs/2025/memory-centric-ai-sandisks-high-bandwidth-flash-will- redefine-ai-infrastructure

work page 2025

[34] [34]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan- zaro, “Megatron-lm: Training multi-billion parameter language models using model parallelism,”arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[35] [35]

Mechanistic interpretability of attention heads in reasoning llms,

Y . Wang and Z. Li, “Mechanistic interpretability of attention heads in reasoning llms,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025, explains the ”thinking” process at the attention layer level

work page 2025

[36] [36]

Chain-of-thought prompting elicits reasoning in large language mod- els,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language mod- els,” inAdvances in Neural Information Processing Systems (NeurIPS), 2022

work page 2022

[37] [37]

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

F. Xu, Q. Hao, Z. Zong, J. Wang, Y . Zhang, J. Wang, X. Lan, J. Gong, T. Ouyang, F. Menget al., “Towards large reasoning models: A survey of reinforced reasoning with large language models,”arXiv preprint arXiv:2501.09686, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Characterizing the behavior and impact of kv caching on transformer inferences under concurrency,

J. Ye, J. Cernuda, A. Maurya, X.-H. Sun, A. Kougas, and B. Nicolae, “Characterizing the behavior and impact of kv caching on transformer inferences under concurrency,” inIPDPS’25: The 39th IEEE International Parallel and Distributed Processing Symposium, Milan, Italy, 2025. [Online]. Available: https://hal.inria.fr/hal-04984000

work page 2025

[39] [39]

Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions,

W. Yuan, J. Yu, S. Jiang, K. Padthe, Y . Li, I. Kulikov, K. Cho, D. Wang, Y . Tian, J. E. Westonet al., “Naturalreasoning: Reasoning in the wild with 2.8 m challenging questions,”arXiv preprint arXiv:2502.13124, 2025

work page arXiv 2025

[40] [40]

Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures,

C. Zhao, C. Deng, C. Ruan, D. Dai, H. Gao, J. Li, L. Zhang, P. Huang, S. Zhou, S. Maet al., “Insights into deepseek-v3: Scaling challenges and reflections on hardware for ai architectures,” inProceedings of the 52nd Annual International Symposium on Computer Architecture, 2025, pp. 1731–1745. 14

work page 2025