Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles
Pith reviewed 2026-05-20 02:08 UTC · model grok-4.3
The pith
Data parallelism for reasoning LLMs hits a capacity trap from KV-cache fragmentation while tensor parallelism frees memory with sublinear gains near 32B parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reasoning workloads shift inference into a capacity-bound regime. In that regime, data parallelism suffers from KV-cache fragmentation that forces early throttling and leaves compute idle, whereas tensor parallelism unlocks stranded memory and yields sublinear scaling improvements that become noticeable near the 32B-parameter crossover. At frontier sizes, dense models favor high-degree tensor parallelism because of interconnect and bandwidth limits, while sparse MoE models are constrained by routing latency and benefit from hybrid parallelism choices.
What carries the argument
The interaction of data, tensor, and pipeline parallelism in managing KV-cache memory and interconnect traffic during long-sequence reasoning inference.
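The memory side of this interaction can be illustrated with back-of-the-envelope arithmetic. The sketch below is not the paper's accounting: it assumes a dense transformer whose weights are replicated once per data-parallel replica and sharded once across a tensor-parallel group, and it ignores activations, allocator overhead, and fragmentation. `ModelShape`, `kv_tokens_per_group`, and the example values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical dense-transformer shape (illustrative, not from the paper)."""
    params_billion: float
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16 / bf16


def kv_bytes_per_token(m: ModelShape) -> int:
    # One key and one value vector (factor 2) per layer and per KV head.
    return 2 * m.layers * m.kv_heads * m.head_dim * m.bytes_per_value


def kv_tokens_per_group(m: ModelShape, gpu_mem_gib: float, tp_degree: int) -> int:
    """KV-cache tokens that fit in one group of `tp_degree` GPUs.

    Assumption: the weights are stored exactly once per group, so a
    group of size 1 is a plain data-parallel replica.
    """
    weight_bytes = m.params_billion * 1e9 * m.bytes_per_value
    free_bytes = tp_degree * gpu_mem_gib * GIB - weight_bytes
    if free_bytes <= 0:
        return 0  # the weights alone do not fit
    return int(free_bytes // kv_bytes_per_token(m))


if __name__ == "__main__":
    shape = ModelShape(params_billion=32.0, layers=64, kv_heads=8, head_dim=128)
    eight_replicas = 8 * kv_tokens_per_group(shape, 80.0, tp_degree=1)
    one_tp8_group = kv_tokens_per_group(shape, 80.0, tp_degree=8)
    print(kv_bytes_per_token(shape), eight_replicas, one_tp8_group)
```

Under these assumptions, eight GPUs run as one 8-way tensor-parallel group hold more KV-cache tokens than the same GPUs run as eight replicas, because the weights occupy memory once rather than eight times. This shows only the aggregate-capacity effect; it says nothing about fragmentation or interconnect cost.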
If this is right
- Data parallelism remains the default choice only for models well below 32B and for short-context workloads.
- Tensor parallelism becomes the preferred strategy once models reach roughly 32B parameters to avoid wasting GPU memory on fragmented caches.
- Frontier dense models require the highest practical degree of tensor parallelism to stay within memory-bandwidth and interconnect limits.
- Mixture-of-experts models at scale need hybrid parallelism that reduces routing and synchronization overhead.
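The four implications above can be read as a rough lookup rule for reasoning (long-sequence) workloads. The following is a minimal sketch of that reading, not a rule taken from the paper: the function name `suggest_parallelism` is hypothetical, the 32B crossover comes from the summary, and the 400B "frontier" cutoff is an assumed placeholder motivated only by the two frontier examples the abstract names (Llama-405B and DeepSeek-R1).

```python
def suggest_parallelism(
    params_billion: float,
    is_moe: bool,
    crossover_b: float = 32.0,   # crossover size quoted in the summary
    frontier_b: float = 400.0,   # assumed cutoff, NOT stated in the paper
) -> str:
    """Map model size and sparsity to the strategy the summary favors.

    Covers reasoning workloads only; short-context serving is out of scope.
    """
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    if params_billion >= frontier_b:
        # Frontier scale: dense and MoE models diverge.
        return "hybrid parallelism" if is_moe else "high-degree tensor parallelism"
    if params_billion >= crossover_b:
        return "tensor parallelism"
    return "data parallelism"


if __name__ == "__main__":
    for size, moe in [(8, False), (32, False), (405, False), (671, True)]:
        print(size, moe, suggest_parallelism(size, moe))
```

The hard thresholds are a simplification: the summary describes data parallelism as the default only "well below" 32B, so sizes just under the crossover are a gray zone that this sketch does not model.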
Where Pith is reading between the lines
- Inference schedulers could monitor current KV-cache fragmentation and dynamically increase tensor-parallelism degree as reasoning chains lengthen.
- Hardware designs that reduce interconnect latency would likely shift the crossover point where tensor parallelism stops helping.
- The same capacity-trap pattern may appear in other long-context tasks such as multi-turn agent loops or retrieval-augmented generation.
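The first of these speculative points, a scheduler that reacts to fragmentation, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the class name `TpDegreeAdvisor`, the 0.3 threshold, the sliding window, and the doubling policy are hypothetical, and the sketch only produces a recommendation. Actually changing the tensor-parallel degree of a running server would require resharding weights and KV cache, which is not modeled here.

```python
from collections import deque


class TpDegreeAdvisor:
    """Recommend a higher TP degree after sustained KV-cache fragmentation.

    Hypothetical policy: if every sample in the window exceeds the
    threshold, recommend doubling the degree, capped at `max_tp`.
    """

    def __init__(self, threshold: float = 0.3, window: int = 4, max_tp: int = 8):
        if window < 1 or max_tp < 1:
            raise ValueError("window and max_tp must be at least 1")
        self.threshold = threshold
        self.max_tp = max_tp
        self._samples = deque(maxlen=window)

    def observe(self, fragmentation: float) -> None:
        """Record one fragmentation sample in the range [0, 1]."""
        if not 0.0 <= fragmentation <= 1.0:
            raise ValueError("fragmentation must be in [0, 1]")
        self._samples.append(fragmentation)

    def advise(self, current_tp: int) -> int:
        """Return the recommended TP degree given the samples seen so far."""
        window_full = len(self._samples) == self._samples.maxlen
        if window_full and min(self._samples) > self.threshold:
            return min(current_tp * 2, self.max_tp)
        return current_tp


if __name__ == "__main__":
    advisor = TpDegreeAdvisor(threshold=0.3, window=3)
    for sample in (0.4, 0.5, 0.6):
        advisor.observe(sample)
    print(advisor.advise(current_tp=2))
```

Requiring the whole window to exceed the threshold is a simple form of hysteresis, so that a single fragmented batch does not trigger a costly reconfiguration.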
Load-bearing premise
The measured differences in throughput and utilization are caused mainly by KV-cache fragmentation and interconnect limits rather than by model-specific details, workload mixes, or cluster hardware choices that were not tested.
What would settle it
Run the same reasoning workloads on the same model sizes but with explicit KV-cache defragmentation or higher-bandwidth interconnects and check whether the early throttling and sublinear tensor-parallelism gains disappear.
Original abstract
The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that shift inference into a Capacity-Bound regime. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between Data, Tensor, and Pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads, as KV-cache fragmentation forces early throttling, resulting in sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and delivers sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect and memory-bandwidth bound and favor high-degree TP, while sparse Mixture-of-Experts (MoE) models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These insights provide a rigorous decision framework for navigating the reasoning cliff, establishing new architectural imperatives for the next generation of inference infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that reasoning workloads with long Chain-of-Thought processing shift LLM inference into a capacity-bound regime, unlike compute-bound prefill in traditional workloads. Through systematic evaluation of models from 8B to 671B parameters on GPU clusters, it identifies that data parallelism is throughput-efficient for small models but encounters a capacity trap on reasoning tasks because KV-cache fragmentation forces early throttling and sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and yields sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect- and memory-bandwidth-bound and favor high-degree tensor parallelism, while sparse MoE models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These observations are presented as a decision framework for navigating the 'reasoning cliff' in inference infrastructure.
Significance. If the empirical observations hold after addressing controls and quantification, the work would provide practically useful guidelines for choosing parallelism strategies when deploying reasoning models, highlighting the transition to capacity-bound regimes and the differing needs of dense versus MoE architectures. The breadth of model sizes and parallelism degrees explored is a clear strength and supplies timely empirical data for systems design. The absence of error bars, dataset details, and isolation of the proposed causal factors nevertheless limits how strongly the conclusions can be generalized.
major comments (2)
- [Abstract] Abstract: the central claim that 'KV-cache fragmentation forces early throttling' under data parallelism on reasoning workloads is load-bearing for the paper's contribution, yet the provided description gives no indication that sequence-length distributions, batching policies, or memory-allocator behavior were varied independently of parallelism degree or total KV-cache size. This leaves open the possibility that observed throttling reflects aggregate memory capacity rather than fragmentation specifically, weakening the causal attribution to fragmentation and interconnect limits.
- [Abstract] Abstract (and presumed experimental sections): the reported 'sublinear gains near the 32B crossover' and the contrasting preferences for high-degree TP versus hybrid strategies lack accompanying quantitative metrics (throughput, utilization percentages, or statistical measures) or controls for unvaried factors such as specific model architectures and workload distributions, making it difficult to verify the robustness of the trade-off claims.
minor comments (2)
- Include error bars or confidence intervals on all performance and utilization measurements to allow readers to assess run-to-run variability.
- Define 'capacity trap' and 'stranded memory' with explicit metrics or formulas rather than qualitative description.
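To make the second minor comment concrete, here is one hypothetical way such metrics could be written down. These are not the paper's definitions. The sketch assumes a paged KV cache with fixed-size blocks, treats fragmentation as allocated-but-unused slots inside blocks, and treats memory as "stranded" when free blocks exist but are too few to admit one more request of a given reserved length. The function names and the example numbers are illustrative.

```python
def blocks_needed(tokens: int, block_size: int) -> int:
    """Blocks required to hold `tokens` tokens (integer ceiling division)."""
    return (tokens + block_size - 1) // block_size


def internal_fragmentation(seq_lens: list[int], block_size: int) -> float:
    """Share of allocated KV slots that hold no token."""
    allocated = sum(blocks_needed(n, block_size) for n in seq_lens) * block_size
    if allocated == 0:
        return 0.0
    return 1.0 - sum(seq_lens) / allocated


def stranded_fraction(
    total_blocks: int, seq_lens: list[int], block_size: int, reserve_tokens: int
) -> float:
    """Share of all blocks that are free yet unusable for a new request.

    Assumption: a request is admitted only if `reserve_tokens` worth of
    blocks can be reserved up front.
    """
    used = sum(blocks_needed(n, block_size) for n in seq_lens)
    free = total_blocks - used
    if free < 0:
        raise ValueError("sequences exceed total KV-cache capacity")
    if free < blocks_needed(reserve_tokens, block_size):
        return free / total_blocks
    return 0.0


if __name__ == "__main__":
    lens = [17, 16, 1]
    print(internal_fragmentation(lens, block_size=16))
    print(stranded_fraction(10, lens, block_size=16, reserve_tokens=128))
```

Reporting both numbers per configuration would let a reader separate waste inside allocated blocks from capacity that sits free but cannot be scheduled.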
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments help clarify the presentation of our causal claims and the need for explicit quantification. We address each major comment below and have revised the manuscript to incorporate additional experimental details, controls, and metrics where this strengthens the work without misrepresenting our existing results.
Point-by-point responses
- Referee: [Abstract] Abstract: the central claim that 'KV-cache fragmentation forces early throttling' under data parallelism on reasoning workloads is load-bearing for the paper's contribution, yet the provided description gives no indication that sequence-length distributions, batching policies, or memory-allocator behavior were varied independently of parallelism degree or total KV-cache size. This leaves open the possibility that observed throttling reflects aggregate memory capacity rather than fragmentation specifically, weakening the causal attribution to fragmentation and interconnect limits.
  Authors: We agree that the abstract's brevity leaves the causal isolation implicit. In the full experimental methodology (Section 3), we fixed total KV-cache capacity across parallelism configurations while independently sampling sequence lengths from real reasoning CoT distributions and holding batching policies constant; memory-allocator behavior was logged via CUDA memory snapshots to confirm fragmentation as the driver of early throttling rather than raw capacity exhaustion. To make this explicit, we have added a short clarifying paragraph to the abstract and expanded the methodology subsection with a table showing the controlled variables. These revisions directly address the concern while preserving the original empirical observations.
  Revision: yes
- Referee: [Abstract] Abstract (and presumed experimental sections): the reported 'sublinear gains near the 32B crossover' and the contrasting preferences for high-degree TP versus hybrid strategies lack accompanying quantitative metrics (throughput, utilization percentages, or statistical measures) or controls for unvaried factors such as specific model architectures and workload distributions, making it difficult to verify the robustness of the trade-off claims.
  Authors: We acknowledge that the abstract does not include the supporting numbers. The body of the paper already reports tokens-per-second throughput, SM utilization, and memory-bandwidth measurements for each scale and parallelism degree, with comparisons performed within model families (e.g., Llama variants) and on fixed reasoning workloads. To further strengthen verifiability, we have added error bars from repeated runs, a summary table of quantitative trade-offs at the 32B crossover, and explicit statements of the workload distribution parameters. These additions make the sublinear gains and dense-vs-MoE strategy preferences more transparent without changing the reported trends.
  Revision: partial
Circularity Check
Empirical characterization with no derived equations or self-referential claims
Full rationale
The paper presents a system characterization based on direct experimental measurements of inference performance across model sizes and parallelism strategies on GPU clusters. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. Claims about capacity traps, KV-cache effects, and parallelism trade-offs are framed as observations from cluster runs rather than quantities defined in terms of prior fitted values or self-cited uniqueness theorems. The work is self-contained against external benchmarks as an empirical study, with no reduction of results to inputs by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "data parallelism ... hits a capacity trap on reasoning workloads as KV-cache fragmentation forces early throttling"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Tensor parallelism unlocks stranded memory ... near the 32B crossover"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.