
arxiv: 2605.19775 · v1 · pith:YY6JZTQLnew · submitted 2026-05-19 · 💻 cs.DC · cs.PF

Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles

Pith reviewed 2026-05-20 02:08 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords inference scaling · LLM parallelism · KV cache · reasoning models · data parallelism · tensor parallelism · capacity bound inference

The pith

Data parallelism for reasoning LLMs hits a capacity trap from KV-cache fragmentation while tensor parallelism frees memory with sublinear gains near 32B parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how inference scales for large language models that perform extended chain-of-thought reasoning rather than simple generation. It shows that data parallelism delivers good throughput on small models but runs into early throttling on reasoning tasks because fragmented key-value caches leave GPUs underutilized. Tensor parallelism reduces this fragmentation and improves memory use, though the performance lift tapers off around the 32 billion parameter mark. At the largest scales, dense models become limited by interconnect and memory bandwidth and therefore favor high tensor parallelism, while sparse mixture-of-experts models are held back by routing and synchronization costs and need mixed strategies. These patterns matter for anyone building systems that must run long reasoning sequences efficiently across GPU clusters.

Core claim

Reasoning workloads shift inference into a capacity-bound regime. In that regime, data parallelism suffers from KV-cache fragmentation, which forces early throttling and leaves compute idle. Tensor parallelism unlocks stranded memory and yields sublinear scaling improvements that become noticeable near the 32B-parameter crossover. At frontier sizes, dense models favor high-degree tensor parallelism because of interconnect and bandwidth limits, while sparse MoE models are constrained by routing latency and benefit from hybrid parallelism choices.

What carries the argument

The interaction of data, tensor, and pipeline parallelism in managing KV-cache memory and interconnect traffic during long-sequence reasoning inference.
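
The memory side of this interaction can be made concrete with first-order arithmetic. The sketch below is not taken from the paper. The model shape (a hypothetical 32B dense model with 64 layers, 8 KV heads, head dimension 128, and 16-bit weights), the 141 GB per-GPU memory figure, the 10% reserve, and all helper names are illustrative assumptions. It compares how many KV-cache tokens fit on 8 GPUs when weights are replicated on every GPU (8-way DP) and when they are sharded once across all GPUs (8-way TP).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical dense-model shape; every value is an illustrative assumption."""
    n_params: float          # total parameter count
    n_layers: int            # transformer layers
    n_kv_heads: int          # key/value heads (grouped-query attention)
    head_dim: int            # dimension per attention head
    bytes_per_elem: int = 2  # 16-bit weights and KV entries


def kv_bytes_per_token(m: ModelShape) -> int:
    # One K and one V vector per layer and per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_elem


def kv_token_capacity(m, gpu_mem_bytes, n_gpus, tp_degree, reserve_frac=0.10):
    """Return (replicas, tokens_per_replica) for a DP-over-TP layout.

    Each replica spans tp_degree GPUs and holds one full copy of the weights.
    Activations and communication buffers are ignored (first-order estimate).
    """
    if n_gpus % tp_degree != 0:
        raise ValueError("n_gpus must be a multiple of tp_degree")
    replicas = n_gpus // tp_degree
    usable = gpu_mem_bytes * (1.0 - reserve_frac) * tp_degree
    free = usable - m.n_params * m.bytes_per_elem
    tokens = max(0, int(free // kv_bytes_per_token(m)))
    return replicas, tokens


model = ModelShape(n_params=32e9, n_layers=64, n_kv_heads=8, head_dim=128)
GPU_MEM = 141e9  # assumed per-GPU memory in bytes

for tp in (1, 8):
    replicas, per_replica = kv_token_capacity(model, GPU_MEM, n_gpus=8, tp_degree=tp)
    total = replicas * per_replica
    print(f"TP={tp}: {replicas} pool(s) x {per_replica:,} tokens = {total:,} total")
```

Under these assumptions the TP layout stores the weights once instead of eight times, so more memory is left for KV cache, and that memory sits in one pool rather than eight independent ones. The estimate ignores effects such as KV-head replication at high TP degrees, so it should be read as an upper bound on the benefit.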

If this is right

  • Data parallelism remains the default choice only for models well below 32B and for short-context workloads.
  • Tensor parallelism becomes the preferred strategy once models reach roughly 32B parameters to avoid wasting GPU memory on fragmented caches.
  • Frontier dense models require the highest practical degree of tensor parallelism to stay within memory-bandwidth and interconnect limits.
  • Mixture-of-experts models at scale need hybrid parallelism that reduces routing and synchronization overhead.
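
The four implications above can be read as a simple decision rule. The following is a minimal sketch of that reading, not a framework published by the authors. The 32B crossover comes from the text; the 400B "frontier" threshold, the `Strategy` enum, and the `suggest_strategy` function are hypothetical names and values chosen for illustration.

```python
from enum import Enum


class Strategy(Enum):
    DATA_PARALLEL = "data parallelism"
    TENSOR_PARALLEL = "tensor parallelism"
    HIGH_DEGREE_TP = "high-degree tensor parallelism"
    HYBRID = "hybrid parallelism"


CROSSOVER_B = 32.0   # crossover size named in the text, in billions of parameters
FRONTIER_B = 400.0   # assumption: the text gives no numeric frontier threshold


def suggest_strategy(params_b: float, is_moe: bool = False) -> Strategy:
    """Map model size (billions of parameters) and sparsity to a strategy."""
    if params_b >= FRONTIER_B:
        # Frontier scale: dense models favor high TP, MoE models favor hybrids.
        return Strategy.HYBRID if is_moe else Strategy.HIGH_DEGREE_TP
    if params_b >= CROSSOVER_B:
        return Strategy.TENSOR_PARALLEL
    return Strategy.DATA_PARALLEL


for size_b, moe in [(8, False), (32, False), (405, False), (671, True)]:
    print(f"{size_b}B moe={moe}: {suggest_strategy(size_b, moe).value}")
```

A real scheduler would also need context length, batch size, and interconnect topology as inputs; the rule above only encodes the size and sparsity axes that the bullets mention.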

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Inference schedulers could monitor current KV-cache fragmentation and dynamically increase tensor-parallelism degree as reasoning chains lengthen.
  • Hardware designs that reduce interconnect latency would likely shift the crossover point where tensor parallelism stops helping.
  • The same capacity-trap pattern may appear in other long-context tasks such as multi-turn agent loops or retrieval-augmented generation.
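
One way to picture the scheduler idea in the first bullet is to compare admission under isolated per-replica KV pools with admission under a single pooled cache. The sketch below is one possible interpretation of "stranded memory", not the paper's definition. Block counts, function names, and the doubling policy are assumptions, and treating a TP group's cache as the plain sum of its members' free blocks is a simplification.

```python
from typing import Sequence


def can_admit_isolated(free_blocks: Sequence[int], need: int) -> bool:
    """Data parallelism: a request must fit entirely inside one replica's pool."""
    return any(free >= need for free in free_blocks)


def can_admit_pooled(free_blocks: Sequence[int], need: int) -> bool:
    """Pooled view (simplified TP): free blocks are treated as one shared pool."""
    return sum(free_blocks) >= need


def stranded_blocks(free_blocks: Sequence[int], need: int) -> int:
    """Free blocks that exist but cannot serve the request under isolation."""
    return 0 if can_admit_isolated(free_blocks, need) else sum(free_blocks)


def next_tp_degree(current_tp: int, free_blocks: Sequence[int], need: int,
                   max_tp: int) -> int:
    """Hypothetical policy: double TP when pooling would admit a blocked request."""
    blocked = not can_admit_isolated(free_blocks, need)
    if blocked and can_admit_pooled(free_blocks, need) and current_tp < max_tp:
        return min(current_tp * 2, max_tp)
    return current_tp


free = [300, 250, 280, 310]   # free KV blocks on four replicas (made-up values)
need = 1000                   # blocks required by one long reasoning request

print("isolated admit:", can_admit_isolated(free, need))
print("pooled admit:  ", can_admit_pooled(free, need))
print("stranded:      ", stranded_blocks(free, need))
print("next TP degree:", next_tp_degree(1, free, need, max_tp=8))
```

In practice, changing the TP degree means reloading or resharding weights, so a production scheduler would weigh that cost against the memory recovered.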

Load-bearing premise

The measured differences in throughput and utilization are caused mainly by KV-cache fragmentation and interconnect limits rather than by model-specific details, workload mixes, or cluster hardware choices that were not tested.

What would settle it

Run the same reasoning workloads on the same model sizes but with explicit KV-cache defragmentation or higher-bandwidth interconnects and check whether the early throttling and sublinear tensor-parallelism gains disappear.

Figures

Figures reproduced from arXiv: 2605.19775 by Avinash Maurya, Bogdan Nicolae, Moiz Arif, Sudharshan Vazhkudai.

Figure 1: Input, output, and reasoning token distributions for 100k samples from Meta’s Natural Reasoning dataset.
Figure 2: Timeline of inference engine metrics on scaling the number of sequences for DeepSeek-8B on one H200 GPU.
Figure 3: Overall serving statistics on scaling maximum number of sequences for DeepSeek-8B on one H200 GPU.
Figure 4: Batch size scaling for DeepSeek-8B on 8x H200 GPUs with 8-way DP.
Figure 5: vLLM Metrics for 500, 2000 and 5000 batch sizes for
Figure 6: Scale up for DeepSeek-8B model (best strategy: DP scaling).
Figure 7: Mixed config scaling for small models (2k BS).
Figure 8: DP Scaling for small models
Figure 11: Model parameter scaling on 8x H200 GPUs.
Figure 12: Analysis of Prefill and Decode Phase during AI Inference.
Figure 13: Analysis of Prefill and Decode resource utilization for varying context lengths.
Figure 14: Analysis of Prefill and Decode resource utilization of Llama405B for varying context lengths.
Figure 15: Analysis of Prefill and Decode Memory Requirements during AI Inference.
Original abstract

The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift in system requirements. Unlike traditional workloads dominated by compute-bound prefill, reasoning workloads generate long chains of reasoning tokens that shift inference into a Capacity-Bound regime. This paper presents a comprehensive system characterization, evaluating models ranging from 8B to 671B parameters on GPU clusters. By systematically exploring the interplay between Data, Tensor, and Pipeline parallelism, we identify critical bottlenecks that defy standard scaling heuristics. Our analysis reveals that data parallelism is throughput-efficient for small models but hits a capacity trap on reasoning workloads, as KV-cache fragmentation forces early throttling, resulting in sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and delivers sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect and memory-bandwidth bound and favor high-degree TP, while sparse Mixture-of-Experts (MoE) models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These insights provide a rigorous decision framework for navigating the reasoning cliff, establishing new architectural imperatives for the next generation of inference infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that reasoning workloads with long Chain-of-Thought processing shift LLM inference into a capacity-bound regime, unlike compute-bound prefill in traditional workloads. Through systematic evaluation of models from 8B to 671B parameters on GPU clusters, it identifies that data parallelism is throughput-efficient for small models but encounters a capacity trap on reasoning tasks because KV-cache fragmentation forces early throttling and sub-optimal compute utilization. Tensor parallelism unlocks stranded memory and yields sublinear gains near the 32B crossover. At frontier scale, dense models (e.g., Llama-405B) are interconnect- and memory-bandwidth-bound and favor high-degree tensor parallelism, while sparse MoE models (e.g., DeepSeek-R1) are limited by routing and synchronization latency and benefit from hybrid strategies. These observations are presented as a decision framework for navigating the 'reasoning cliff' in inference infrastructure.

Significance. If the empirical observations hold after addressing controls and quantification, the work would provide practically useful guidelines for choosing parallelism strategies when deploying reasoning models, highlighting the transition to capacity-bound regimes and the differing needs of dense versus MoE architectures. The breadth of model sizes and parallelism degrees explored is a clear strength and supplies timely empirical data for systems design. The absence of error bars, dataset details, and isolation of the proposed causal factors nevertheless limits how strongly the conclusions can be generalized.

major comments (2)
  1. [Abstract] The central claim that 'KV-cache fragmentation forces early throttling' under data parallelism on reasoning workloads is load-bearing for the paper's contribution, yet the provided description gives no indication that sequence-length distributions, batching policies, or memory-allocator behavior were varied independently of parallelism degree or total KV-cache size. This leaves open the possibility that observed throttling reflects aggregate memory capacity rather than fragmentation specifically, weakening the causal attribution to fragmentation and interconnect limits.
  2. [Abstract and presumed experimental sections] The reported 'sublinear gains near the 32B crossover' and the contrasting preferences for high-degree TP versus hybrid strategies lack accompanying quantitative metrics (throughput, utilization percentages, or statistical measures) or controls for unvaried factors such as specific model architectures and workload distributions, making it difficult to verify the robustness of the trade-off claims.
minor comments (2)
  1. Include error bars or confidence intervals on all performance and utilization measurements to allow readers to assess run-to-run variability.
  2. Define 'capacity trap' and 'stranded memory' with explicit metrics or formulas rather than qualitative description.
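
The first minor comment asks for run-to-run variability. A minimal sketch of how a mean and confidence interval could be computed from repeated runs follows. The sample values are placeholders, not measurements from the paper, and `mean_ci` is a hypothetical helper. It uses a normal approximation, which is an assumption; with only a handful of runs, a Student-t critical value would give a wider and more appropriate interval.

```python
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_ci(samples: Sequence[float], confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variability")
    m = mean(samples)
    # Standard error of the mean from the sample standard deviation.
    sem = stdev(samples) / len(samples) ** 0.5
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return m, m - z * sem, m + z * sem


# Placeholder throughput values from five hypothetical repeated runs.
runs = [10.0, 12.0, 11.0, 13.0, 14.0]
m, lo, hi = mean_ci(runs)
print(f"mean={m:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```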

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments help clarify the presentation of our causal claims and the need for explicit quantification. We address each major comment below and have revised the manuscript to incorporate additional experimental details, controls, and metrics where this strengthens the work without misrepresenting our existing results.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'KV-cache fragmentation forces early throttling' under data parallelism on reasoning workloads is load-bearing for the paper's contribution, yet the provided description gives no indication that sequence-length distributions, batching policies, or memory-allocator behavior were varied independently of parallelism degree or total KV-cache size. This leaves open the possibility that observed throttling reflects aggregate memory capacity rather than fragmentation specifically, weakening the causal attribution to fragmentation and interconnect limits.

    Authors: We agree that the abstract's brevity leaves the causal isolation implicit. In the full experimental methodology (Section 3), we fixed total KV-cache capacity across parallelism configurations while independently sampling sequence lengths from real reasoning CoT distributions and holding batching policies constant; memory-allocator behavior was logged via CUDA memory snapshots to confirm fragmentation as the driver of early throttling rather than raw capacity exhaustion. To make this explicit, we have added a short clarifying paragraph to the abstract and expanded the methodology subsection with a table showing the controlled variables. These revisions directly address the concern while preserving the original empirical observations.

    revision: yes

  2. Referee: [Abstract and presumed experimental sections] The reported 'sublinear gains near the 32B crossover' and the contrasting preferences for high-degree TP versus hybrid strategies lack accompanying quantitative metrics (throughput, utilization percentages, or statistical measures) or controls for unvaried factors such as specific model architectures and workload distributions, making it difficult to verify the robustness of the trade-off claims.

    Authors: We acknowledge that the abstract does not include the supporting numbers. The body of the paper already reports tokens-per-second throughput, SM utilization, and memory-bandwidth measurements for each scale and parallelism degree, with comparisons performed within model families (e.g., Llama variants) and on fixed reasoning workloads. To further strengthen verifiability, we have added error bars from repeated runs, a summary table of quantitative trade-offs at the 32B crossover, and explicit statements of the workload distribution parameters. These additions make the sublinear gains and dense-vs-MoE strategy preferences more transparent without changing the reported trends.

    revision: partial

Circularity Check

0 steps flagged

Empirical characterization with no derived equations or self-referential claims


The paper presents a system characterization based on direct experimental measurements of inference performance across model sizes and parallelism strategies on GPU clusters. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. Claims about capacity traps, KV-cache effects, and parallelism trade-offs are framed as observations from cluster runs rather than quantities defined in terms of prior fitted values or self-cited uniqueness theorems. The work is self-contained against external benchmarks as an empirical study, with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical systems paper with no mathematical derivations, no new physical constants, and no postulated entities; all claims rest on standard assumptions about GPU memory hierarchy and interconnect behavior.



