Vidur: A large-scale simulation framework for LLM inference
1 Pith paper cites this work.
Fields: cs.DC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles
Reasoning workloads shift LLM inference to a capacity-bound regime where KV-cache fragmentation limits data parallelism, tensor parallelism unlocks memory at the 32B scale, and MoE models require hybrid strategies to avoid routing latency.