Human-Less LLM Serving: Quantifying the Human Tax on Throughput
Pith reviewed 2026-07-01 00:42 UTC · model grok-4.3
The pith
LLM serving systems lose 60-93% of throughput by enforcing human TTFT and TPOT targets on programmatic workloads that never observe them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Every major LLM serving system is built to satisfy TTFT and TPOT SLOs that reflect human perception of latency. Programmatic workloads that invoke LLMs in closed loops without a human observer do not require these SLOs, yet they still pay the resulting throughput cost. Systematic experiments demonstrate that this human tax ranges from 60 to 93 percent, increases with context length and concurrency, and is comparable across two production-grade serving stacks. An unconstrained human-less serving prototype achieves the higher baseline throughput on real workloads, showing that the tax is avoidable when the workload class is known.
What carries the argument
The human tax, quantified as the throughput reduction between a human-less unconstrained baseline and the throughput achieved when TTFT and TPOT SLOs are enforced.
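One way to make this definition concrete is sketched below. It assumes the tax is expressed as a relative reduction, since the summary above does not spell out the formula. The function name `human_tax` and the throughput figures are illustrative assumptions, not values from the paper.

```python
def human_tax(baseline_tput: float, slo_tput: float) -> float:
    """Fraction of throughput lost when human SLOs are enforced.

    Assumed definition: 1 - slo_tput / baseline_tput, where baseline_tput
    is the unconstrained (human-less) throughput.
    """
    if baseline_tput <= 0:
        raise ValueError("baseline throughput must be positive")
    return 1.0 - slo_tput / baseline_tput


# Illustrative numbers only (tokens/s); not measurements from the paper.
print(f"{human_tax(1000.0, 400.0):.0%}")
```

Under this reading, a reported tax of 60 percent means the SLO-constrained system sustains 40 percent of the unconstrained throughput, and a tax of 93 percent means it sustains 7 percent.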
If this is right
- At 64K token contexts, production-typical TTFT SLOs remove a large fraction of throughput relative to the human-less baseline.
- The throughput penalty grows substantially as context length increases.
- Higher concurrency amplifies the size of the human tax.
- The tax magnitude is qualitatively the same for SGLang and Sarathi-Serve.
- Serving systems can avoid the tax by exposing separate SLA configurations for different workload classes.
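The last point, separate SLA configurations per workload class, might look roughly like the sketch below. Everything here is hypothetical: the class names, the `SlaConfig` type, and the latency values are placeholders chosen for illustration and do not correspond to any option in SGLang or Sarathi-Serve.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SlaConfig:
    # None means "no latency target": the scheduler may optimize for throughput.
    ttft_ms: Optional[float]
    tpot_ms: Optional[float]


# Hypothetical per-class table; the latency values are placeholders.
SLA_BY_CLASS = {
    "interactive": SlaConfig(ttft_ms=500.0, tpot_ms=50.0),
    "programmatic": SlaConfig(ttft_ms=None, tpot_ms=None),
}


def sla_for(workload_class: str) -> SlaConfig:
    # Unknown classes fall back to the human-facing targets (conservative choice).
    return SLA_BY_CLASS.get(workload_class, SLA_BY_CLASS["interactive"])


print(sla_for("programmatic"))
```

Falling back to the interactive targets for unrecognized traffic keeps the current behavior as the default, so only traffic explicitly marked as programmatic gives up the latency guarantees.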
Where Pith is reading between the lines
- Routing programmatic traffic to dedicated high-throughput serving instances could recover most of the lost capacity without changing the serving code.
- Automatic workload classification at the gateway could let systems apply the appropriate SLA set without manual configuration.
- Existing LLM serving benchmarks that assume human SLOs may understate achievable throughput for non-interactive use cases.
- Higher sustained throughput from human-less serving would reduce the number of GPUs needed for the same aggregate request rate.
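The GPU-count implication in the last point can be checked with back-of-the-envelope arithmetic. The sketch below assumes that capacity scales linearly with GPU count and that the tax is defined as 1 - taxed / untaxed throughput. The request rates are placeholder figures, not numbers from the paper.

```python
import math


def gpus_needed(request_rate: float, per_gpu_rate: float) -> int:
    """GPUs required to sustain request_rate, assuming linear scaling."""
    return math.ceil(request_rate / per_gpu_rate)


def untaxed_rate(taxed_rate: float, tax: float) -> float:
    """Per-GPU rate with the tax removed, assuming tax = 1 - taxed / untaxed."""
    if not 0.0 <= tax < 1.0:
        raise ValueError("tax must be in [0, 1)")
    return taxed_rate / (1.0 - tax)


# Placeholder figures: 100 req/s aggregate, 2 req/s per GPU under human SLOs.
for tax in (0.60, 0.93):
    before = gpus_needed(100.0, 2.0)
    after = gpus_needed(100.0, untaxed_rate(2.0, tax))
    print(f"tax={tax:.0%}: {before} GPUs -> {after} GPUs")
```

Under these assumptions, the two ends of the reported range correspond to roughly a 2.5x and a 14x increase in per-GPU throughput, which is why the fleet-size effect would be large if the measurements generalize.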
Load-bearing premise
The workloads, concurrency levels, and system behaviors tested represent real programmatic LLM usage, and a human-less baseline can be realized in practice without introducing other unmeasured costs.
What would settle it
Throughput measurements on production programmatic workloads that show either no throughput difference or losses outside the reported 60-93 percent range when human TTFT and TPOT SLOs are applied versus removed.
Original abstract
Every major LLM serving system is designed to meet TTFT and TPOT SLOs. These metrics capture latency as a human user perceives it, and the mechanisms built to satisfy them are now standard infrastructure. We observe that long-horizon AI tasks call LLMs programmatically in tight loops where no human observes TTFT or TPOT. We ask: how much throughput do serving systems sacrifice to meet TTFT and TPOT SLAs that these workloads never need? We conduct a systematic measurement study across chunk sizes, SLO settings, context lengths, and concurrency levels. We find that the human tax on throughput grows substantially with context length and lands in the 60-93% range. At 64K token contexts, tightening the TTFT SLO to production-typical settings costs a large fraction of throughput versus the human-less baseline. The human tax is larger at higher concurrency and is qualitatively similar across SGLang and Sarathi-Serve. We term the unconstrained optimum human-less serving and provide a prototype demonstrating that it is practical on real workloads. Our findings argue that serving systems should expose workload-class-aware SLA configurations rather than silently applying the human tax uniformly to all traffic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a measurement study of LLM serving systems (SGLang and Sarathi-Serve) and claims that standard TTFT/TPOT SLO mechanisms impose a 'human tax' of 60-93% throughput loss for programmatic long-horizon workloads that do not observe these latencies. The tax grows with context length (especially at 64K tokens) and concurrency; the authors introduce 'human-less serving' as an unconstrained baseline and provide a prototype showing it is practical, arguing that systems should expose workload-class-aware SLA configurations.
Significance. If the measurements are representative and the prototype incurs no offsetting costs, the result identifies a substantial, previously unquantified inefficiency in serving infrastructure for non-interactive AI tasks. The explicit prototype and cross-system consistency are strengths that could support follow-on work on class-aware schedulers.
major comments (3)
- [§4] §4 (Workload and concurrency setup): the claim that the tested patterns proxy real programmatic LLM usage rests on synthetic request distributions whose parameters are not compared against production traces; without this, the 60-93% range cannot be shown to generalize beyond the measured synthetic loops.
- [§6] §6 (Human-less prototype): the assertion that the unconstrained scheduler achieves the reported gains without hidden costs (extra memory, scheduling overhead, or degraded tail metrics) is not supported by any reported measurements of those quantities; the throughput comparison alone is insufficient to establish practicality.
- [§5.3] §5.3 (64K context results): the headline 60-93% tax at production-typical TTFT SLOs is presented without error bars, number of runs, or data-exclusion rules, making it impossible to judge whether the range reflects stable behavior or sensitivity to particular random seeds or request ordering.
minor comments (2)
- Notation for TTFT/TPOT SLO settings is introduced without a consolidated table; a single reference table would improve readability across the measurement sections.
- The abstract states the tax is 'qualitatively similar' across systems, but the main text does not quantify the similarity (e.g., overlap in ranges or statistical test); this should be made explicit.
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments, which help improve the manuscript's clarity and rigor. We respond to each major comment below.
Referee: [§4] §4 (Workload and concurrency setup): the claim that the tested patterns proxy real programmatic LLM usage rests on synthetic request distributions whose parameters are not compared against production traces; without this, the 60-93% range cannot be shown to generalize beyond the measured synthetic loops.
Authors: We agree that direct comparison to production traces would strengthen the generalization claim. Our synthetic distributions are motivated by common programmatic usage patterns described in the literature on AI agents and long-horizon tasks (e.g., iterative tool use with fixed chunk sizes). In the revised version, we will expand §4 to cite public datasets and benchmarks that support our parameter choices, and we will report how the tax varies across those parameter settings to show robustness. We cannot access proprietary traces, but we believe the synthetic setup captures the essential characteristics. revision: partial
Referee: [§6] §6 (Human-less prototype): the assertion that the unconstrained scheduler achieves the reported gains without hidden costs (extra memory, scheduling overhead, or degraded tail metrics) is not supported by any reported measurements of those quantities; the throughput comparison alone is insufficient to establish practicality.
Authors: This is a valid concern. The current manuscript focuses on throughput as the primary metric. To demonstrate practicality, the revision will add measurements of memory consumption, CPU/GPU scheduling overhead, and tail latencies (P95, P99) for the human-less prototype versus standard configurations. These additional results will be presented in §6 so that readers can judge whether the prototype carries significant hidden costs. revision: yes
Referee: [§5.3] §5.3 (64K context results): the headline 60-93% tax at production-typical TTFT SLOs is presented without error bars, number of runs, or data-exclusion rules, making it impossible to judge whether the range reflects stable behavior or sensitivity to particular random seeds or request ordering.
Authors: We thank the referee for pointing this out. The experiments were run multiple times with different seeds to ensure stability, but the details were omitted for brevity. In the revised manuscript, we will include error bars representing standard deviation across runs, specify that results are from 5 independent runs, and clarify that no data points were excluded beyond standard outlier filtering based on request completion. This will be added to §5.3. revision: yes
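The error bars promised in this response could be computed along the lines of the sketch below. The throughput samples are placeholders standing in for 5 runs per configuration, and the tax is again assumed to be 1 - constrained / baseline throughput. Nothing here reproduces the paper's data or its analysis code.

```python
import statistics

# Placeholder throughput samples (tokens/s) from 5 hypothetical runs each.
baseline_runs = [1000.0, 980.0, 1020.0, 990.0, 1010.0]
slo_runs = [300.0, 310.0, 290.0, 305.0, 295.0]

# Per-run tax, pairing runs by seed; assumes tax = 1 - slo / baseline.
taxes = [1.0 - s / b for s, b in zip(slo_runs, baseline_runs)]

mean_tax = statistics.mean(taxes)
std_tax = statistics.stdev(taxes)  # sample standard deviation (n - 1)
print(f"human tax: {mean_tax:.1%} +/- {std_tax:.1%} over {len(taxes)} runs")
```

Reporting the per-run spread alongside the mean would let a reader see whether the 60-93 percent range reflects differences between configurations or run-to-run noise within a single configuration.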
Circularity Check
No significant circularity: empirical measurement study without derivations or self-referential predictions
This is a measurement study that quantifies throughput differences via direct experiments across chunk sizes, SLOs, context lengths, and concurrency levels, plus a prototype implementation. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing arguments, uniqueness theorems, or ansatzes are present in the abstract or described approach. The central claims rest on observed measurements rather than any chain that reduces to its own inputs by construction, so the analysis is self-contained.