arxiv: 2606.11690 · v1 · pith:ZA4S2CLGnew · submitted 2026-06-10 · 💻 cs.DC · cs.PF

Beyond Per-Token Pricing: A Concurrency-Aware Methodology for LLM Infrastructure Cost Estimation

Pith reviewed 2026-06-27 08:43 UTC · model grok-4.3

classification 💻 cs.DC cs.PF
keywords: LLM inference cost · GPU utilization · request rate · concurrency · cost estimation · vLLM · MoE models · FP8 quantization

The pith

LLM inference cost on the same H100 hardware varies from $0.21 to $15.25 per million output tokens depending on offered request rate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Public cost calculators treat GPU utilization as a fixed user input or preset, but the paper demonstrates this produces large systematic errors because utilization is actually set by the operator-controlled request rate lambda. Lambda determines in-flight concurrency through Little's Law, so on identical hardware the effective cost per token can swing by factors of 2.5x to 36.3x across realistic enterprise loads from 1-10 requests per second down to near idle. The authors therefore introduce an explicit measurement methodology that expresses effective cost as C_eff = f(H, M, Q, lambda, L) and validate it across 42 benchmarks on dense and MoE models plus 56 runs on A100 hardware. They release an open-source meter that attaches to a live vLLM server and reports real costs against the operator's own traffic pattern. The data also indicate that active parameter count, rather than total model size, is the stronger predictor of saturation economics and that FP8 gains are larger for the MoE models tested.

Core claim

On identical H100 hardware, effective cost per million output tokens ranges from $0.21 to $15.25 as a direct function of offered request rate lambda, which sets in-flight concurrency via Little's Law. The resulting underutilization penalty is 2.5-24x at 1-10 rps and up to 36.3x near idle. Any calculator that takes utilization as a fixed input therefore understates true cost by a factor of exactly 1/U, most severely for low-traffic workloads. The load-driven spread reproduces on A100 hardware, at a narrower 7.0-11.4x.

What carries the argument

The parameterized cost function C_eff = f(H, M, Q, lambda, L) that makes offered request rate lambda the explicit driver of concurrency and utilization instead of treating utilization as an independent input.
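A minimal sketch of how one point of such a cost function might be evaluated, assuming the definition given in the Figure 5 caption (self-hosted C_eff is GPU $/hr divided by output throughput). The function names, the $2.50/hr rate, and the throughput figures are hypothetical illustrations, not values or code from the paper.

```python
def effective_cost_per_million(gpu_usd_per_hour: float,
                               output_tokens_per_sec: float) -> float:
    """USD per million output tokens at a measured output throughput."""
    if output_tokens_per_sec <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = output_tokens_per_sec * 3600.0
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000


def underutilization_penalty(cost_at_load: float,
                             cost_at_saturation: float) -> float:
    """Ratio of the cost at some offered load to the cost at saturation."""
    return cost_at_load / cost_at_saturation


# Hypothetical inputs: same GPU, two measured throughputs.
rate = 2.50                                          # assumed GPU $/hr
c_idle = effective_cost_per_million(rate, 50.0)      # light offered load
c_sat = effective_cost_per_million(rate, 2000.0)     # saturated engine
penalty = underutilization_penalty(c_idle, c_sat)
print(f"idle-edge ${c_idle:.2f}/M, saturated ${c_sat:.2f}/M, penalty {penalty:.1f}x")
```

In this sketch the throughput is an input; in the paper's framing it would be whatever the server actually delivers at a given lambda, which is why the cost cannot be computed from hardware specifications alone.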

If this is right

  • Utilization-naive calculators understate true cost by a factor of exactly 1/U for any given workload.
  • Self-hosting estimates are systematically too optimistic, and they understate cost most at low request rates.
  • FP8 quantization improves peak throughput roughly 2.2-2.4 times more for the tested MoE models than for the dense model (+69% to +74% vs. +31%; n=3).
  • The data are consistent with active parameter count, not total model size, being a primary predictor of saturation economics.
  • The load-driven cost spread reproduces on A100 hardware at 7.0-11.4x, and the active-parameter ordering survives there at FP8.
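The 1/U factor in the first bullet can be reconstructed in two lines. This is a sketch under the assumption, stated in the simulated rebuttal, that token throughput scales linearly with utilization U; the symbols R (GPU hourly rate), T_peak (saturated output throughput), and C_naive are introduced here for illustration and are not the paper's notation. Unit-conversion constants are omitted.

```latex
\begin{align*}
C_{\text{naive}} &= \frac{R}{T_{\text{peak}}}
  && \text{(calculator assumes full utilization)} \\
C_{\text{eff}} &= \frac{R}{U \, T_{\text{peak}}} = \frac{C_{\text{naive}}}{U}
  && \text{(delivered throughput is } U \, T_{\text{peak}},\ 0 < U \le 1\text{)}
\end{align*}
```

At U = 0.1, for instance, the true cost is ten times the naive estimate.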

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Enterprise operators could attach the released meter to their production traffic to obtain workload-specific cost figures before choosing between self-hosting and cloud.
  • Pricing models for inference services could expose lambda or concurrency as a pricing dimension rather than quoting only per-token rates.
  • The same measurement approach could be applied to inference engines other than vLLM to test whether the lambda-driven spread is engine-independent.

Load-bearing premise

The 42 benchmarks on dense and MoE models plus 56 A100 runs are representative of general LLM inference workloads and no other unmeasured variables significantly affect the utilization-cost relationship beyond lambda.

What would settle it

A new set of measurements on different hardware or models in which effective cost per token does not vary with lambda over the same range or fails to match the 1/U understatement predicted by the model.

Figures

Figures reproduced from arXiv: 2606.11690 by Chitral Patil.

Figure 1: C_eff vs. offered load λ across six H100 configurations (log-log); solid lines are FP16, dashed FP8. Effective cost drops by roughly 20× over the first 25 rps, then flattens as the engine saturates. Calculators that omit λ and assume a fixed utilization or peak throughput therefore misreport cost by close to an order of magnitude in the low-λ regime. At λ ≥ 50 the server is queue-limited; throughput and cos…
Figure 2: FP8 gives the dense model +31% peak throughput and the two MoE models +69% (Mixtral) and +74% (Qwen), a roughly 2.2–2.4× larger win for the MoE architectures. Quantization behaves as an MoE-first optimization that also helps dense models. … weight-memory movement is amortized across the batch. At small batch sizes or in prefill-dominated regimes, total parameters drive memory bandwidth cost and the active-v…
Figure 3: Underutilization penalty (idle-edge C_eff over saturation C_sat) by config and offered load. The penalty lives almost entirely in the λ=1 column (17.5–36.3×) and collapses to ≈1.0× by λ=50 across every config. Within each model the FP8 variant carries the larger penalty (quantization lowers the saturation floor more than the idle-edge cost), peaking at the darkest cell, 36.3× for Qwen3-30B-A3B FP8. … to 8.73% at…
Figure 5: Both axes use cost per million output tokens. Self-hosted C_eff is GPU $/hr divided by output throughput; dashed API lines are each provider’s list output-token price only (the “/out” suffix in the inline labels is a reminder of this). Input-token billing, prompt-caching, and batch-API discounts are not applied to the API lines; all three would shift them down (see §5.6 caveat). Plotted self-hosted lines a…
Figure 7: Effective cost at λ∈{1, 25, 100} across three I/O shapes (chat 512:256, RAG 4096:1024, agentic 1024:4096) for C2 and C4 with prefix caching off and Poisson arrivals. The RAG/chat cost ratio is non-monotonic: RAG is cheaper than chat at λ=1 (long prefill amortises cold-start cost), 2.4–2.7× at λ=25, and peaks at 3.2–3.7× at λ=100; prompt-heavy workloads dominate cost at high load while preserving the offe…
Figure 8: Effective cost C_eff vs. offered load λ on H100 NVL (solid blue) vs. A100 80GB PCIe (dashed orange), log-log, for six paired configurations. The saturation cliff reproduces on A100; the absolute spread is narrower (7.0–11.4× vs. 17.5–36.3×) because A100’s lower peak throughput and lower hourly rate compress the numerator-denominator ratio. At low λ (near-idle), A100’s lower hourly rate makes it cheaper per-…
Original abstract

Every public LLM cost calculator we surveyed treats GPU utilization as a fixed input -- entered by the user, baked in as a preset, or silently assumed at 100% -- never measured against the operator's actual load. We show that this assumption is the dominant source of error: on identical H100 hardware, effective cost spans \$0.21 to \$15.25 per million output tokens, an underutilization penalty of 2.5-24x across low-to-moderate enterprise loads (1-10 rps) and up to 36.3x near idle -- driven by one operator-controlled variable, offered request rate lambda, which sets in-flight concurrency via Little's Law and which no open-source calculator exposes. Because calculators take utilization as a user-supplied input, any utilization-naive estimate understates true cost by exactly 1/U, systematically mispricing self-hosting -- most severely over-selling it for low-traffic workloads. We propose a measurement methodology that parameterizes the relationship as C_eff = f(H, M, Q, lambda, L), validate it with 42 benchmarks across dense, ultra-sparse MoE, and sparse MoE models, and release vllm-cost-meter, an open-source cost meter that attaches to a live vLLM server and reports real \$/M-tokens against the operator's own traffic. We further show that FP8 quantization benefits the MoE architectures we tested roughly 2.2-2.4x more than the dense model (+69 to +74% vs. +31% peak throughput; n=3, broader validation needed), and our data are consistent with active parameter count, not total model size, being a primary predictor of saturation economics. To rule out single-hardware confounding we repeat the core sweep on A100 80GB PCIe (56 runs): the load-driven spread reproduces at 7.0-11.4x, the active-parameters ordering survives at FP8, and the dense-FP8 advantage inverts on silicon without native FP8 tensor cores -- a hardware-conditional caveat the framework already accommodates.
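The "roughly 2.2-2.4x" FP8 figure in the abstract follows from the quoted peak-throughput gains by simple division. A short arithmetic check, using only the percentages stated above; the model labels follow the Figure 2 caption, and the variable names are illustrative.

```python
# Peak-throughput gains from FP8 as quoted in the abstract, in percent.
dense_gain = 31.0
moe_gains = {"Mixtral": 69.0, "Qwen": 74.0}  # labels per the Figure 2 caption

# How many times larger each MoE gain is than the dense model's gain.
ratios = {name: gain / dense_gain for name, gain in moe_gains.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the dense model's FP8 gain")
```

Both ratios land inside the stated 2.2-2.4x band, so the headline multiplier compares relative gains rather than absolute throughput.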

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that public LLM cost calculators treat GPU utilization as a fixed or user-supplied input, which is the dominant source of error in cost estimation. On identical H100 hardware, effective cost per million output tokens ranges from $0.21 to $15.25 depending on the operator-controlled offered request rate λ (which sets concurrency via Little's Law), producing underutilization penalties of 2.5–24× at low-to-moderate loads (1–10 rps) and up to 36.3× near idle. The authors formalize the relationship as C_eff = f(H, M, Q, λ, L), validate the approach empirically across 42 benchmarks on dense, ultra-sparse MoE, and sparse MoE models plus 56 A100 runs, demonstrate hardware-conditional FP8 gains for MoE models, and release the open-source vllm-cost-meter tool that reports live $/M-token costs against real traffic.

Significance. If the measurements and dominance of λ hold, the work identifies a systematic bias in existing calculators that most severely affects low-traffic self-hosting decisions and supplies a practical measurement methodology plus tooling to correct it. Credit is due for the open-source release, the cross-hardware reproduction of the load-driven spread, and the empirical observation that active parameter count (rather than total size) better predicts saturation economics. The FP8 result is noted by the authors as preliminary.

major comments (2)
  1. [Abstract and empirical validation] Abstract and validation description: the central claim that λ is the dominant operator-controlled variable (and that the reported 2.5–36.3× spread generalizes) requires evidence that query characteristics Q and context L were varied over a representative range; the provided description of the 42 benchmarks supplies no quantitative distribution or sweep over input/output lengths, arrival processes, or context, leaving open whether the cost multipliers would persist under realistic enterprise Q variation.
  2. [Methodology and results] The assertion that utilization-naive estimates understate true cost by exactly 1/U is load-bearing for the mispricing claim, yet it assumes that the measured utilization is determined solely by λ once H and M are fixed; without explicit controls or ablations showing that other server-internal factors (batching policy, scheduling) do not materially alter the λ–U relationship, the exact 1/U multiplier may not be universal across inference servers.
minor comments (2)
  1. [Abstract] The abstract states 'n=3' for the FP8 comparison but does not indicate whether these are independent runs or repeated measurements on the same configuration; adding error bars or explicit replication details would strengthen the quantitative claims.
  2. [Methodology] Notation for the function C_eff = f(H, M, Q, lambda, L) is introduced without an accompanying table or diagram that maps each variable to its measured or controlled status across the benchmark suite.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify areas where additional detail will strengthen the manuscript. We respond to each major comment below and will incorporate revisions to address the concerns on benchmark characterization and methodological robustness.

Point-by-point responses
  1. Referee: [Abstract and empirical validation] Abstract and validation description: the central claim that λ is the dominant operator-controlled variable (and that the reported 2.5–36.3× spread generalizes) requires evidence that query characteristics Q and context L were varied over a representative range; the provided description of the 42 benchmarks supplies no quantitative distribution or sweep over input/output lengths, arrival processes, or context, leaving open whether the cost multipliers would persist under realistic enterprise Q variation.

    Authors: We agree that the current description of the 42 benchmarks lacks a quantitative summary of Q and L distributions. In the revision we will add a table (new Table 2 or appendix) reporting ranges, means, and distributions for input lengths, output lengths, context lengths, and arrival processes (Poisson) across the benchmarks. These cover standard enterprise-like workloads with variation in Q and L sufficient to support the observed λ-driven spreads. The core claim centers on λ as the operator-controlled variable via Little's Law, with the spread reproduced across model types; the added table will make the representativeness explicit. revision: yes

  2. Referee: [Methodology and results] The assertion that utilization-naive estimates understate true cost by exactly 1/U is load-bearing for the mispricing claim, yet it assumes that the measured utilization is determined solely by λ once H and M are fixed; without explicit controls or ablations showing that other server-internal factors (batching policy, scheduling) do not materially alter the λ–U relationship, the exact 1/U multiplier may not be universal across inference servers.

    Authors: The 1/U relationship follows from the fixed hardware cost and the empirical observation that token throughput scales linearly with utilization U under the measured λ (via Little's Law). This is specific to the vLLM server used. To address potential effects of batching and scheduling, the revision will include an ablation (new subsection or appendix) varying vLLM parameters such as max_num_seqs and max_num_batched_tokens while holding λ fixed, confirming that the λ–U curve and resulting 1/U cost factor remain stable within the tested operating range. The framework itself is parameterized by H and M to accommodate different servers. revision: yes
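The Poisson arrival process named in the first response is what makes λ an offered, open-loop rate: requests are sent on a schedule regardless of how quickly the server answers. A minimal sketch of generating such a schedule with the Python standard library; the function name and parameters are hypothetical and are not taken from vllm-cost-meter or the paper's benchmark harness.

```python
import random


def poisson_arrival_times(rate_rps: float, duration_s: float,
                          seed: int = 0) -> list[float]:
    """Send times in seconds for an open-loop Poisson process at rate_rps."""
    if rate_rps <= 0 or duration_s <= 0:
        raise ValueError("rate and duration must be positive")
    rng = random.Random(seed)
    times = []
    # Inter-arrival gaps of a Poisson process are exponential with mean 1/rate.
    t = rng.expovariate(rate_rps)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(rate_rps)
    return times


arrivals = poisson_arrival_times(rate_rps=10.0, duration_s=60.0)
print(f"{len(arrivals)} requests scheduled, "
      f"empirical rate {len(arrivals) / 60.0:.2f} rps")
```

A load generator would sleep until each timestamp and fire a request without waiting for earlier ones to finish; a closed-loop client that waits for responses would instead cap concurrency and hide the queueing behaviour the paper measures.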

Circularity Check

0 steps flagged

No circularity; claims rest on direct empirical measurements

Full rationale

The paper's headline cost figures ($0.21–$15.25/M tokens, 2.5–36.3× penalties) are obtained from 42 benchmarks plus 56 A100 runs that directly measure effective cost under varying offered request rates lambda on fixed hardware. Little's Law is invoked solely as the standard queueing relation L = λW (here L is the mean number of requests in flight and W the mean time in system) to interpret in-flight concurrency, not as a fitted or derived quantity. The parameterization C_eff = f(H, M, Q, lambda, L) is presented as an empirical measurement framework validated against those runs, with no self-citations, no fitted inputs renamed as predictions, and no derivations that reduce by construction to the inputs. The result therefore rests on direct measurement rather than on a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard queueing theory and empirical benchmarks on specific hardware and models. No new entities postulated and no free parameters fitted to produce the central claim.

axioms (1)
  • standard math · Little's Law: average concurrency equals arrival rate times average latency
    Invoked to link offered request rate lambda to in-flight concurrency and thus GPU utilization.
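Little's Law as stated in this axiom is a one-line computation. The sketch below assumes a stable system in steady state; the function name and the 4-second latency are hypothetical illustrations, not measurements from the paper.

```python
def in_flight_concurrency(arrival_rate_rps: float, mean_latency_s: float) -> float:
    """Little's Law: mean requests in flight = arrival rate x mean time in system."""
    if arrival_rate_rps < 0 or mean_latency_s < 0:
        raise ValueError("inputs must be non-negative")
    return arrival_rate_rps * mean_latency_s


# Hypothetical: hold mean end-to-end latency at an assumed 4 s for each rate.
# On a real server latency itself shifts with load, so it must be measured.
for lam in (1, 10, 50):
    print(f"lambda={lam} rps -> {in_flight_concurrency(lam, 4.0):.0f} requests in flight")
```

The point of the relation for cost estimation is that concurrency is not a free knob: once the offered rate and the server's latency at that rate are known, the batch the GPU actually sees is determined.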

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

    cs.AI 2026-06 unverdicted novelty 6.0

    Flash endurance is priced via shadow price η making placement cost-optimal for any sign of value-write correlation χ, with χ positive only in recurrent long-horizon manipulation and the budget binding only on low-endu...
