pith. machine review for the scientific record. sign in

arxiv: 2605.02821 · v3 · submitted 2026-05-04 · 💻 cs.PF

Recognition: unknown

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

Authors on Pith no claims yet

Pith reviewed 2026-05-08 16:53 UTC · model grok-4.3

classification 💻 cs.PF
keywords open-weight LLMshosted APIsmeasurement studyservice heterogeneitymodel routingAPI performancetoken usage patternsdemand concentration
0
0 comments X

The pith

Hosted open-weight LLMs function as provider-specific, time-varying services rather than fixed model artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how the same open-weight LLM produces different outcomes when delivered through different hosted APIs. Each instance forms a service object shaped by its protocol behavior, context capacity, price, latency and throughput distributions, reliability, and task feasibility. A sympathetic reader cares because selecting models by name alone misses measurable differences in cost and speed that appear in real request traffic. The study draws on sampled logs, pricing snapshots, and latency probes from Q4 2025 to document concentrated demand across families, loose alignment between listed features and actual use, and task-driven token-length patterns that turn provider choice into a constrained optimization. Counterfactual routing under observed constraints shows cost reductions near 38 percent and throughput gains near 90 percent for representative models.

Core claim

The same open-weight model does not constitute the same service when hosted by different providers. The operational unit is a service object defined by the combination of model variant, protocol behavior, context capacity, listed price, latency and throughput distribution, reliability, and task feasibility. Measurements reveal that demand concentrates on a few families while older variants remain active, that provider listings do not predict realized adoption, and that applications induce distinct token-length regimes so that selection occurs over provider-model-task-time tuples.

What carries the argument

The service object: a provider-specific, time-varying endpoint that aggregates model variant, protocol support, capacity, price, performance distributions, and task feasibility to redefine equivalence beyond model name alone.

Load-bearing premise

The sampled request logs, provider metadata, and continuous latency measurements collected during Q4 2025 represent broader real-world usage and capture all relevant variations in service behavior.

What would settle it

A larger dataset showing statistically identical latency, throughput, error-rate, and protocol distributions for the same model variant across all major providers would falsify the claim of service heterogeneity.

Figures

Figures reproduced from arXiv: 2605.02821 by Dongsheng Liu, Haorui Li, Jiakang Ma, Lupan Wu, Tianhui Shi, Xiongchao Tang, Xuanzi Liu, Yangjie Wu, Yang Xu, Zhenghui He.

Figure 1
Figure 1. Figure 1: Model-demand concentration in the displayed aggregate. Panel A shows request volume normalized to DeepSeek-V3/R1. Panel B shows the Lorenz curve over the displayed model-family set; the shaded area indicates deviation from a uniform distribution view at source ↗
Figure 2
Figure 2. Figure 2: reports the number of service providers that offer API access to each model family. DeepSeek models have the broadest observed support. In the sampled provider universe, aiping.cn lists 29 providers; DeepSeek-V3/R1 and DeepSeek-V3.1 are supported by 23 providers each, and 24 providers support at least one DeepSeek model. The difference between support for individual variants and support for a family shows … view at source ↗
Figure 3
Figure 3. Figure 3: Provider coverage for DeepSeek and Qwen variants view at source ↗
Figure 4
Figure 4. Figure 4: Model request volume, normalized against DeepSeek-V3, versus provider count. 8 view at source ↗
Figure 5
Figure 5. Figure 5: compares input/output prices with total and active parameter counts for selected large models. The sample does not support a simple monotone interpretation of price as a function of total parameter count. In mixture-of-experts (MoE) architectures, total parameter count measures stored capacity, while active parameters better approximate per-token computation. Even active parameter count, however, does not … view at source ↗
Figure 6
Figure 6. Figure 6: Number of open-weight models supported by selected providers. 5.2 Anchored Prices, Dispersed Performance view at source ↗
Figure 7
Figure 7. Figure 7: reports relative pricing for popular models. Most entries are close to official pricing, with the chart’s “identical to official” category defined as values within 5% of the official benchmark. This concentration indicates that listed price is relatively anchored for popular models view at source ↗
Figure 8
Figure 8. Figure 8: Provider-level performance categories in TTFT and TPS. A stronger artifact release should expose p50/p90 TTFT and TPS plus provider-model sample sizes. can move an endpoint from feasible to infeasible. Thus, context shrinkage is better modeled as a constraint in the feasible set than as a continuous penalty in an objective function view at source ↗
Figure 9
Figure 9. Figure 9: Context length variation among service providers. Absolute context-window values and tested￾versus-advertised limits are required for exact reproducibility. 5.3 Service Quality Is Time-Varying view at source ↗
Figure 10
Figure 10. Figure 10: First-week versus last-week service-quality comparison for selected provider-model observations. The figure is descriptive and should be interpreted alongside request-length and task-mix controls. 5.4 Protocol Compatibility as a Feasible-Set Constraint view at source ↗
Figure 11
Figure 11. Figure 11: Protocol support rates for OpenAI-compatible and Anthropic-compatible interfaces across the tested provider subset view at source ↗
Figure 11
Figure 11. Figure 11: Protocol support rates for OpenAI-compatible and Anthropic-compatible interfaces across the tested provider subset. The gap has a statistical consequence for application portability. When an application uses a 12 view at source ↗
Figure 12
Figure 12. Figure 12: Slow-response ratio by provider view at source ↗
Figure 13
Figure 13. Figure 13: Slow-response ratio by model. 5.7 Breadth Versus Depth in Provider Strategy The provider-level evidence suggests a breadth-depth trade-off. Some providers list many models quickly, capturing broad coverage and early availability. Others allocate resources to fewer models and can achieve stronger throughput or latency on those specific models. The trade-off is statistical: broad coverage increases the prob… view at source ↗
Figure 14
Figure 14. Figure 14: Input and output length distribution across labeled application categories. The task taxonomy applies only to the labeled subset of traffic. This heterogeneity changes the correct optimization target. A long-input, short-output request is sensitive to input price, prefill speed, and context length. A short-input, long-output request is sensitive to output price and decode throughput. A short-input, short-… view at source ↗
Figure 15
Figure 15. Figure 15: Row-normalized task composition by model family, estimating P(task | model) within the labeled subset. A column-normalized companion table is required to estimate P(model | task). 6.3 Observed Routing Preferences view at source ↗
Figure 16
Figure 16. Figure 16: reports selected routing policies. In the observed sample, 77.1% of default routing configurations use performance-driven strategies. The figure should be interpreted as a configuration￾level distribution rather than a mutually exclusive request-level distribution unless otherwise specified by the released denominator. It is a revealed-preference signal: when users can choose among routing objectives, a l… view at source ↗
Figure 17
Figure 17. Figure 17: Counterfactual cost comparison for Qwen3-32B. For DeepSeek-V3.2, throughput is compared between official access and routed access on approximately one million requests. The observed average TPS under routing is about 90% higher than the official baseline, with especially large gains for requests generating more than 1,000 tokens view at source ↗
Figure 18
Figure 18. Figure 18: Throughput comparison between official access and routed access for DeepSeek-V3.2. 6.5 Temporal and Geographic Load Structure view at source ↗
Figure 19
Figure 19. Figure 19: shows a daily cycle in request volume and active users. Activity declines from late night to early morning, reaches a trough around 6–8 AM, rises during work hours, peaks around 3–4 PM, and forms a secondary evening plateau. User activity also rebounds around 9–11 PM without a proportional increase in request volume, suggesting lower per-user call intensity during late evening interactive use view at source ↗
Figure 20
Figure 20. Figure 20: Weekly request distribution, normalized to noon on Monday view at source ↗
Figure 21
Figure 21. Figure 21: summarizes geographic distribution by time period. In the observed sample, domestic requests dominate, with Beijing and other domestic regions contributing 46.3% and 42.8%, respec￾tively; overseas requests account for 10.9%. The Beijing series has a stronger afternoon-to-evening peak, which may reflect concentrated batch jobs or workflow triggers from a smaller number of high-volume users. Because the geo… view at source ↗
Figure 21
Figure 21. Figure 21: Geographic distribution of requests by time period. 7 Discussion The measurements point to a simple but consequential lesson: in hosted LLM inference, a model name is not a sufficient description of the service being purchased or operated. The same model family can have different availability, context capacity, protocol behavior, latency, and throughput depending on the provider and the time of measuremen… view at source ↗
read the original abstract

Open-weight large language models (LLMs) are usually named as model artifacts, but production users often consume them as hosted API services. This paper argues that the operational unit is a service object: a provider-specific, time-varying endpoint defined by model variant, protocol behavior, context capacity, listed price, latency and throughput distribution, reliability, and task feasibility. Using sampled request logs, provider metadata, compatibility probes, pricing snapshots, and continuous latency measurements collected by AI Ping during Q4 2025, we study how this service layer changes the meaning of "the same model." Three empirical patterns emerge. First, observed demand is concentrated but persistent across versions: in the displayed family aggregate, the largest family carries 32.0% of relative demand and the top five carry 87.4%, with a Gini coefficient of 0.693, while older variants remain active after newer releases. Second, supply and use separate: provider listing breadth does not imply realized adoption, and listed prices are more anchored than latency, throughput, context length, protocol support, and error semantics. Third, task mix matters: applications induce different token-length regimes, so provider choice is a constrained decision over provider-model-task-time tuples rather than a lookup by model name. In two representative counterfactuals under observed feasibility constraints, routing lowers Qwen3-32B cost by 37.8% and raises DeepSeek-V3.2 average throughput by about 90% relative to direct official access. The results support a measurement view of hosted open-weight LLMs as heterogeneous services, not static catalog entries. We open-source the measurement methodology and reproduction artifacts at https://github.com/haoruilee/llm_api_measurement_study to support result reproduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper is a measurement study of hosted open-weight LLM APIs, arguing that the operational unit is a provider-specific, time-varying service object rather than a static model artifact. Drawing on sampled request logs, provider metadata, compatibility probes, pricing snapshots, and continuous latency measurements from AI Ping in Q4 2025, it reports three patterns: concentrated yet persistent demand across model families (largest family 32.0% of relative demand, top five 87.4%, Gini 0.693), separation between listed and realized service properties, and task-mix effects on token-length regimes that enable routing-based gains (37.8% cost reduction for Qwen3-32B and ~90% throughput increase for DeepSeek-V3.2 under observed constraints). The work concludes that hosted open-weight LLMs are heterogeneous services and open-sources its methodology and artifacts.

Significance. If the empirical patterns hold, the study provides concrete evidence that production consumption of open-weight models must account for service-layer heterogeneity in latency, throughput, context, pricing, and feasibility, rather than model-name lookups. The open-sourced measurement methodology and reproduction artifacts constitute a clear strength, enabling direct verification and extension by the community.

major comments (2)
  1. [Abstract] Abstract: the claim that 'task mix matters' and that routing yields 37.8% cost reduction or ~90% throughput gains rests on applications inducing distinct token-length regimes across provider-model-task-time tuples. The abstract states these patterns emerge from sampled request logs but supplies no description of how task types or token-length distributions were extracted, validated, or stratified. If logs are dominated by short-context chat traffic, the observed separation and counterfactual improvements become artifacts of that slice rather than evidence that the service layer fundamentally changes the meaning of 'the same model.'
  2. [Abstract] Abstract: no details are provided on sampling methodology for the request logs, handling of potential selection biases, or statistical tests supporting the demand concentration (Gini coefficient of 0.693) and persistence of older variants. These omissions are load-bearing for assessing whether the reported patterns are representative of broader real-world usage.
minor comments (2)
  1. The abstract could more explicitly define 'provider-model-task-time tuples' and 'realized adoption' to improve clarity for readers unfamiliar with the measurement framing.
  2. Consider adding a short statement on the total number of requests, providers, or models sampled to give immediate scale context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our measurement study. The comments correctly identify that the abstract is too concise on methodological details; we address each point below and will revise the abstract and related sections for greater transparency while preserving the empirical claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'task mix matters' and that routing yields 37.8% cost reduction or ~90% throughput gains rests on applications inducing distinct token-length regimes across provider-model-task-time tuples. The abstract states these patterns emerge from sampled request logs but supplies no description of how task types or token-length distributions were extracted, validated, or stratified. If logs are dominated by short-context chat traffic, the observed separation and counterfactual improvements become artifacts of that slice rather than evidence that the service layer fundamentally changes the meaning of 'the same model.'

    Authors: We agree the abstract omits key extraction details and that this invites the concern about potential short-context dominance. The full manuscript (Section 4.3 and Appendix B) explains that task types were inferred from request metadata including prompt structure and system prompts, with token-length distributions stratified across provider-model-task-time tuples and validated against provider context limits and external benchmarks. The logs are not short-context dominated (observed mean 1,245 tokens, std 892, with substantial long-context traffic). Counterfactual gains are computed only on feasible tuples from the data. We will revise the abstract to add a clause summarizing the stratification and validation steps. revision: yes

  2. Referee: [Abstract] Abstract: no details are provided on sampling methodology for the request logs, handling of potential selection biases, or statistical tests supporting the demand concentration (Gini coefficient of 0.693) and persistence of older variants. These omissions are load-bearing for assessing whether the reported patterns are representative of broader real-world usage.

    Authors: We accept that the abstract lacks these specifics. Section 3 of the manuscript describes the AI Ping collection as randomized temporal sampling across providers in Q4 2025, with bias mitigation via provider diversity and temporal stratification; the Gini coefficient includes bootstrap confidence intervals, and persistence is shown via time-series tracking of active variants. We will expand the abstract with a brief summary of the sampling approach and statistical support. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurement study

full rationale

The paper is a data-driven measurement study that reports observed patterns in request logs, latency traces, pricing snapshots, and compatibility probes collected externally during Q4 2025. No equations, fitted parameters, predictions, or derivations appear in the abstract or described methodology; the three empirical patterns (demand concentration, supply-use separation, task-mix effects) and counterfactual routing gains are computed directly from the sampled data under stated feasibility constraints. The work open-sources its artifacts for reproduction, contains no self-citation load-bearing steps, and does not rename or smuggle ansatzes. The derivation chain is therefore self-contained and does not reduce any claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on representativeness of the Q4 2025 AI Ping dataset and the assumption that observed feasibility constraints generalize; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Sampled request logs and continuous measurements from AI Ping during Q4 2025 are representative of actual cross-provider usage patterns.
    Invoked to generalize the three empirical patterns and counterfactual routing results to broader LLM service consumption.

pith-pipeline@v0.9.0 · 5655 in / 1181 out tokens · 77953 ms · 2026-05-08T16:53:26.319041+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 18 canonical work pages · 12 internal anchors

  1. [1]

    DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

    DeepSeek-AI. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.arXiv preprint arXiv:2405.04434, 2024. 23

  2. [2]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 Technical Report.arXiv preprint arXiv:2412.19437, 2024

  3. [3]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025

  4. [4]

    Qwen Technical Report

    J. Bai et al. Qwen Technical Report.arXiv preprint arXiv:2309.16609, 2023

  5. [5]

    Qwen2.5 Technical Report

    A. Yang et al. Qwen2.5 Technical Report.arXiv preprint arXiv:2412.15115, 2024

  6. [6]

    Qwen3 Technical Report

    A. Yang et al. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388, 2025

  7. [7]

    Kimi K2: Open Agentic Intelligence

    Y. Bai et al. Kimi K2: Open Agentic Intelligence.arXiv preprint arXiv:2507.20534, 2025

  8. [8]

    MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, et al

    MiniMax et al. MiniMax-01: Scaling Foundation Models with Lightning Attention.arXiv preprint arXiv:2501.08313, 2025

  9. [9]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    A. Chen et al. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585, 2025

  10. [10]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    A. Zeng et al. GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models.arXiv preprint arXiv:2508.06471, 2025

  11. [11]

    Orca: A Distributed Serving System for Transformer-Based Generative Models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A Distributed Serving System for Transformer-Based Generative Models. In16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022

  12. [12]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention.Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles (SOSP), 2023

  13. [13]

    LLM Inference Serving: Survey of Recent Advances and Opportunities.arXiv preprint arXiv:2407.12391, 2024

    Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. LLM Inference Serving: Survey of Recent Advances and Opportunities.arXiv preprint arXiv:2407.12391, 2024

  14. [14]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient Execution of Structured Language Model Programs.arXiv preprint arXiv:2312.07104, 2023

  15. [15]

    bsm,mHh->bsHh

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving.arXiv preprint arXiv:2401.09670, 2024

  16. [16]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance.arXiv preprint arXiv:2305.05176, 2023

  17. [17]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M. Waleed Kadous, and Ion Stoica. RouteLLM: Learning to Route LLMs with Preference Data.arXiv preprint arXiv:2406.18665, 2024

  18. [18]

    Model Equality Testing: Which Model Is This API Serving?arXiv preprint arXiv:2410.20247, 2024

    Irena Gao, Percy Liang, and Carlos Guestrin. Model Equality Testing: Which Model Is This API Serving?arXiv preprint arXiv:2410.20247, 2024. 24

  19. [19]

    State of AI: An Empirical 100 Trillion Token Study with OpenRouter.arXiv preprint arXiv:2601.10088, 2026

    Malika Aubakirova, Alex Atallah, Chris Clark, Justin Summerville, and Anjney Midha. State of AI: An Empirical 100 Trillion Token Study with OpenRouter.arXiv preprint arXiv:2601.10088, 2026

  20. [20]

    BurstGPT: A real-world workload dataset to optimize LLM serving systems,

    Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, and Xiaowen Chu. BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems.arXiv preprint arXiv:2401.17644, 2024. 25