pith. machine review for the scientific record. sign in

arxiv: 2604.09557 · v1 · submitted 2026-02-10 · 💻 cs.DC · cs.AI

Recognition: no theorem link

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Authors on Pith no claims yet

Pith reviewed 2026-05-16 03:05 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords speculative decodingLLM inferencebenchmarkthroughput evaluationsemantic diversityproduction enginesvLLMTensorRT-LLM
0
0 comments X

The pith

SPEED-Bench establishes a unified benchmark for speculative decoding that covers diverse semantic domains, throughput across concurrencies, and integration with production engines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding accelerates LLM inference but its gains depend on the input data, so existing benchmarks with narrow tasks and synthetic data give incomplete pictures. SPEED-Bench supplies a qualitative split chosen to maximize semantic variety across samples plus a throughput split that measures speedups from low-batch latency settings to high-concurrency loads. The benchmark wires directly into engines such as vLLM and TensorRT-LLM, exposing effects that high-level simulators hide. It shows synthetic inputs inflate reported throughput, that optimal draft lengths shift with batch size, and that low-diversity data creates measurable biases. A practitioner can therefore compare speculative decoding methods on workloads that better match actual serving conditions.

Core claim

SPEED-Bench establishes a unified evaluation standard for practical comparisons of SD algorithms by offering diverse semantic domains, throughput splits across concurrencies, and integration with production engines like vLLM and TensorRT-LLM. It quantifies how synthetic inputs overestimate real-world throughput, identifies batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzes the caveats of vocabulary pruning in state-of-the-art drafters.

What carries the argument

SPEED-Bench suite, built from a Qualitative data split prioritized for semantic diversity across samples and a Throughput data split spanning latency-sensitive to high-load concurrencies, integrated directly with production engines.

If this is right

  • Synthetic inputs overestimate real-world throughput gains from speculative decoding.
  • Optimal draft lengths vary with batch size in production settings.
  • Low-diversity data introduces systematic biases in measured speedups.
  • Vocabulary pruning in current drafters carries identifiable limitations under realistic loads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread use of SPEED-Bench could replace ad-hoc evaluations and make head-to-head claims about new speculative decoding methods more reliable.
  • The split design may transfer to other data-dependent LLM serving techniques such as speculative sampling or tree decoding.
  • Extending the benchmark with additional languages or multimodal inputs would test whether the current diversity criteria generalize.

Load-bearing premise

That the curated qualitative split and the production-engine integrations are representative enough to reveal behaviors that other benchmarks mask.

What would settle it

If side-by-side runs of the same speculative decoding algorithms on SPEED-Bench and prior benchmarks produce identical speedup rankings and no new batch-size or diversity effects, the added splits and integrations would not change practical conclusions.

Figures

Figures reproduced from arXiv: 2604.09557 by Benjamin Chislett, Bita Darvish Rouhani, Carl (Izzy) Putterman, Maor Ashkenazi, Ran Zilberstein, Talor Abramovich, Tiyasa Mitra, Yonatan Geifman.

Figure 1
Figure 1. Figure 1: Overview of the SPEED-Bench ecosystem. (Left) Curation of the Qualitative split, utilizing a custom selection algorithm on prompt embeddings to maximize semantic diversity across categories. (Middle) Construction of the Throughput Split, where data is aggregated and processed into fixed Input Sequence Length (ISL) buckets (1k-32k) across three domain difficulties, supporting large batch sizes (up to 512 pe… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of average semantic similarity between samples (lower is better). SPEED-Bench achieves lower similarity than both random selection and SpecBench across all categories. with Local Swap Refinement (see Algorithm 1). We initialize S with a random index and iteratively append i ∗ = argmini /∈S P j∈S x ⊤ i xj . To escape local minima, we then iteratively swap iout ∈ S with iin ∈/ S if the swap strict… view at source ↗
Figure 3
Figure 3. Figure 3: Average AL on the Qualitative Split. External drafting scales better across DLs. the SpecBench framework excels at evaluating methods us￾ing native PyTorch/HuggingFace, SPEED-Bench focuses on the viability of these methods in deployment. To support a holistic pipeline, we demonstrate how SpecBench models can be evaluated within our framework. The supplementary material includes an example for SpecBench’s M… view at source ↗
Figure 5
Figure 5. Figure 5: Average AL across selected categories in SpecBench vs SPEED-Bench. Target model is Llama 3.3 70B. DL = 7. Full results are in Appendix K. narios. Unlike methods that focus on latency at BS = 1, SPEED-Bench enables the construction of throughput￾latency Pareto curves, providing insights into the interplay between BS, DL, and inference engines. Random data VS SPEED-Bench In Section 6, we iden￾tified the risk… view at source ↗
Figure 6
Figure 6. Figure 6: Throughput as a function of user TPS, comparing random input tokens to the Throughput Split (8k). Target is GPT-OSS 120B with EAGLE3 drafter, measured on TensorRT-LLM. DL = 3. Points represent BS from 1 to 128. 0 100 200 300 400 User TPS 0 5000 10000 15000 Output TPS per GPU Draft Length=1 Draft Length=3 w/o SD [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Throughput as a function of user TPS, comparing DL = 1, 3 on the Throughput Split (2k). Target is GPT-OSS 120B with EAGLE3, measured on vLLM. Points represent BS from 2 to 512. in Appendix F: random inputs fail to trigger realistic expert routing in the MoE target model. This leads to inaccurate step latency measurements even without speculation. Optimal DL selection [PITH_FULL_IMAGE:figures/full_fig_p008… view at source ↗
Figure 9
Figure 9. Figure 9: Pairwise similarity matrices for the ’Translation/Multilingual’ category. SpecBench (left) shows dense blocks of high similarity, indicating redundant data. SPEED-Bench (right) shows a dispersed, low-similarity distribution, demonstrating better semantic diversity [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: display the pairwise cosine similarity matrices for two categories: Translation/Multilingual and Math, respectively. In these heatmaps, darker green values indicate high semantic similarity (redundancy), while lighter yellow values indicate low similarity (diversity). • SpecBench (Left Column): This figure reveals clusters of highly repetitive prompts (e.g., the same math problem with minor changes, or id… view at source ↗
Figure 11
Figure 11. Figure 11: illustrates the activation frequency of the top-k experts for a middle layer (Layer 17) in GPT-OSS 120B during the prefill of 8k ISL inputs at a batch size of 32. While SPEED-Bench inputs result in a relatively uniform activation profile, random tokens lead to significant imbalance, where the router disproportionately favors a subset of experts [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: tracks the total number of unique experts activated across layers of the model. Notably, processing random tokens fails to activate 20-30% of available experts in certain layers. This lack of coverage is interesting given the high volume of tokens (32 × 8000), confirming that synthetic noise fails to trigger the routing logic that occurs on real semantic workloads. 0 32 64 96 128 10 1 10 3 10 5 10 7 Frequ… view at source ↗
Figure 13
Figure 13. Figure 13: presents the average AL as a function of ISL for three setups. For Vanilla SD (Llama 3.3 70B) and Native MTP (Qwen3-Next), we observe the expected behavior: Low Entropy prompts (e.g., coding, sorting) yield the highest ALs. High Entropy prompts (e.g., creative writing, roleplay) yield the lowest ALs. Mixed Entropy prompts (e.g., STEM and general knowledge) fall in between. Furthermore, these methods demon… view at source ↗
Figure 14
Figure 14. Figure 14: Average AL across all categories in SpecBench vs. SPEED-Bench. Target model is Llama 3.3 70B. DL=7, BS=32. L. Inference Engine Comparison In Section 8.4, we briefly discussed the performance differences between inference backends. Here we provide the full comparison between TensorRT-LLM and vLLM [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: compares the throughput of TensorRT-LLM and vLLM. Both frameworks are orchestrated in Python, which can introduce host synchronization overhead and kernel launch latency compared to C++ implementations. To mitigate this, both engines leverage CUDA Graphs to capture and replay device operations with a single launch. We observe that TensorRT-LLM achieves higher throughput in this configuration, largely due … view at source ↗
Figure 16
Figure 16. Figure 16: AL Stability across various models. Average AL measured on the Throughput Split buckets (1k–32k). Target is GPT-OSS 120B, with three EAGLE3 drafters. Carefully configured RoPE scaling can ensure stability over all context lengths. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗
read the original abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces SPEED-Bench, a benchmark suite for Speculative Decoding (SD) in LLMs. It features a Qualitative data split curated by prioritizing semantic diversity across samples, a Throughput data split supporting speedup measurements across concurrencies from latency-sensitive low-batch to high-load regimes, and direct integrations with production engines such as vLLM and TensorRT-LLM. The authors claim this enables quantification of synthetic-input overestimation of real-world throughput, identification of batch-size-dependent optimal draft lengths, detection of biases in low-diversity data, and analysis of vocabulary-pruning caveats, thereby establishing a unified standard for practical SD comparisons.

Significance. If the benchmark's data splits prove representative and the engine integrations reliably expose production behaviors, SPEED-Bench could become a standard reference for SD evaluation, enabling more accurate cross-algorithm comparisons and highlighting limitations of synthetic or low-diversity workloads that current benchmarks obscure.

major comments (3)
  1. [Abstract / Data Curation description] The central claim that the Qualitative data split 'sufficiently represents real-world workloads' rests on curation by 'prioritizing semantic diversity,' yet the manuscript provides no quantitative validation metrics (e.g., embedding variance, topic entropy, distributional similarity to production traces, or held-out query validation). This assumption is load-bearing for the assertions about realistic serving regimes and unified evaluation standard.
  2. [Throughput evaluation and engine integration sections] The Throughput data split and vLLM/TensorRT-LLM integrations are presented as exposing system behaviors masked by high-level implementations, but the manuscript lacks concrete implementation details, quantitative comparisons (e.g., throughput deltas or latency breakdowns), or ablation showing how these integrations differ from prior high-level SD evaluations.
  3. [Results / Highlighted observations] Observations such as 'synthetic inputs overestimate real-world throughput' and 'batch-size dependent optimal draft lengths' are highlighted, but without accompanying methodology details, data statistics, tables of quantitative results, or error bars, it is not possible to verify the strength of support for these claims.
minor comments (1)
  1. [Abstract] The abstract refers to 'we highlight this by quantifying...' without cross-references to specific sections, figures, or tables where the quantitative results appear; adding such pointers would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on SPEED-Bench. We address each major comment below and will revise the manuscript to incorporate quantitative validations, implementation details, and expanded results sections as suggested.

read point-by-point responses
  1. Referee: [Abstract / Data Curation description] The central claim that the Qualitative data split 'sufficiently represents real-world workloads' rests on curation by 'prioritizing semantic diversity,' yet the manuscript provides no quantitative validation metrics (e.g., embedding variance, topic entropy, distributional similarity to production traces, or held-out query validation). This assumption is load-bearing for the assertions about realistic serving regimes and unified evaluation standard.

    Authors: We agree that quantitative validation metrics would strengthen the claim of representativeness. In the revised manuscript, we will add embedding variance, topic entropy, distributional similarity to production traces, and held-out query validation to the data curation section to empirically support the semantic diversity prioritization. revision: yes

  2. Referee: [Throughput evaluation and engine integration sections] The Throughput data split and vLLM/TensorRT-LLM integrations are presented as exposing system behaviors masked by high-level implementations, but the manuscript lacks concrete implementation details, quantitative comparisons (e.g., throughput deltas or latency breakdowns), or ablation showing how these integrations differ from prior high-level SD evaluations.

    Authors: We will expand these sections with concrete implementation details for the vLLM and TensorRT-LLM integrations, including pseudocode and configuration specifics. Quantitative comparisons such as throughput deltas and latency breakdowns versus high-level baselines, plus ablations, will be added to demonstrate the differences. revision: yes

  3. Referee: [Results / Highlighted observations] Observations such as 'synthetic inputs overestimate real-world throughput' and 'batch-size dependent optimal draft lengths' are highlighted, but without accompanying methodology details, data statistics, tables of quantitative results, or error bars, it is not possible to verify the strength of support for these claims.

    Authors: We acknowledge the need for greater transparency. The revised Results section will include detailed methodology, data statistics, tables with quantitative results, and error bars from repeated runs to allow verification of the observations on synthetic input overestimation and batch-size dependent draft lengths. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark introduction relies on curation choices and integrations, not derived predictions

full rationale

The paper presents SPEED-Bench as a new evaluation suite with curated data splits and production-engine integrations. No equations, fitted parameters, or first-principles derivations appear in the manuscript. The qualitative split is introduced via an explicit curation decision (prioritizing semantic diversity), which is an input rather than a result derived from the benchmark itself. Throughput splits and vLLM/TensorRT-LLM integrations are described as engineering contributions without self-referential reduction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is therefore self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that semantic diversity in the selected samples represents real workloads and that production-engine integration reveals otherwise masked behaviors.

axioms (1)
  • domain assumption The qualitative data split selected by prioritizing semantic diversity across samples is representative of real-world semantic domains and workloads
    Invoked to justify the curation of the qualitative split as described in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1233 out tokens · 213007 ms · 2026-05-16T03:05:43.307851+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 13 internal anchors

  1. [1]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  2. [2]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    Pard: Accelerating llm inference with low-cost parallel draft model adaptation

    An, Z., Bai, H., Liu, Z., Li, D., and Barsoum, E. Pard: Accelerating llm inference with low-cost parallel draft model adaptation. arXiv preprint arXiv:2504.18583, 2025

  4. [4]

    Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

    Bai, G., Liu, J., Bu, X., He, Y., Liu, J., Zhou, Z., Lin, Z., Su, W., Ge, T., Zheng, B., et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv preprint arXiv:2402.14762, 2024

  5. [5]

    Nvidia nemotron 3: Efficient and open intelligence

    Blakeman, A., Grattafiori, A., Basant, A., Gupta, A., Khattar, A., Renduchintala, A., Vavre, A., Shukla, A., Bercovich, A., Ficek, A., et al. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025

  6. [6]

    Long code arena: a set of benchmarks for long-context code models

    Bogomolov, E., Eliseeva, A., Galimzyanov, T., Glukhov, E., Shapkin, A., Tigina, M., Golubev, Y., Kovrigin, A., van Deursen, A., Izadi, M., and Bryksin, T. Long code arena: a set of benchmarks for long-context code models. arXiv preprint arXiv:2406.11612, 2024

  7. [7]

    Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., Soricut, R., Specia, L., and Tamchyna, A. s. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp.\ 12--58, Baltimore, Maryland, USA, June 2014. A...

  8. [8]

    D., Chen, D., and Dao, T

    Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning, pp.\ 5209--5235. PMLR, 2024

  9. [9]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

  10. [10]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Ch...

  11. [11]

    Sequoia: Scalable, robust, and hardware-aware speculative decoding

    Chen, Z., May, A., Svirschevski, R., Huang, Y., Ryabinin, M., Jia, Z., and Chen, B. Sequoia: Scalable, robust, and hardware-aware speculative decoding. CoRR, 2024

  12. [12]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  13. [13]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ding, N., Chen, Y., Xu, B., Qin, Y., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 3029--3051, 2023

  14. [14]

    X., and Wen, J.-R

    Dong, Z., Tang, T., Li, J., Zhao, W. X., and Wen, J.-R. Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models. arXiv preprint arXiv:2309.13345, 2023

  15. [15]

    Break the sequential dependency of llm inference using lookahead decoding

    Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break the sequential dependency of llm inference using lookahead decoding. In Forty-first International Conference on Machine Learning

  16. [16]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., and et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  17. [17]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  18. [18]

    REST : Retrieval-based speculative decoding

    He, Z., Zhong, Z., Cai, T., Lee, J., and He, D. REST : Retrieval-based speculative decoding. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 1582--1595, Mexico City, Mexico, June 2024. ...

  19. [19]

    Moesd: Unveil speculative decoding's potential for accelerating sparse moe

    Huang, Z., Zhu, L., Zhan, Z., Hu, T., Mao, W., Yu, X., Liu, Y., and Zhang, T. Moesd: Unveil speculative decoding's potential for accelerating sparse moe. arXiv preprint arXiv:2505.19645, 2025

  20. [20]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  21. [21]

    H., Gonzalez, J

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  22. [22]

    Fast inference from transformers via speculative decoding

    Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023

  23. [23]

    M., Ghaddar, A., Sun, Q., Ma, L., Luo, Y., Li, D., Coates, M., Hao, J., and Zhang, Y

    Li, D., Zhou, J., Brunswic, L. M., Ghaddar, A., Sun, Q., Ma, L., Luo, Y., Li, D., Coates, M., Hao, J., and Zhang, Y. Omni-thinker: Scaling multi-task rl in llms with hybrid reward and task scheduling, 2025 a . URL https://arxiv.org/abs/2507.14783

  24. [24]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D., Sutherland Robson, E., Kohli, P., de Freitas, N., Kavukcuoglu, K., and Vinyals, O...

  25. [25]

    EAGLE : Speculative sampling requires rethinking feature uncertainty

    Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE : Speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, 2024 a

  26. [26]

    EAGLE-2 : Faster inference of language models with dynamic draft trees

    Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE-2 : Faster inference of language models with dynamic draft trees. In Empirical Methods in Natural Language Processing, 2024 b

  27. [27]

    EAGLE-3 : Scaling up inference acceleration of large language models via training-time test

    Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE-3 : Scaling up inference acceleration of large language models via training-time test. In Annual Conference on Neural Information Processing Systems, 2025 b

  28. [28]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    Liu, T., Xu, C., and McAuley, J. Repobench: Benchmarking repository-level code auto-completion systems, 2024 a . URL https://arxiv.org/abs/2306.03091

  29. [29]

    Chatqa: Surpassing gpt-4 on conversational qa and rag

    Liu, Z., Ping, W., Roy, R., Xu, P., Lee, C., Shoeybi, M., and Catanzaro, B. Chatqa: Surpassing gpt-4 on conversational qa and rag. arXiv preprint arXiv:2401.10225, 2024 b

  30. [30]

    X., Sha, J., Wang, S., and Wen, J.-R

    Luo, W., Zhao, W. X., Sha, J., Wang, S., and Wen, J.-R. Mmath: A multilingual benchmark for mathematical reasoning. arXiv preprint arXiv:2505.19126, 2025

  31. [31]

    Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Zhang, Z., Wong, R. Y. Y., Zhu, A., Yang, L., Shi, X., et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vo...

  32. [32]

    Y., Singh, S., Tang, X., von Werra, L., and Longpre, S

    Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023

  33. [33]

    Tensorrt‑llm: High‑performance inference for large language models

    NVIDIA . Tensorrt‑llm: High‑performance inference for large language models. https://github.com/NVIDIA/TensorRT-LLM, 2023. Accessed: 2026‑01‑06

  34. [34]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925

  35. [35]

    Paech, S. J. Eq-bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025

  36. [36]

    Mcif: Multimodal crosslingual instruction-following benchmark from scientific talks, 2025

    Papi, S., Züfle, M., Gaido, M., Savoldi, B., Liu, D., Douros, I., Bentivogli, L., and Niehues, J. Mcif: Multimodal crosslingual instruction-following benchmark from scientific talks, 2025. URL https://arxiv.org/abs/2507.19634

  37. [37]

    Ya RN : Efficient context window extension of large language models

    Peng, B., Quesnelle, J., Fan, H., and Shippole, E. Ya RN : Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

  38. [38]

    Humanity's Last Exam

    Phan, L., Gatti, A., Han, Z., and et al. Humanity's last exam, 2025. URL https://arxiv.org/abs/2501.14249

  39. [39]

    Project gutenberg

    Project Gutenberg . Project gutenberg. https://www.gutenberg.org

  40. [40]

    E.-H., May, A., Chen, T., and Chen, B

    Sadhukhan, R., Chen, J., Chen, Z., Tiwari, V., Lai, R., Shi, J., Yen, I. E.-H., May, A., Chen, T., and Chen, B. Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. In The Thirteenth International Conference on Learning Representations

  41. [42]

    J., and Manning, C

    See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1073--1083, Vancouver, Canada, July 2017 b . Association for Computational Linguistics. doi:10.18653/v1/P17-1099

  42. [43]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  43. [44]

    Ada-leval: Evaluating long-context llms with length-adaptable benchmarks

    Wang, C., Duan, H., Zhang, S., Lin, D., and Chen, K. Ada-leval: Evaluating long-context llms with length-adaptable benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 3712--3724, 2024 a

  44. [45]

    Coser: Coordinating llm-based persona simulation of established roles, 2025

    Wang, X., Wang, H., Zhang, Y., Yuan, X., Xu, R., tse Huang, J., Yuan, S., Guo, H., Chen, J., Wang, W., Xiao, Y., and Zhou, S. Coser: Coordinating llm-based persona simulation of established roles, 2025. URL https://arxiv.org/abs/2502.09082

  45. [46]

    Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

    Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024 b

  46. [47]

    M., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Zhang, M., Zhang, Z., Ouyang, W., Xu, K., Chen, W., Fu, J., and Peng, J

    Wang, Z. M., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Zhang, M., Zhang, Z., Ouyang, W., Xu, K., Chen, W., Fu, J., and Peng, J. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv: 2310.00746, 2023

  47. [48]

    L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A

    Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conferenc...

  48. [49]

    WritingBench: A Comprehensive Benchmark for Generative Writing, March 2025

    Wu, Y., Mei, J., Yan, M., Li, C., Lai, S., Ren, Y., Wang, Z., Zhang, J., Wu, M., Jin, Q., and Huang, F. Writingbench: A comprehensive benchmark for generative writing, 2025. URL https://arxiv.org/abs/2503.05244

  49. [50]

    Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

    Xia, H., Yang, Z., Dong, Q., Wang, P., Li, Y., Ge, T., Liu, T., Li, W., and Sui, Z. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics ACL 2024, pp.\ 7655--7671, Bangkok, Thailand and virtual me...

  50. [51]

    Parallelspec: Parallel drafter for efficient speculative decoding

    Xiao, Z., Zhang, H., Ge, T., Ouyang, S., Ordonez, V., and Yu, D. Parallelspec: Parallel drafter for efficient speculative decoding. arXiv preprint arXiv:2410.05589, 2024

  51. [52]

    MiMo-V2-Flash Technical Report

    Xiaomi, L.-C. Mimo-v2-flash technical report, 2026. URL https://arxiv.org/abs/2601.02780

  52. [53]

    Xu, Z., Jiang, F., Niu, L., Deng, Y., Poovendran, R., Choi, Y., and Lin, B. Y. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464, 2024

  53. [54]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025 a

  54. [55]

    Longspec: Long-context lossless speculative decoding with efficient drafting and verification

    Yang, P., Du, C., Zhang, F., Wang, H., Pang, T., Du, C., and An, B. Longspec: Long-context lossless speculative decoding with efficient drafting and verification. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025 b

  55. [56]

    Improving massively multilingual neural machine translation and zero-shot translation

    Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 1628--1639, Online, July 2020. Association for Computati...

  56. [57]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023

  57. [58]

    SGLang: Efficient Execution of Structured Language Model Programs

    Zheng, L., Yin, L., Xie, Z., Sun, C. L., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., and Sheng, Y. Sglang: Efficient execution of structured language model programs. In Conference on Neural Information Processing Systems (NeurIPS), 2024. doi:10.48550/arXiv.2312.07104