arxiv: 2604.09557 · v1 · submitted 2026-02-10 · 💻 cs.DC · cs.AI

Recognition: no theorem link

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Talor Abramovich , Maor Ashkenazi , Carl (Izzy) Putterman , Benjamin Chislett , Tiyasa Mitra , Bita Darvish Rouhani , Ran Zilberstein , Yonatan Geifman

Authors on Pith no claims yet

Pith reviewed 2026-05-16 03:05 UTC · model grok-4.3

classification 💻 cs.DC cs.AI

keywords speculative decodingLLM inferencebenchmarkthroughput evaluationsemantic diversityproduction enginesvLLMTensorRT-LLM

0 comments

The pith

SPEED-Bench establishes a unified benchmark for speculative decoding that covers diverse semantic domains, throughput across concurrencies, and integration with production engines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding accelerates LLM inference but its gains depend on the input data, so existing benchmarks with narrow tasks and synthetic data give incomplete pictures. SPEED-Bench supplies a qualitative split chosen to maximize semantic variety across samples plus a throughput split that measures speedups from low-batch latency settings to high-concurrency loads. The benchmark wires directly into engines such as vLLM and TensorRT-LLM, exposing effects that high-level simulators hide. It shows synthetic inputs inflate reported throughput, that optimal draft lengths shift with batch size, and that low-diversity data creates measurable biases. A practitioner can therefore compare speculative decoding methods on workloads that better match actual serving conditions.

Core claim

SPEED-Bench establishes a unified evaluation standard for practical comparisons of SD algorithms by offering diverse semantic domains, throughput splits across concurrencies, and integration with production engines like vLLM and TensorRT-LLM. It quantifies how synthetic inputs overestimate real-world throughput, identifies batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzes the caveats of vocabulary pruning in state-of-the-art drafters.

What carries the argument

SPEED-Bench suite, built from a Qualitative data split prioritized for semantic diversity across samples and a Throughput data split spanning latency-sensitive to high-load concurrencies, integrated directly with production engines.

If this is right

Synthetic inputs overestimate real-world throughput gains from speculative decoding.
Optimal draft lengths vary with batch size in production settings.
Low-diversity data introduces systematic biases in measured speedups.
Vocabulary pruning in current drafters carries identifiable limitations under realistic loads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of SPEED-Bench could replace ad-hoc evaluations and make head-to-head claims about new speculative decoding methods more reliable.
The split design may transfer to other data-dependent LLM serving techniques such as speculative sampling or tree decoding.
Extending the benchmark with additional languages or multimodal inputs would test whether the current diversity criteria generalize.

Load-bearing premise

That the curated qualitative split and the production-engine integrations are representative enough to reveal behaviors that other benchmarks mask.

What would settle it

If side-by-side runs of the same speculative decoding algorithms on SPEED-Bench and prior benchmarks produce identical speedup rankings and no new batch-size or diversity effects, the added splits and integrations would not change practical conclusions.

Figures

Figures reproduced from arXiv: 2604.09557 by Benjamin Chislett, Bita Darvish Rouhani, Carl (Izzy) Putterman, Maor Ashkenazi, Ran Zilberstein, Talor Abramovich, Tiyasa Mitra, Yonatan Geifman.

**Figure 1.** Figure 1: Overview of the SPEED-Bench ecosystem. (Left) Curation of the Qualitative split, utilizing a custom selection algorithm on prompt embeddings to maximize semantic diversity across categories. (Middle) Construction of the Throughput Split, where data is aggregated and processed into fixed Input Sequence Length (ISL) buckets (1k-32k) across three domain difficulties, supporting large batch sizes (up to 512 pe… view at source ↗

**Figure 2.** Figure 2: Comparison of average semantic similarity between samples (lower is better). SPEED-Bench achieves lower similarity than both random selection and SpecBench across all categories. with Local Swap Refinement (see Algorithm 1). We initialize S with a random index and iteratively append i ∗ = argmini /∈S P j∈S x ⊤ i xj . To escape local minima, we then iteratively swap iout ∈ S with iin ∈/ S if the swap strict… view at source ↗

**Figure 3.** Figure 3: Average AL on the Qualitative Split. External drafting scales better across DLs. the SpecBench framework excels at evaluating methods using native PyTorch/HuggingFace, SPEED-Bench focuses on the viability of these methods in deployment. To support a holistic pipeline, we demonstrate how SpecBench models can be evaluated within our framework. The supplementary material includes an example for SpecBench’s M… view at source ↗

**Figure 5.** Figure 5: Average AL across selected categories in SpecBench vs SPEED-Bench. Target model is Llama 3.3 70B. DL = 7. Full results are in Appendix K. narios. Unlike methods that focus on latency at BS = 1, SPEED-Bench enables the construction of throughputlatency Pareto curves, providing insights into the interplay between BS, DL, and inference engines. Random data VS SPEED-Bench In Section 6, we identified the risk… view at source ↗

**Figure 6.** Figure 6: Throughput as a function of user TPS, comparing random input tokens to the Throughput Split (8k). Target is GPT-OSS 120B with EAGLE3 drafter, measured on TensorRT-LLM. DL = 3. Points represent BS from 1 to 128. 0 100 200 300 400 User TPS 0 5000 10000 15000 Output TPS per GPU Draft Length=1 Draft Length=3 w/o SD [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Throughput as a function of user TPS, comparing DL = 1, 3 on the Throughput Split (2k). Target is GPT-OSS 120B with EAGLE3, measured on vLLM. Points represent BS from 2 to 512. in Appendix F: random inputs fail to trigger realistic expert routing in the MoE target model. This leads to inaccurate step latency measurements even without speculation. Optimal DL selection [PITH_FULL_IMAGE:figures/full_fig_p008… view at source ↗

**Figure 9.** Figure 9: Pairwise similarity matrices for the ’Translation/Multilingual’ category. SpecBench (left) shows dense blocks of high similarity, indicating redundant data. SPEED-Bench (right) shows a dispersed, low-similarity distribution, demonstrating better semantic diversity [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: display the pairwise cosine similarity matrices for two categories: Translation/Multilingual and Math, respectively. In these heatmaps, darker green values indicate high semantic similarity (redundancy), while lighter yellow values indicate low similarity (diversity). • SpecBench (Left Column): This figure reveals clusters of highly repetitive prompts (e.g., the same math problem with minor changes, or id… view at source ↗

**Figure 11.** Figure 11: illustrates the activation frequency of the top-k experts for a middle layer (Layer 17) in GPT-OSS 120B during the prefill of 8k ISL inputs at a batch size of 32. While SPEED-Bench inputs result in a relatively uniform activation profile, random tokens lead to significant imbalance, where the router disproportionately favors a subset of experts [PITH_FULL_IMAGE:figures/full_fig_p019_11.png] view at source ↗

**Figure 12.** Figure 12: tracks the total number of unique experts activated across layers of the model. Notably, processing random tokens fails to activate 20-30% of available experts in certain layers. This lack of coverage is interesting given the high volume of tokens (32 × 8000), confirming that synthetic noise fails to trigger the routing logic that occurs on real semantic workloads. 0 32 64 96 128 10 1 10 3 10 5 10 7 Frequ… view at source ↗

**Figure 13.** Figure 13: presents the average AL as a function of ISL for three setups. For Vanilla SD (Llama 3.3 70B) and Native MTP (Qwen3-Next), we observe the expected behavior: Low Entropy prompts (e.g., coding, sorting) yield the highest ALs. High Entropy prompts (e.g., creative writing, roleplay) yield the lowest ALs. Mixed Entropy prompts (e.g., STEM and general knowledge) fall in between. Furthermore, these methods demon… view at source ↗

**Figure 14.** Figure 14: Average AL across all categories in SpecBench vs. SPEED-Bench. Target model is Llama 3.3 70B. DL=7, BS=32. L. Inference Engine Comparison In Section 8.4, we briefly discussed the performance differences between inference backends. Here we provide the full comparison between TensorRT-LLM and vLLM [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗

**Figure 15.** Figure 15: compares the throughput of TensorRT-LLM and vLLM. Both frameworks are orchestrated in Python, which can introduce host synchronization overhead and kernel launch latency compared to C++ implementations. To mitigate this, both engines leverage CUDA Graphs to capture and replay device operations with a single launch. We observe that TensorRT-LLM achieves higher throughput in this configuration, largely due … view at source ↗

**Figure 16.** Figure 16: AL Stability across various models. Average AL measured on the Throughput Split buckets (1k–32k). Target is GPT-OSS 120B, with three EAGLE3 drafters. Carefully configured RoPE scaling can ensure stability over all context lengths. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

read the original abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SPEED-Bench adds diversity splits and vLLM/TensorRT-LLM ties that existing SD benchmarks lack, but the curation for real-world representativeness stays under-specified.

read the letter

SPEED-Bench introduces a qualitative split chosen for semantic diversity and a throughput split that runs across low to high concurrency. It also wires the evaluation directly into vLLM and TensorRT-LLM instead of staying at high-level simulations. Those two moves are the concrete additions over prior work on speculative decoding benchmarks, which the abstract correctly flags as too narrow on tasks and too detached from production serving behavior. The paper uses the new splits to surface observations such as synthetic data inflating throughput numbers and batch size shifting the best draft length, which are the kind of practical signals people running inference care about. Releasing the benchmark itself is the part that could actually move the field forward if others adopt it. The soft spot is the qualitative split. The description says the samples were picked to prioritize semantic diversity, yet it gives no numbers on how that was measured or any check that the distribution matches production traces. Without embedding variance, topic coverage stats, or a hold-out comparison to real queries, the claim that it exposes behaviors masked by other benchmarks rests on an assumption rather than demonstrated coverage. The throughput results and engine integrations look more grounded because they can be run and inspected directly. This is for researchers who compare speculative decoding methods under realistic load and want a common testbed instead of each group rolling their own data. A reader who already works on LLM serving or inference optimization will find the splits and the production hooks useful even if they later swap in their own data. The work shows clear engagement with the limitations of current evaluation practice, so it deserves a serious referee. I would send it to review and ask specifically for the curation metrics and the full quantitative tables that back the overestimation and batch-dependent claims.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces SPEED-Bench, a benchmark suite for Speculative Decoding (SD) in LLMs. It features a Qualitative data split curated by prioritizing semantic diversity across samples, a Throughput data split supporting speedup measurements across concurrencies from latency-sensitive low-batch to high-load regimes, and direct integrations with production engines such as vLLM and TensorRT-LLM. The authors claim this enables quantification of synthetic-input overestimation of real-world throughput, identification of batch-size-dependent optimal draft lengths, detection of biases in low-diversity data, and analysis of vocabulary-pruning caveats, thereby establishing a unified standard for practical SD comparisons.

Significance. If the benchmark's data splits prove representative and the engine integrations reliably expose production behaviors, SPEED-Bench could become a standard reference for SD evaluation, enabling more accurate cross-algorithm comparisons and highlighting limitations of synthetic or low-diversity workloads that current benchmarks obscure.

major comments (3)

[Abstract / Data Curation description] The central claim that the Qualitative data split 'sufficiently represents real-world workloads' rests on curation by 'prioritizing semantic diversity,' yet the manuscript provides no quantitative validation metrics (e.g., embedding variance, topic entropy, distributional similarity to production traces, or held-out query validation). This assumption is load-bearing for the assertions about realistic serving regimes and unified evaluation standard.
[Throughput evaluation and engine integration sections] The Throughput data split and vLLM/TensorRT-LLM integrations are presented as exposing system behaviors masked by high-level implementations, but the manuscript lacks concrete implementation details, quantitative comparisons (e.g., throughput deltas or latency breakdowns), or ablation showing how these integrations differ from prior high-level SD evaluations.
[Results / Highlighted observations] Observations such as 'synthetic inputs overestimate real-world throughput' and 'batch-size dependent optimal draft lengths' are highlighted, but without accompanying methodology details, data statistics, tables of quantitative results, or error bars, it is not possible to verify the strength of support for these claims.

minor comments (1)

[Abstract] The abstract refers to 'we highlight this by quantifying...' without cross-references to specific sections, figures, or tables where the quantitative results appear; adding such pointers would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on SPEED-Bench. We address each major comment below and will revise the manuscript to incorporate quantitative validations, implementation details, and expanded results sections as suggested.

read point-by-point responses

Referee: [Abstract / Data Curation description] The central claim that the Qualitative data split 'sufficiently represents real-world workloads' rests on curation by 'prioritizing semantic diversity,' yet the manuscript provides no quantitative validation metrics (e.g., embedding variance, topic entropy, distributional similarity to production traces, or held-out query validation). This assumption is load-bearing for the assertions about realistic serving regimes and unified evaluation standard.

Authors: We agree that quantitative validation metrics would strengthen the claim of representativeness. In the revised manuscript, we will add embedding variance, topic entropy, distributional similarity to production traces, and held-out query validation to the data curation section to empirically support the semantic diversity prioritization. revision: yes
Referee: [Throughput evaluation and engine integration sections] The Throughput data split and vLLM/TensorRT-LLM integrations are presented as exposing system behaviors masked by high-level implementations, but the manuscript lacks concrete implementation details, quantitative comparisons (e.g., throughput deltas or latency breakdowns), or ablation showing how these integrations differ from prior high-level SD evaluations.

Authors: We will expand these sections with concrete implementation details for the vLLM and TensorRT-LLM integrations, including pseudocode and configuration specifics. Quantitative comparisons such as throughput deltas and latency breakdowns versus high-level baselines, plus ablations, will be added to demonstrate the differences. revision: yes
Referee: [Results / Highlighted observations] Observations such as 'synthetic inputs overestimate real-world throughput' and 'batch-size dependent optimal draft lengths' are highlighted, but without accompanying methodology details, data statistics, tables of quantitative results, or error bars, it is not possible to verify the strength of support for these claims.

Authors: We acknowledge the need for greater transparency. The revised Results section will include detailed methodology, data statistics, tables with quantitative results, and error bars from repeated runs to allow verification of the observations on synthetic input overestimation and batch-size dependent draft lengths. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark introduction relies on curation choices and integrations, not derived predictions

full rationale

The paper presents SPEED-Bench as a new evaluation suite with curated data splits and production-engine integrations. No equations, fitted parameters, or first-principles derivations appear in the manuscript. The qualitative split is introduced via an explicit curation decision (prioritizing semantic diversity), which is an input rather than a result derived from the benchmark itself. Throughput splits and vLLM/TensorRT-LLM integrations are described as engineering contributions without self-referential reduction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is therefore self-contained as an empirical benchmark release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that semantic diversity in the selected samples represents real workloads and that production-engine integration reveals otherwise masked behaviors.

axioms (1)

domain assumption The qualitative data split selected by prioritizing semantic diversity across samples is representative of real-world semantic domains and workloads
Invoked to justify the curation of the qualitative split as described in the abstract.

pith-pipeline@v0.9.0 · 5560 in / 1233 out tokens · 213007 ms · 2026-05-16T03:05:43.307851+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 13 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Pard: Accelerating llm inference with low-cost parallel draft model adaptation

An, Z., Bai, H., Liu, Z., Li, D., and Barsoum, E. Pard: Accelerating llm inference with low-cost parallel draft model adaptation. arXiv preprint arXiv:2504.18583, 2025

work page arXiv 2025
[4]

Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

Bai, G., Liu, J., Bu, X., He, Y., Liu, J., Zhou, Z., Lin, Z., Su, W., Ge, T., Zheng, B., et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. arXiv preprint arXiv:2402.14762, 2024

work page arXiv 2024
[5]

Nvidia nemotron 3: Efficient and open intelligence

Blakeman, A., Grattafiori, A., Basant, A., Gupta, A., Khattar, A., Renduchintala, A., Vavre, A., Shukla, A., Bercovich, A., Ficek, A., et al. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856, 2025

work page arXiv 2025
[6]

Long code arena: a set of benchmarks for long-context code models

Bogomolov, E., Eliseeva, A., Galimzyanov, T., Glukhov, E., Shapkin, A., Tigina, M., Golubev, Y., Kovrigin, A., van Deursen, A., Izadi, M., and Bryksin, T. Long code arena: a set of benchmarks for long-context code models. arXiv preprint arXiv:2406.11612, 2024

work page arXiv 2024
[7]

Bojar, O., Buck, C., Federmann, C., Haddow, B., Koehn, P., Leveling, J., Monz, C., Pecina, P., Post, M., Saint-Amand, H., Soricut, R., Specia, L., and Tamchyna, A. s. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pp.\ 12--58, Baltimore, Maryland, USA, June 2014. A...

work page 2014
[8]

D., Chen, D., and Dao, T

Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J. D., Chen, D., and Dao, T. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning, pp.\ 5209--5235. PMLR, 2024

work page 2024
[9]

Accelerating Large Language Model Decoding with Speculative Sampling

Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Ch...

work page 2021
[11]

Sequoia: Scalable, robust, and hardware-aware speculative decoding

Chen, Z., May, A., Svirschevski, R., Huang, Y., Ryabinin, M., Jia, Z., and Chen, B. Sequoia: Scalable, robust, and hardware-aware speculative decoding. CoRR, 2024

work page 2024
[12]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Enhancing chat language models by scaling high-quality instructional conversations

Ding, N., Chen, Y., Xu, B., Qin, Y., Hu, S., Liu, Z., Sun, M., and Zhou, B. Enhancing chat language models by scaling high-quality instructional conversations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 3029--3051, 2023

work page 2023
[14]

X., and Wen, J.-R

Dong, Z., Tang, T., Li, J., Zhao, W. X., and Wen, J.-R. Bamboo: A comprehensive benchmark for evaluating long text modeling capacities of large language models. arXiv preprint arXiv:2309.13345, 2023

work page arXiv 2023
[15]

Break the sequential dependency of llm inference using lookahead decoding

Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Break the sequential dependency of llm inference using lookahead decoding. In Forty-first International Conference on Machine Learning

work page
[16]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., and et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

REST : Retrieval-based speculative decoding

He, Z., Zhong, Z., Cai, T., Lee, J., and He, D. REST : Retrieval-based speculative decoding. In Duh, K., Gomez, H., and Bethard, S. (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 1582--1595, Mexico City, Mexico, June 2024. ...

work page doi:10.18653/v1/2024.naacl-long.88 2024
[19]

Moesd: Unveil speculative decoding's potential for accelerating sparse moe

Huang, Z., Zhu, L., Zhan, Z., Hu, T., Mao, W., Yu, X., Liu, Y., and Zhang, T. Moesd: Unveil speculative decoding's potential for accelerating sparse moe. arXiv preprint arXiv:2505.19645, 2025

work page arXiv 2025
[20]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., and Stoica, I. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

H., Gonzalez, J

Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

work page 2023
[22]

Fast inference from transformers via speculative decoding

Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023

work page 2023
[23]

M., Ghaddar, A., Sun, Q., Ma, L., Luo, Y., Li, D., Coates, M., Hao, J., and Zhang, Y

Li, D., Zhou, J., Brunswic, L. M., Ghaddar, A., Sun, Q., Ma, L., Luo, Y., Li, D., Coates, M., Hao, J., and Zhang, Y. Omni-thinker: Scaling multi-task rl in llms with hybrid reward and task scheduling, 2025 a . URL https://arxiv.org/abs/2507.14783

work page arXiv 2025
[24]

Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals

Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D., Sutherland Robson, E., Kohli, P., de Freitas, N., Kavukcuoglu, K., and Vinyals, O...

work page arXiv 2022
[25]

EAGLE : Speculative sampling requires rethinking feature uncertainty

Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE : Speculative sampling requires rethinking feature uncertainty. In International Conference on Machine Learning, 2024 a

work page 2024
[26]

EAGLE-2 : Faster inference of language models with dynamic draft trees

Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE-2 : Faster inference of language models with dynamic draft trees. In Empirical Methods in Natural Language Processing, 2024 b

work page 2024
[27]

EAGLE-3 : Scaling up inference acceleration of large language models via training-time test

Li, Y., Wei, F., Zhang, C., and Zhang, H. EAGLE-3 : Scaling up inference acceleration of large language models via training-time test. In Annual Conference on Neural Information Processing Systems, 2025 b

work page 2025
[28]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Liu, T., Xu, C., and McAuley, J. Repobench: Benchmarking repository-level code auto-completion systems, 2024 a . URL https://arxiv.org/abs/2306.03091

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Chatqa: Surpassing gpt-4 on conversational qa and rag

Liu, Z., Ping, W., Roy, R., Xu, P., Lee, C., Shoeybi, M., and Catanzaro, B. Chatqa: Surpassing gpt-4 on conversational qa and rag. arXiv preprint arXiv:2401.10225, 2024 b

work page arXiv 2024
[30]

X., Sha, J., Wang, S., and Wen, J.-R

Luo, W., Zhao, W. X., Sha, J., Wang, S., and Wen, J.-R. Mmath: A multilingual benchmark for mathematical reasoning. arXiv preprint arXiv:2505.19126, 2025

work page arXiv 2025
[31]

Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Zhang, Z., Wong, R. Y. Y., Zhu, A., Yang, L., Shi, X., et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Vo...

work page 2024
[32]

Y., Singh, S., Tang, X., von Werra, L., and Longpre, S

Muennighoff, N., Liu, Q., Zebaze, A., Zheng, Q., Hui, B., Zhuo, T. Y., Singh, S., Tang, X., von Werra, L., and Longpre, S. Octopack: Instruction tuning code large language models. arXiv preprint arXiv:2308.07124, 2023

work page arXiv 2023
[33]

Tensorrt‑llm: High‑performance inference for large language models

NVIDIA . Tensorrt‑llm: High‑performance inference for large language models. https://github.com/NVIDIA/TensorRT-LLM, 2023. Accessed: 2026‑01‑06

work page 2023
[34]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025. URL https://arxiv.org/abs/2508.10925

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

Paech, S. J. Eq-bench creative writing benchmark v3. https://github.com/EQ-bench/creative-writing-bench, 2025

work page 2025
[36]

Mcif: Multimodal crosslingual instruction-following benchmark from scientific talks, 2025

Papi, S., Züfle, M., Gaido, M., Savoldi, B., Liu, D., Douros, I., Bentivogli, L., and Niehues, J. Mcif: Multimodal crosslingual instruction-following benchmark from scientific talks, 2025. URL https://arxiv.org/abs/2507.19634

work page arXiv 2025
[37]

Ya RN : Efficient context window extension of large language models

Peng, B., Quesnelle, J., Fan, H., and Shippole, E. Ya RN : Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=wHBfxhZu1u

work page 2024
[38]

Humanity's Last Exam

Phan, L., Gatti, A., Han, Z., and et al. Humanity's last exam, 2025. URL https://arxiv.org/abs/2501.14249

work page internal anchor Pith review Pith/arXiv arXiv 2025
[39]

Project gutenberg

Project Gutenberg . Project gutenberg. https://www.gutenberg.org

work page
[40]

E.-H., May, A., Chen, T., and Chen, B

Sadhukhan, R., Chen, J., Chen, Z., Tiwari, V., Lai, R., Shi, J., Yen, I. E.-H., May, A., Chen, T., and Chen, B. Magicdec: Breaking the latency-throughput tradeoff for long context generation with speculative decoding. In The Thirteenth International Conference on Learning Representations

work page
[42]

J., and Manning, C

See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1073--1083, Vancouver, Canada, July 2017 b . Association for Computational Linguistics. doi:10.18653/v1/P17-1099

work page doi:10.18653/v1/p17-1099 2017
[43]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

work page 2023
[44]

Ada-leval: Evaluating long-context llms with length-adaptable benchmarks

Wang, C., Duan, H., Zhang, S., Lin, D., and Chen, K. Ada-leval: Evaluating long-context llms with length-adaptable benchmarks. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.\ 3712--3724, 2024 a

work page 2024
[45]

Coser: Coordinating llm-based persona simulation of established roles, 2025

Wang, X., Wang, H., Zhang, Y., Yuan, X., Xu, R., tse Huang, J., Yuan, S., Guo, H., Chen, J., Wang, W., Xiao, Y., and Zhou, S. Coser: Coordinating llm-based persona simulation of established roles, 2025. URL https://arxiv.org/abs/2502.09082

work page arXiv 2025
[46]

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark

Wang, Y., Ma, X., Zhang, G., Ni, Y., Chandra, A., Guo, S., Ren, W., Arulraj, A., He, X., Jiang, Z., et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37: 0 95266--95290, 2024 b

work page 2024
[47]

M., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Zhang, M., Zhang, Z., Ouyang, W., Xu, K., Chen, W., Fu, J., and Peng, J

Wang, Z. M., Peng, Z., Que, H., Liu, J., Zhou, W., Wu, Y., Guo, H., Gan, R., Ni, Z., Zhang, M., Zhang, Z., Ouyang, W., Xu, K., Chen, W., Fu, J., and Peng, J. Rolellm: Benchmarking, eliciting, and enhancing role-playing abilities of large language models. arXiv preprint arXiv: 2310.00746, 2023

work page arXiv 2023
[48]

L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conferenc...

work page 2020
[49]

WritingBench: A Comprehensive Benchmark for Generative Writing, March 2025

Wu, Y., Mei, J., Yan, M., Li, C., Lai, S., Ren, Y., Wang, Z., Zhang, J., Wu, M., Jin, Q., and Huang, F. Writingbench: A comprehensive benchmark for generative writing, 2025. URL https://arxiv.org/abs/2503.05244

work page arXiv 2025
[50]

Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding

Xia, H., Yang, Z., Dong, Q., Wang, P., Li, Y., Ge, T., Liu, T., Li, W., and Sui, Z. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Findings of the Association for Computational Linguistics ACL 2024, pp.\ 7655--7671, Bangkok, Thailand and virtual me...

work page doi:10.18653/v1/2024.findings-acl.456 2024
[51]

Parallelspec: Parallel drafter for efficient speculative decoding

Xiao, Z., Zhang, H., Ge, T., Ouyang, S., Ordonez, V., and Yu, D. Parallelspec: Parallel drafter for efficient speculative decoding. arXiv preprint arXiv:2410.05589, 2024

work page arXiv 2024
[52]

MiMo-V2-Flash Technical Report

Xiaomi, L.-C. Mimo-v2-flash technical report, 2026. URL https://arxiv.org/abs/2601.02780

work page internal anchor Pith review Pith/arXiv arXiv 2026
[53]

Xu, Z., Jiang, F., Niu, L., Deng, Y., Poovendran, R., Choi, Y., and Lin, B. Y. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464, 2024

work page internal anchor Pith review arXiv 2024
[54]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Longspec: Long-context lossless speculative decoding with efficient drafting and verification

Yang, P., Du, C., Zhang, F., Wang, H., Pang, T., Du, C., and An, B. Longspec: Long-context lossless speculative decoding with efficient drafting and verification. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, 2025 b

work page 2025
[56]

Improving massively multilingual neural machine translation and zero-shot translation

Zhang, B., Williams, P., Titov, I., and Sennrich, R. Improving massively multilingual neural machine translation and zero-shot translation. In Jurafsky, D., Chai, J., Schluter, N., and Tetreault, J. (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.\ 1628--1639, Online, July 2020. Association for Computati...

work page doi:10.18653/v1/2020.acl-main.148 2020
[57]

Judging llm-as-a-judge with mt-bench and chatbot arena

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 0 46595--46623, 2023

work page 2023
[58]

SGLang: Efficient Execution of Structured Language Model Programs

Zheng, L., Yin, L., Xie, Z., Sun, C. L., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., and Sheng, Y. Sglang: Efficient execution of structured language model programs. In Conference on Neural Information Processing Systems (NeurIPS), 2024. doi:10.48550/arXiv.2312.07104

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.07104 2024