SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
Pith reviewed 2026-05-16 03:05 UTC · model grok-4.3
The pith
SPEED-Bench establishes a unified benchmark for speculative decoding, covering diverse semantic domains, throughput evaluation across a range of concurrencies, and direct integration with production engines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SPEED-Bench establishes a unified evaluation standard for practical comparisons of SD algorithms by offering diverse semantic domains, throughput splits across concurrencies, and integration with production engines like vLLM and TensorRT-LLM. It quantifies how synthetic inputs overestimate real-world throughput, identifies batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzes the caveats of vocabulary pruning in state-of-the-art drafters.
What carries the argument
The SPEED-Bench suite: a Qualitative data split curated for semantic diversity across samples, a Throughput data split spanning latency-sensitive to high-load concurrencies, and direct integration with production engines.
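The abstract names the two splits but not a harness. As a rough illustration of what "speedup evaluation across a range of concurrencies" entails, here is a minimal sketch; `generate_sd`, `generate_baseline`, and `generate_fn` are hypothetical stand-ins for calls into a serving engine such as vLLM or TensorRT-LLM, not SPEED-Bench's actual interface.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(generate_fn, prompts, concurrency):
    """Tokens/sec at one concurrency level.

    generate_fn(prompt) -> number of generated tokens; a hypothetical
    stand-in for a call into a serving engine.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate_fn, prompts))
    elapsed = time.perf_counter() - start
    return sum(token_counts) / elapsed

def concurrency_sweep(generate_sd, generate_baseline, prompts,
                      levels=(1, 4, 16, 64)):
    """Speedup of speculative decoding over the baseline at each level,
    from latency-sensitive (1) to high-load (64) concurrency."""
    return {
        c: measure_throughput(generate_sd, prompts, c)
           / measure_throughput(generate_baseline, prompts, c)
        for c in levels
    }
```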
If this is right
- Synthetic inputs overestimate real-world throughput gains from speculative decoding.
- Optimal draft lengths vary with batch size in production settings; a minimal sweep sketch follows this list.
- Low-diversity data introduces systematic biases in measured speedups.
- Vocabulary pruning in current drafters carries identifiable limitations under realistic loads.
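A minimal sketch of how the batch-size-dependent optimum in the second point could be located. `run_batch` is a hypothetical hook into a speculative-decoding serving run, not part of SPEED-Bench; the paper's finding is that the argmax shifts as batch size grows.

```python
def optimal_draft_lengths(run_batch, batch_sizes=(1, 8, 32, 128),
                          draft_lens=range(1, 9)):
    """For each batch size, pick the draft length maximizing throughput.

    run_batch(batch_size, draft_len) -> tokens/sec (hypothetical hook).
    """
    best = {}
    for b in batch_sizes:
        throughputs = {k: run_batch(b, k) for k in draft_lens}
        best[b] = max(throughputs, key=throughputs.get)
    return best
```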
Where Pith is reading between the lines
- Widespread use of SPEED-Bench could replace ad-hoc evaluations and make head-to-head claims about new speculative decoding methods more reliable.
- The split design may transfer to other data-dependent LLM serving techniques such as speculative sampling or tree decoding.
- Extending the benchmark with additional languages or multimodal inputs would test whether the current diversity criteria generalize.
Load-bearing premise
That the curated qualitative split and the production-engine integrations are representative enough to reveal behaviors that other benchmarks mask.
What would settle it
If side-by-side runs of the same speculative decoding algorithms on SPEED-Bench and prior benchmarks produce identical speedup rankings and no new batch-size or diversity effects, the added splits and integrations would not change practical conclusions.
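That test reduces to comparing two rankings of the same algorithms. A minimal sketch using Kendall's tau; the dictionaries of measured speedups are assumed inputs, not results from the paper.

```python
from scipy.stats import kendalltau

def ranking_agreement(speedups_speed_bench, speedups_prior):
    """Kendall's tau between two speedup rankings of the same SD algorithms.

    Both arguments map algorithm name -> measured speedup. tau == 1.0 means
    the benchmarks rank algorithms identically, the outcome that would
    undercut SPEED-Bench's added value.
    """
    algos = sorted(set(speedups_speed_bench) & set(speedups_prior))
    tau, p_value = kendalltau(
        [speedups_speed_bench[a] for a in algos],
        [speedups_prior[a] for a in algos],
    )
    return tau, p_value
```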
Original abstract
Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
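The data dependence the abstract leans on has a standard quantitative form. Under the i.i.d. acceptance model of Leviathan et al. (2023), the acceptance rate is a property of the workload, so any benchmark that skews it (e.g., synthetic or low-diversity inputs) skews the measured speedup:

```latex
% Expected tokens generated per target-model forward pass, where \alpha is
% the per-token acceptance rate (data-dependent) and \gamma the draft length
% (i.i.d. acceptance model of Leviathan et al., 2023):
\mathbb{E}[\text{tokens per pass}] = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}
```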
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SPEED-Bench, a benchmark suite for Speculative Decoding (SD) in LLMs. It features a Qualitative data split curated by prioritizing semantic diversity across samples, a Throughput data split supporting speedup measurements across concurrencies from latency-sensitive low-batch to high-load regimes, and direct integrations with production engines such as vLLM and TensorRT-LLM. The authors claim this enables quantification of synthetic-input overestimation of real-world throughput, identification of batch-size-dependent optimal draft lengths, detection of biases in low-diversity data, and analysis of vocabulary-pruning caveats, thereby establishing a unified standard for practical SD comparisons.
Significance. If the benchmark's data splits prove representative and the engine integrations reliably expose production behaviors, SPEED-Bench could become a standard reference for SD evaluation, enabling more accurate cross-algorithm comparisons and highlighting limitations of synthetic or low-diversity workloads that current benchmarks obscure.
major comments (3)
- [Abstract / Data Curation description] The central claim that the Qualitative data split 'sufficiently represents real-world workloads' rests on curation by 'prioritizing semantic diversity,' yet the manuscript provides no quantitative validation metrics (e.g., embedding variance, topic entropy, distributional similarity to production traces, or held-out query validation). This assumption is load-bearing for the assertions about realistic serving regimes and a unified evaluation standard. A sketch of two such metrics follows this list.
- [Throughput evaluation and engine integration sections] The Throughput data split and vLLM/TensorRT-LLM integrations are presented as exposing system behaviors masked by high-level implementations, but the manuscript lacks concrete implementation details, quantitative comparisons (e.g., throughput deltas or latency breakdowns), or ablation showing how these integrations differ from prior high-level SD evaluations.
- [Results / Highlighted observations] Observations such as 'synthetic inputs overestimate real-world throughput' and 'batch-size dependent optimal draft lengths' are highlighted, but without accompanying methodology details, data statistics, tables of quantitative results, or error bars, it is not possible to verify the strength of support for these claims.
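To make the first comment concrete, here is one possible operationalization of two of the metrics the referee names. Sentence embeddings of the split are assumed to be already computed, and the clustering step is an arbitrary stand-in for a topic model; none of this is from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_metrics(embeddings, n_topics=20, seed=0):
    """Two candidate validation metrics for a data split.

    - embedding variance: mean per-dimension variance of sample embeddings;
    - topic entropy: Shannon entropy of cluster occupancy, with k-means
      clusters standing in for topics.

    embeddings: (n_samples, dim) array of sentence embeddings, assumed given.
    """
    emb_variance = float(np.mean(np.var(embeddings, axis=0)))
    labels = KMeans(n_clusters=n_topics, random_state=seed,
                    n_init=10).fit_predict(embeddings)
    counts = np.bincount(labels, minlength=n_topics)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty clusters before taking logs
    topic_entropy = float(-np.sum(probs * np.log2(probs)))
    return emb_variance, topic_entropy
```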
minor comments (1)
- [Abstract] The abstract refers to 'we highlight this by quantifying...' without cross-references to specific sections, figures, or tables where the quantitative results appear; adding such pointers would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on SPEED-Bench. We address each major comment below and will revise the manuscript to incorporate quantitative validations, implementation details, and expanded results sections as suggested.
Point-by-point responses
-
Referee: [Abstract / Data Curation description] The central claim that the Qualitative data split 'sufficiently represents real-world workloads' rests on curation by 'prioritizing semantic diversity,' yet the manuscript provides no quantitative validation metrics (e.g., embedding variance, topic entropy, distributional similarity to production traces, or held-out query validation). This assumption is load-bearing for the assertions about realistic serving regimes and unified evaluation standard.
Authors: We agree that quantitative validation metrics would strengthen the claim of representativeness. In the revised manuscript, we will add embedding variance, topic entropy, distributional similarity to production traces, and held-out query validation to the data curation section to empirically support the semantic diversity prioritization. revision: yes
-
Referee: [Throughput evaluation and engine integration sections] The Throughput data split and vLLM/TensorRT-LLM integrations are presented as exposing system behaviors masked by high-level implementations, but the manuscript lacks concrete implementation details, quantitative comparisons (e.g., throughput deltas or latency breakdowns), or ablation showing how these integrations differ from prior high-level SD evaluations.
Authors: We will expand these sections with concrete implementation details for the vLLM and TensorRT-LLM integrations, including pseudocode and configuration specifics. Quantitative comparisons such as throughput deltas and latency breakdowns versus high-level baselines, plus ablations, will be added to demonstrate the differences. revision: yes
-
Referee: [Results / Highlighted observations] Observations such as 'synthetic inputs overestimate real-world throughput' and 'batch-size dependent optimal draft lengths' are highlighted, but without accompanying methodology details, data statistics, tables of quantitative results, or error bars, it is not possible to verify the strength of support for these claims.
Authors: We acknowledge the need for greater transparency. The revised Results section will include detailed methodology, data statistics, tables with quantitative results, and error bars from repeated runs to allow verification of the observations on synthetic input overestimation and batch-size dependent draft lengths. revision: yes
Circularity Check
No circularity: benchmark introduction relies on curation choices and integrations, not derived predictions
Full rationale
The paper presents SPEED-Bench as a new evaluation suite with curated data splits and production-engine integrations. No equations, fitted parameters, or first-principles derivations appear in the manuscript. The qualitative split is introduced via an explicit curation decision (prioritizing semantic diversity), which is an input rather than a result derived from the benchmark itself. Throughput splits and vLLM/TensorRT-LLM integrations are described as engineering contributions without self-referential reduction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is therefore self-contained as an empirical benchmark release.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Qualitative data split, selected by prioritizing semantic diversity across samples, is representative of real-world semantic domains and workloads.