HyperDFlash: Hyper-Connection-Aligned Block Speculative Decoding with Gated Residual Reduction

Fangmin Chen; Hongjian Sun; Junhao Hua; Luxi Lin; Qiang Wang; Rui Ma; Shuang Peng; Shuwei Fan; Songwei Liu; Zhengda Qin

arxiv: 2606.26744 · v2 · pith:CRSDSOYDnew · submitted 2026-06-25 · 💻 cs.LG · cs.CL

HyperDFlash: Hyper-Connection-Aligned Block Speculative Decoding with Gated Residual Reduction

Luxi Lin , Shuang Peng , Rui Ma , Junhao Hua , Shuwei Fan , Zhengda Qin , Qiang Wang , Hongjian Sun

show 2 more authors

Fangmin Chen Songwei Liu

This is my paper

Pith reviewed 2026-06-30 10:08 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords speculative decodinghyper-connectionsblock-parallel draftinggated residual reductionmulti-token predictionKL distillationDeepSeek-V4

0 comments

The pith

HyperDFlash resolves feature misalignment in DeepSeek-V4's hyper-connections by conditioning drafters on pre-collapse residuals and inheriting a lightweight gated reducer from the hc_head module.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperDFlash as a block-parallel speculative decoding method built for models that use hyper-connections. Native multi-token prediction in DeepSeek-V4 loses accuracy on later draft tokens because of error buildup and because standard drafting methods clash with the model's multi-path residual streams. HyperDFlash fixes the clash by feeding the drafter only pre-collapse residual states and by replacing a generic compressor with a gated reducer whose weights come directly from the target's own hc_head. A KL distillation loss is added during training to keep early drafts close to the target distribution. Experiments on math, code, and chat tasks show longer accepted drafts and faster decoding than both the built-in MTP baseline and a direct DFlash port.

Core claim

By restricting the conditioning signal to pre-collapse residual states and replacing the linear compressor with a gated residual reducer whose parameters are copied from the target's hc_head module, the drafter regains alignment with the hyper-connection pathway; the resulting block drafts achieve higher acceptance rates and produce measurable speedups when paired with a targeted KL loss on the LM head.

What carries the argument

Pre-collapse residual states as the sole conditioning input together with a gated residual reducer that inherits parameters from the model's built-in hc_head module.

If this is right

Average accepted draft length increases on math reasoning, code synthesis, and conversational benchmarks.
Decoding throughput improves over both native MTP and a direct DFlash adaptation of the same model.
Early-position draft quality improves when the KL distillation term is applied to the LM head.
The gated reducer adds only a negligible parameter count while preserving exact architectural alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pre-collapse conditioning and parameter-inheritance pattern could be tested on other architectures that maintain parallel residual paths before aggregation.
Because the reducer reuses existing hc_head weights, the method may reduce the compute needed to train a separate drafter for any hyper-connection model.
If the gated reducer generalizes, block-parallel drafting could become a standard add-on for any model whose residuals are summed from multiple paths.

Load-bearing premise

DeepSeek-V4's multi-path residual stream creates feature misalignment that is resolved by conditioning exclusively on pre-collapse residual states and inheriting parameters directly from the built-in hc_head module for the gated reducer.

What would settle it

Run the same block-drafting experiments on DeepSeek-V4 but replace the pre-collapse conditioning with post-collapse states; if average accepted draft length and speedup both drop to the level of the vanilla DFlash adaptation, the alignment claim does not hold.

read the original abstract

We present HyperDFlash, a block-parallel speculative decoding framework tailored to DeepSeek-V4's Hyper-Connections (HC). Despite the strong performance of DeepSeek-V4's native Multi-Token Prediction (MTP) module on initial token drafting, its draft accuracy degrades sharply at later positions, as error accumulation from unverified intermediate tokens harms draft acceptance rates. Although the original DFlash method supports efficient one-pass block drafting, it cannot be seamlessly adapted to the HC paradigm, since DeepSeek-V4's multi-path residual stream induces inherent feature misalignment with conventional drafting designs. To resolve this architectural mismatch, we propose two dedicated, model-aligned optimizations for HC residual streams. First, we adopt pre-collapse residual states as the exclusive conditioning signal, preserving complete multi-path structural information and better aligning the drafter with the target's native prediction pathway. Second, we replace the heavy generic linear compressor with a lightweight gated residual reducer, whose parameters are directly inherited from the target model's built-in hc_head module. This design yields input-aware path aggregation with three orders of magnitude fewer parameters while maintaining precise architectural alignment. We further enhance model training via a targeted KL distillation loss applied to the LM-head, regularizing predictions against the target distribution to improve early draft quality. Extensive experiments across math reasoning, code synthesis, and conversational benchmarks demonstrate that HyperDFlash consistently outperforms both the native MTP baseline and vanilla DFlash adaptation, achieving substantial gains in average accepted draft length and decoding speedup. These results validate HC alignment, gated reduction, and targeted distillation for high-performance speculative decoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HyperDFlash makes two targeted, architecture-aligned changes to speculative decoding for DeepSeek-V4 that fix a residual mismatch and claim better performance.

read the letter

The two HC-specific changes in HyperDFlash address the residual stream mismatch in DeepSeek-V4 and produce the reported gains in draft acceptance.

The paper takes DFlash and modifies it with pre-collapse conditioning and an inherited gated reducer from hc_head, plus KL distillation. This keeps the drafter aligned without heavy new parameters. The gated reducer reuses the target's own module, which is efficient.

It does well at identifying why a direct adaptation fails and at proposing lightweight, architecture-matched solutions. The experiments on math reasoning, code synthesis, and conversational benchmarks show outperformance over MTP and vanilla DFlash in accepted draft length and speedup.

The soft spot is the narrow scope to this one model family, which limits how far the ideas will travel. The central assumption about feature misalignment holds up from the architecture description. No load-bearing flaws appear in the framing.

This paper is for people optimizing inference speed on hyper-connection LLMs like DeepSeek-V4. It deserves a serious referee because the problem is practical and the fixes are specific and testable. The results validate the targeted distillation and the two optimizations for this setting. Overall the work is incremental but cleanly executed for its niche. Readers working on similar architectural adaptations will find the details useful for their own implementations.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces HyperDFlash, a block-parallel speculative decoding framework for DeepSeek-V4's Hyper-Connections (HC) architecture. It identifies feature misalignment arising from the multi-path residual stream and proposes two HC-aligned fixes: conditioning exclusively on pre-collapse residual states and replacing the generic compressor with a lightweight gated residual reducer whose parameters are inherited from the built-in hc_head module. KL distillation on the LM-head is added to improve early draft quality. The central claim is that these changes yield consistent outperformance over the native MTP baseline and a vanilla DFlash adaptation, with gains in average accepted draft length and decoding speedup across math reasoning, code synthesis, and conversational benchmarks.

Significance. If the claimed gains are robust, the work supplies a practical, parameter-efficient template for adapting speculative decoding to models whose residual streams use hyper-connections. The explicit architectural alignment (pre-collapse conditioning and hc_head-inherited reducer) and the three-order-of-magnitude parameter reduction are concrete strengths that could generalize to other HC-style architectures.

major comments (1)

Abstract: the claim that HyperDFlash 'consistently outperforms' the baselines and achieves 'substantial gains' is presented without any quantitative metrics, error bars, baseline configurations, or statistical controls. Because the central claim is empirical outperformance, the absence of these details in the abstract (and the reader's note that none appear in the supplied text) renders the result impossible to evaluate for robustness or post-hoc selection effects.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for identifying the need for quantitative support in the abstract. We address the comment below and will incorporate the suggested changes in the revised manuscript.

read point-by-point responses

Referee: Abstract: the claim that HyperDFlash 'consistently outperforms' the baselines and achieves 'substantial gains' is presented without any quantitative metrics, error bars, baseline configurations, or statistical controls. Because the central claim is empirical outperformance, the absence of these details in the abstract (and the reader's note that none appear in the supplied text) renders the result impossible to evaluate for robustness or post-hoc selection effects.

Authors: We agree that the abstract, in its current form, presents the claims of consistent outperformance and substantial gains without accompanying quantitative metrics, error bars, baseline configurations, or statistical details. Although the body of the manuscript reports the full experimental results (including tables of accepted draft lengths, speedups, and benchmark-specific comparisons), we acknowledge that the abstract should be self-contained and allow direct evaluation of the central empirical claims. In the revised version we will update the abstract to include key quantitative results drawn from the experiments section, such as specific gains in average accepted draft length and decoding speedup, along with brief indications of the evaluation settings and baselines. This revision will be made while preserving the existing technical contributions. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical method for speculative decoding adapted to DeepSeek-V4's Hyper-Connections architecture, with two model-aligned optimizations (pre-collapse conditioning and hc_head-inherited gated reducer) plus KL distillation. These are described as direct engineering fixes for feature misalignment, validated through experiments on external benchmarks against native MTP and vanilla DFlash baselines. No derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing uniqueness theorems are present in the provided text; the central claims rest on comparative performance metrics rather than any reduction to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to identify any free parameters, axioms, or invented entities; no equations or implementation specifics are given.

pith-pipeline@v0.9.1-grok · 5843 in / 1138 out tokens · 119358 ms · 2026-06-30T10:08:11.457743+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 11 canonical work pages · 8 internal anchors

[1]

Hydra: Sequentially- dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan- Kelley, and William Brandon. Hydra: Sequentially- dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

work page arXiv 2024
[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Do- han, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Lee, Deming Chen, and Tri Dao

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration frame- work with multiple decoding heads. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024
[4]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irv- ing, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model de- coding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen, Yesheng Liang, and Zhijian Liu. DFlash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Ka- plan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large lan- guage models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavar- ian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Rei- ichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

DeepSeek-V4-Flash

DeepSeek-AI. DeepSeek-V4-Flash. https://huggingface.co/deepseek-ai/ DeepSeek-V4-Flash, 2026. Hugging Face model card

2026
[9]

Aly, Beidi Chen, and Carole-Jean Wu

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A. Aly, Beidi Chen, and Carole-Jean Wu. LayerSkip: Enabling early exit in- ferenceandself-speculativedecoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistic...

2024
[10]

Break the sequential dependency of LLM inference using lookahead decoding

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024
[11]

Bet- ter & faster large language models via multi-token prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roz- ière, David Lopez-Paz, and Gabriel Synnaeve. Bet- ter & faster large language models via multi-token prediction. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024
[12]

Lee, and Di He

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He. REST: Retrieval-based speculative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

2024
[13]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[14]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Dis- tilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[15]

Livecodebench: Holis- tic and contamination free evaluation of large lan- guage models for code

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, TianjunZhang, SidaWang, ArmandoSolar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holis- tic and contamination free evaluation of large lan- guage models for code. InInternational Conference on Learning Representations, volume 2025, pages 58791–58831, 2025

2025
[16]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serv- ing with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Sys- tems Principles, 2023

2023
[17]

Fast inference from transformers via speculative de- coding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative de- coding. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

2023
[18]

Eagle: Speculative sampling requires re- thinking feature uncertainty

YuhuiLi, FangyunWei, ChaoZhang, andHongyang Zhang. Eagle: Speculative sampling requires re- thinking feature uncertainty. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024
[19]

Eagle-2: Faster inference of language mod- els with dynamic draft trees

YuhuiLi, FangyunWei, ChaoZhang, andHongyang Zhang. Eagle-2: Faster inference of language mod- els with dynamic draft trees. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

2024
[20]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

YuhuiLi, FangyunWei, ChaoZhang, andHongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

SpecInfer: Ac- celerating generative large language model serving with tree-based speculative inference and verifica- tion

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xin- hao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. SpecInfer: Ac- celerating generative large language model serving with tree-based speculative inference and verifica- tion. InProceedings ...

2024
[22]

Blockwise parallel decoding for deep autoregressive models

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. InAdvances in Neural Information Process- ing Systems (NeurIPS), 2018

2018
[23]

Evol-CodeAlpaca-v1.https: //huggingface.co/datasets/theblackcat102/ evol-codealpaca-v1, 2024

theblackcat102. Evol-CodeAlpaca-v1.https: //huggingface.co/datasets/theblackcat102/ evol-codealpaca-v1, 2024. Hugging Face dataset card

2024
[24]

Deepseek- v4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, et al. Deepseek- v4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

work page arXiv 2026
[25]

Draft & verify: Lossless large language model acceleration via self-speculative decoding

Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (ACL), 2024

2024
[26]

EagleChat.https://huggingface.co/ datasets/zhaode/EagleChat, 2026

De Zhao. EagleChat.https://huggingface.co/ datasets/zhaode/EagleChat, 2026. Hugging Face dataset card

2026
[27]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

2023
[28]

Hyper-connections.arXiv preprint arXiv:2409.19606, 2024

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections.arXiv preprint arXiv:2409.19606, 2024. 7

work page arXiv 2024

[1] [1]

Hydra: Sequentially- dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan- Kelley, and William Brandon. Hydra: Sequentially- dependent draft heads for medusa decoding.arXiv preprint arXiv:2402.05109, 2024

work page arXiv 2024

[2] [2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Do- han, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[3] [3]

Lee, Deming Chen, and Tri Dao

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration frame- work with multiple decoding heads. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024

[4] [4]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irv- ing, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model de- coding with speculative sampling.arXiv preprint arXiv:2302.01318, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[5] [5]

DFlash: Block Diffusion for Flash Speculative Decoding

Jian Chen, Yesheng Liang, and Zhijian Liu. DFlash: Block diffusion for flash speculative decoding.arXiv preprint arXiv:2602.06036, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Ka- plan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large lan- guage models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[7] [7]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavar- ian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Rei- ichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

DeepSeek-V4-Flash

DeepSeek-AI. DeepSeek-V4-Flash. https://huggingface.co/deepseek-ai/ DeepSeek-V4-Flash, 2026. Hugging Face model card

2026

[9] [9]

Aly, Beidi Chen, and Carole-Jean Wu

Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A. Aly, Beidi Chen, and Carole-Jean Wu. LayerSkip: Enabling early exit in- ferenceandself-speculativedecoding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistic...

2024

[10] [10]

Break the sequential dependency of LLM inference using lookahead decoding

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of LLM inference using lookahead decoding. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024

[11] [11]

Bet- ter & faster large language models via multi-token prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roz- ière, David Lopez-Paz, and Gabriel Synnaeve. Bet- ter & faster large language models via multi-token prediction. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024

[12] [12]

Lee, and Di He

Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, and Di He. REST: Retrieval-based speculative decoding. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

2024

[13] [13]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[14] [14]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Dis- tilling the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[15] [15]

Livecodebench: Holis- tic and contamination free evaluation of large lan- guage models for code

Naman Jain, Alex Gu, Wen-Ding Li, Fanjia Yan, TianjunZhang, SidaWang, ArmandoSolar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holis- tic and contamination free evaluation of large lan- guage models for code. InInternational Conference on Learning Representations, volume 2025, pages 58791–58831, 2025

2025

[16] [16]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serv- ing with pagedattention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Sys- tems Principles, 2023

2023

[17] [17]

Fast inference from transformers via speculative de- coding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative de- coding. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023

2023

[18] [18]

Eagle: Speculative sampling requires re- thinking feature uncertainty

YuhuiLi, FangyunWei, ChaoZhang, andHongyang Zhang. Eagle: Speculative sampling requires re- thinking feature uncertainty. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

2024

[19] [19]

Eagle-2: Faster inference of language mod- els with dynamic draft trees

YuhuiLi, FangyunWei, ChaoZhang, andHongyang Zhang. Eagle-2: Faster inference of language mod- els with dynamic draft trees. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

2024

[20] [20]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

YuhuiLi, FangyunWei, ChaoZhang, andHongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test.arXiv preprint arXiv:2503.01840, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

SpecInfer: Ac- celerating generative large language model serving with tree-based speculative inference and verifica- tion

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xin- hao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. SpecInfer: Ac- celerating generative large language model serving with tree-based speculative inference and verifica- tion. InProceedings ...

2024

[22] [22]

Blockwise parallel decoding for deep autoregressive models

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. InAdvances in Neural Information Process- ing Systems (NeurIPS), 2018

2018

[23] [23]

Evol-CodeAlpaca-v1.https: //huggingface.co/datasets/theblackcat102/ evol-codealpaca-v1, 2024

theblackcat102. Evol-CodeAlpaca-v1.https: //huggingface.co/datasets/theblackcat102/ evol-codealpaca-v1, 2024. Hugging Face dataset card

2024

[24] [24]

Deepseek- v4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

Anyi Xu, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, Chenchen Ling, et al. Deepseek- v4: Towards highly efficient million-token context intelligence.arXiv preprint arXiv:2606.19348, 2026

work page arXiv 2026

[25] [25]

Draft & verify: Lossless large language model acceleration via self-speculative decoding

Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, and Sharad Mehrotra. Draft & verify: Lossless large language model acceleration via self-speculative decoding. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (ACL), 2024

2024

[26] [26]

EagleChat.https://huggingface.co/ datasets/zhaode/EagleChat, 2026

De Zhao. EagleChat.https://huggingface.co/ datasets/zhaode/EagleChat, 2026. Hugging Face dataset card

2026

[27] [27]

Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023

2023

[28] [28]

Hyper-connections.arXiv preprint arXiv:2409.19606, 2024

Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou. Hyper-connections.arXiv preprint arXiv:2409.19606, 2024. 7

work page arXiv 2024