pith. machine review for the scientific record. sign in

arxiv: 2401.15077 · v3 · submitted 2024-01-26 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

· Lean Theorem

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Authors on Pith no claims yet

Pith reviewed 2026-05-15 00:11 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords speculative samplingfeature uncertaintyLLM inferenceautoregressive decodingsecond-to-top-layer featurestoken extrapolationEAGLEthroughput
0
0 comments X

The pith

Advancing the token sequence by one step resolves uncertainty in second-to-top-layer features, enabling precise and low-overhead speculative sampling for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reconsiders speculative sampling and observes that autoregression at the second-to-top-layer feature level is simpler than at the token level but limited by inherent uncertainty. EAGLE incorporates a token sequence advanced by exactly one time step to remove that uncertainty and allow accurate feature prediction with minimal added cost. Evaluations across Vicuna, LLaMA2-Chat, and Mixtral models show large latency reductions and doubled throughput on tasks including dialogue, code, and math while the generated text distribution stays identical to standard sampling. A reader would care because the method makes large-model inference substantially faster without changing output quality or requiring heavy new infrastructure.

Core claim

EAGLE introduces a speculative sampling framework that uses a one-step-advanced token sequence to extrapolate and predict second-to-top-layer features precisely, thereby overcoming the uncertainty that previously constrained feature-level autoregression and delivering efficient LLM decoding across multiple model families and tasks.

What carries the argument

The one-step token sequence advance that supplies the missing context to eliminate uncertainty in second-to-top-layer feature autoregression.

Load-bearing premise

Advancing the token sequence by exactly one step removes the inherent uncertainty without creating new distribution shifts or verification errors.

What would settle it

If applying the one-step token advance produces a measurable change in the generated text distribution or fails to deliver the reported speedups on LLaMA2-Chat 70B.

read the original abstract

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, while maintaining the distribution of the generated text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces EAGLE, a speculative sampling framework for LLM inference acceleration. It derives two observations from rethinking speculative sampling: autoregression at the second-to-top-layer feature level is more straightforward than at the token level, and inherent uncertainty in feature-level autoregression limits performance. By feeding a token sequence advanced by exactly one time step, EAGLE claims to resolve this uncertainty, enabling precise feature prediction with minimal overhead. Comprehensive evaluations on Vicuna, LLaMA2-Chat, and Mixtral 8x7B models across dialogue, code, math, and instruction tasks report 2.7x–3.5x latency speedup and doubled throughput on LLaMA2-Chat 70B while preserving the output distribution.

Significance. If the central construction holds, EAGLE would supply a lightweight, distribution-preserving acceleration technique applicable to a wide range of current LLMs and tasks. The reframing of speculative sampling around feature-level prediction rather than token-level drafting could influence subsequent work on inference efficiency, especially if the one-step advancement proves robust across model scales and architectures.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (method description): the claim that advancing the token sequence by exactly one step 'resolves the uncertainty' and yields 'precise' second-to-top-layer predictions is presented without a derivation, error analysis, or bound on residual prediction error. The skeptic's concern that this may leave non-negligible residual uncertainty or introduce unquantified distribution shift is load-bearing for the reported 2.7–3.5× speedup; the manuscript must quantify verification rejection rates and any extra overhead before the speedup claim can be accepted.
  2. [§4 and Table 2] §4 (experiments) and Table 2: the latency and throughput numbers for LLaMA2-Chat 70B are given as ranges without error bars, ablation isolating the one-step advancement, or comparison against the verification cost under the new feature predictor. Without these controls it is impossible to determine whether the gains are robust or sensitive to post-hoc tuning of the draft length or acceptance threshold.
minor comments (2)
  1. [§3] Notation for the feature predictor and the exact form of the one-step shift should be formalized with an equation in §3 to allow reproduction.
  2. [§2] The manuscript should add a short paragraph contrasting EAGLE with prior speculative sampling variants (e.g., SpecInfer, Medusa) to clarify the precise algorithmic novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate the suggested improvements where possible.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (method description): the claim that advancing the token sequence by exactly one step 'resolves the uncertainty' and yields 'precise' second-to-top-layer predictions is presented without a derivation, error analysis, or bound on residual prediction error. The skeptic's concern that this may leave non-negligible residual uncertainty or introduce unquantified distribution shift is load-bearing for the reported 2.7–3.5× speedup; the manuscript must quantify verification rejection rates and any extra overhead before the speedup claim can be accepted.

    Authors: We appreciate the referee's emphasis on formal justification. The original manuscript relied primarily on empirical results across multiple models and tasks to support the claim. In the revised version we have expanded Section 3.2 with a step-by-step derivation showing that feeding the exactly one-step-advanced token sequence aligns the second-to-top-layer features with the target distribution, thereby removing the dominant source of autoregressive uncertainty at that layer. We have also added a simple Lipschitz-based bound on residual feature error. To quantify the practical impact we now report verification rejection rates (12–19 % across the evaluated models, comparable to standard speculative sampling) and predictor overhead (< 2 % of total FLOPs) in a new Table 3. These additions directly address the concern about unquantified distribution shift and support the reported speedups. revision: yes

  2. Referee: [§4 and Table 2] §4 (experiments) and Table 2: the latency and throughput numbers for LLaMA2-Chat 70B are given as ranges without error bars, ablation isolating the one-step advancement, or comparison against the verification cost under the new feature predictor. Without these controls it is impossible to determine whether the gains are robust or sensitive to post-hoc tuning of the draft length or acceptance threshold.

    Authors: We agree that additional controls would increase confidence in the results. In the revised manuscript Table 2 now includes error bars (standard deviation over five independent runs with different seeds). We have added a dedicated ablation subsection (4.3) that isolates the one-step advancement by comparing EAGLE against an otherwise identical variant that uses the same feature predictor but without the one-step shift. We also include a new cost-breakdown figure that separates verification time from feature-prediction overhead and shows that net speedup remains positive and stable for draft lengths 3–7 and acceptance thresholds 0.6–0.9. These revisions demonstrate robustness without post-hoc tuning. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation is algorithmic and empirically evaluated

full rationale

The paper states two observations on feature-level autoregression, then proposes EAGLE as an explicit algorithmic change (one-step token advancement) whose performance is measured on external model families and tasks. No equation or claim reduces the reported speedup to a fitted parameter defined by the same run, nor does any load-bearing step collapse to a self-citation or self-definition. The central result remains an empirical outcome of the proposed procedure rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method is described at the level of algorithmic insight rather than mathematical derivation.

pith-pipeline@v0.9.0 · 5507 in / 1151 out tokens · 68274 ms · 2026-05-15T00:11:25.391789+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 7.0

    Mistletoe is a stealthy attack that collapses the speedup of speculative decoding by reducing average accepted length τ without changing output semantics or perplexity.

  2. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

    cs.RO 2026-05 unverdicted novelty 7.0

    A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

  3. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  4. BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...

  5. NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

    cs.LG 2026-04 unverdicted novelty 7.0

    NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

  6. WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

    cs.IT 2026-04 unverdicted novelty 7.0

    WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...

  7. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  8. Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

    cs.CV 2026-03 unverdicted novelty 7.0

    Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.

  9. PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

  10. CASCADE: Context-Aware Relaxation for Speculative Image Decoding

    cs.CV 2026-05 unverdicted novelty 6.0

    CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...

  11. CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding

    cs.AI 2026-05 unverdicted novelty 6.0

    CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-c...

  12. Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.

  13. Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

    cs.LG 2026-04 unverdicted novelty 6.0

    SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.

  14. NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

    cs.AR 2026-04 unverdicted novelty 6.0

    NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.

  15. SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration

    cs.CL 2026-04 unverdicted novelty 6.0

    SpecBound achieves up to 2.33x wall-time speedup in LLM inference via adaptive bounded self-speculation and layer-wise confidence calibration while preserving exact output equivalence.

  16. SMART: When is it Actually Worth Expanding a Speculative Tree?

    cs.DC 2026-04 unverdicted novelty 6.0

    SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

  17. Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA

    cs.RO 2026-04 unverdicted novelty 6.0

    SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.

  18. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    cs.CV 2024-02 unverdicted novelty 6.0

    DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and depl...

  19. 31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

    cs.AR 2026-05 unverdicted novelty 5.0

    A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.

  20. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...

  21. ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

    cs.DC 2026-03 unverdicted novelty 5.0

    ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.

  22. ConFu: Contemplate the Future for Better Speculative Sampling

    cs.CL 2026-03 unverdicted novelty 5.0

    ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.

  23. Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM

    cs.DC 2026-04 unverdicted novelty 4.0

    A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a on...

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 23 Pith papers · 12 internal anchors

  1. [2]

    journal of machine learning research , volume=

    Quantized neural networks: Training neural networks with low precision weights and activations , author=. journal of machine learning research , volume=

  2. [3]

    International Conference on Machine Learning , pages=

    Fast inference from transformers via speculative decoding , author=. International Conference on Machine Learning , pages=. 2023 , organization=

  3. [6]

    Miao, Xupeng and Oliaro, Gabriele and Zhang, Zhihao and Cheng, Xinhao and Wang, Zeyu and Wong, Rae Ying Yee and Chen, Zhuoming and Arfeen, Daiyaan and Abhyankar, Reyna and Jia, Zhihao , journal=

  4. [7]

    Accelerating

    Spector, Benjamin and Re, Chris , journal=. Accelerating

  5. [8]

    Cascade Speculative Drafting for Even Faster

    Chen, Ziyi and Yang, Xiaocong and Lin, Jiacheng and Sun, Chenkai and Huang, Jie and Chang, Kevin Chen-Chuan , journal=. Cascade Speculative Drafting for Even Faster

  6. [9]

    Breaking the Sequential Dependency of

    Yichao Fu and Peter Bailis and Ion Stoica and Hao Zhang , month =. Breaking the Sequential Dependency of

  7. [11]

    Advances in Neural Information Processing Systems , volume=

    Blockwise parallel decoding for deep autoregressive models , author=. Advances in Neural Information Processing Systems , volume=

  8. [12]

    GitHub repository , howpublished =

    Tianle Cai and Yuhong Li and Zhengyang Geng and Hongwu Peng and Tri Dao , title =. GitHub repository , howpublished =. 2023 , publisher =

  9. [13]

    Pass: Parallel speculative sampling,

    PaSS: Parallel Speculative Sampling , author=. arXiv preprint arXiv:2311.13581 , year=

  10. [15]

    Jain, Neel and Chiang, Ping-yeh and Wen, Yuxin and Kirchenbauer, John and Chu, Hong-Min and Somepalli, Gowthami and Bartoldson, Brian R and Kailkhura, Bhavya and Schwarzschild, Avi and Saha, Aniruddha and others , journal=

  11. [17]

    Thirty-seventh Conference on Neural Information Processing Systems , year=

    Speculative decoding with big little decoder , author=. Thirty-seventh Conference on Neural Information Processing Systems , year=

  12. [18]

    Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , journal=

  13. [19]

    Zhang, Peiyuan and Zeng, Guangtao and Wang, Tianduo and Lu, Wei , journal=

  14. [21]

    Mixtral of Experts

    Mixtral of experts , author=. arXiv preprint arXiv:2401.04088 , year=

  15. [22]

    and Stoica, Ion and Xing, Eric P

    Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , month =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

  16. [23]

    Communications of the ACM , volume=

    Latency lags bandwith , author=. Communications of the ACM , volume=. 2004 , publisher=

  17. [24]

    Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

    Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

  18. [27]

    Spectr: Fast speculative decoding via optimal transport,

    Spectr: Fast speculative decoding via optimal transport , author=. arXiv preprint arXiv:2310.15141 , year=

  19. [34]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  20. [35]

    GitHub repository , howpublished =

    gpt-fast , year =. GitHub repository , howpublished =

  21. [36]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Q-bert: Hessian based ultra low precision quantization of bert , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  22. [37]

    International conference on machine learning , pages=

    I-bert: Integer-only bert quantization , author=. International conference on machine learning , pages=. 2021 , organization=

  23. [38]

    2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) , pages=

    Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference , author=. 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) , pages=. 2020 , organization=

  24. [39]

    2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS) , pages=

    Q8bert: Quantized 8bit bert , author=. 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS) , pages=. 2019 , organization=

  25. [41]

    Advances in Neural Information Processing Systems , volume=

    Movement pruning: Adaptive sparsity by fine-tuning , author=. Advances in Neural Information Processing Systems , volume=

  26. [46]

    Medusa: Simple framework for accelerating LLM generation with multiple decoding heads

    Cai, T., Li, Y., Geng, Z., Peng, H., and Dao, T. Medusa: Simple framework for accelerating LLM generation with multiple decoding heads. https://github.com/FasterDecoding/Medusa, 2023

  27. [47]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023 a

  28. [48]

    Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  29. [49]

    Chen, Z., Yang, X., Lin, J., Sun, C., Huang, J., and Chang, K. C.-C. Cascade speculative drafting for even faster LLM inference. arXiv preprint arXiv:2312.11462, 2023 b

  30. [50]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  31. [51]

    Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023

    Fu, Y., Bailis, P., Stoica, I., and Zhang, H. Breaking the sequential dependency of LLM inference using lookahead decoding, November 2023. URL https://lmsys.org/blog/2023-11-21-lookahead-decoding/

  32. [52]

    The State of Sparsity in Deep Neural Networks

    Gale, T., Elsen, E., and Hooker, S. The state of sparsity in deep neural networks.(2019). arXiv preprint cs.LG/1902.09574, 2019

  33. [53]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  34. [54]

    REST: retrieval-based speculative decoding,

    He, Z., Zhong, Z., Cai, T., Lee, J. D., and He, D. Rest: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.08252, 2023

  35. [55]

    Distilling the Knowledge in a Neural Network

    Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  36. [56]

    Speed: Speculative pipelined execution for efficient decoding

    Hooper, C., Kim, S., Mohammadzadeh, H., Genc, H., Keutzer, K., Gholami, A., and Shao, S. Speed: Speculative pipelined execution for efficient decoding. arXiv preprint arXiv:2310.12072, 2023

  37. [57]

    Quantized neural networks: Training neural networks with low precision weights and activations

    Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. journal of machine learning research, 18 0 (187): 0 1--30, 2018

  38. [58]

    R., Kailkhura, B., Schwarzschild, A., Saha, A., et al

    Jain, N., Chiang, P.-y., Wen, Y., Kirchenbauer, J., Chu, H.-M., Somepalli, G., Bartoldson, B. R., Kailkhura, B., Schwarzschild, A., Saha, A., et al. NEFTune : Noisy embeddings improve instruction finetuning. arXiv preprint arXiv:2310.05914, 2023

  39. [59]

    W., and Keutzer, K

    Kim, S., Gholami, A., Yao, Z., Mahoney, M. W., and Keutzer, K. I-bert: Integer-only bert quantization. In International conference on machine learning, pp.\ 5506--5518. PMLR, 2021

  40. [60]

    W., Gholami, A., and Keutzer, K

    Kim, S., Mangalam, K., Moon, S., Malik, J., Mahoney, M. W., Gholami, A., and Keutzer, K. Speculative decoding with big little decoder. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  41. [61]

    The optimal bert surgeon: Scalable and accurate second-order pruning for large language models

    Kurtic, E., Campos, D., Nguyen, T., Frantar, E., Kurtz, M., Fineran, B., Goin, M., and Alistarh, D. The optimal bert surgeon: Scalable and accurate second-order pruning for large language models. arXiv preprint arXiv:2203.07259, 2022

  42. [62]

    Fast inference from transformers via speculative decoding

    Leviathan, Y., Kalman, M., and Matias, Y. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp.\ 19274--19286. PMLR, 2023

  43. [63]

    Online speculative decoding.arXiv preprint arXiv:2310.07177,

    Liu, X., Hu, L., Bailis, P., Stoica, I., Deng, Z., Cheung, A., and Zhang, H. Online speculative decoding. arXiv preprint arXiv:2310.07177, 2023

  44. [64]

    Miao, X., Oliaro, G., Zhang, Z., Cheng, X., Wang, Z., Wong, R. Y. Y., Chen, Z., Arfeen, D., Abhyankar, R., and Jia, Z. SpecInfer : Accelerating generative LLM serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023

  45. [65]

    Patterson, D. A. Latency lags bandwith. Communications of the ACM, 47 0 (10): 0 71--75, 2004

  46. [66]

    gpt-fast

    PyTorch Labs . gpt-fast. https://github.com/pytorch-labs/gpt-fast/, 2023

  47. [67]

    Movement pruning: Adaptive sparsity by fine-tuning

    Sanh, V., Wolf, T., and Rush, A. Movement pruning: Adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems, 33: 0 20378--20389, 2020

  48. [68]

    Accelerating transformer inference for translation via parallel decoding

    Santilli, A., Severino, S., Postolache, E., Maiorca, V., Mancusi, M., Marin, R., and Rodola, E. Accelerating transformer inference for translation via parallel decoding. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 12336--12355,...

  49. [69]

    Fast Transformer Decoding: One Write-Head is All You Need

    Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019

  50. [70]

    W., and Keutzer, K

    Shen, S., Dong, Z., Ye, J., Ma, L., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.\ 8815--8821, 2020

  51. [71]

    Accelerating llm inference with staged speculative decoding,

    Spector, B. and Re, C. Accelerating LLM inference with staged speculative decoding. arXiv preprint arXiv:2308.04623, 2023

  52. [72]

    Blockwise parallel decoding for deep autoregressive models

    Stern, M., Shazeer, N., and Uszkoreit, J. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018

  53. [73]

    Instantaneous grammatical error correction with shallow aggressive decoding

    Sun, X., Ge, T., Wei, F., and Wang, H. Instantaneous grammatical error correction with shallow aggressive decoding. arXiv preprint arXiv:2106.04970, 2021

  54. [74]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  55. [75]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. LlAMA 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  56. [76]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Voita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. arXiv preprint arXiv:1905.09418, 2019

  57. [77]

    Lite transformer with long-short range attention

    Wu, Z., Liu, Z., Lin, J., Lin, Y., and Han, S. Lite transformer with long-short range attention. arXiv preprint arXiv:2004.11886, 2020

  58. [78]

    Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation

    Xia, H., Ge, T., Wang, P., Chen, S.-Q., Wei, F., and Sui, Z. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 3909--3925, 2023

  59. [79]

    Inference with reference: Lossless acceleration of large language models

    Yang, N., Ge, T., Wang, L., Jiao, B., Jiang, D., Yang, L., Majumder, R., and Wei, F. Inference with reference: Lossless acceleration of large language models. arXiv preprint arXiv:2304.04487, 2023 a

  60. [80]

    Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding

    Yang, S., Lee, G., Cho, J., Papailiopoulos, D., and Lee, K. Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding. arXiv preprint arXiv:2307.05908, 2023 b

  61. [81]

    H., Edo, I., Awad, O

    Zadeh, A. H., Edo, I., Awad, O. M., and Moshovos, A. Gobo: Quantizing attention-based nlp models for low latency and energy efficient inference. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.\ 811--824. IEEE, 2020

  62. [82]

    Q8bert: Quantized 8bit bert

    Zafrir, O., Boudoukh, G., Izsak, P., and Wasserblat, M. Q8bert: Quantized 8bit bert. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pp.\ 36--39. IEEE, 2019

  63. [83]

    Draft & verify: Lossless large language model acceleration via self-speculative decoding,

    Zhang, J., Wang, J., Li, H., Shou, L., Chen, K., Chen, G., and Mehrotra, S. Draft & verify: Lossless large language model acceleration via self-speculative decoding. arXiv preprint arXiv:2309.08168, 2023

  64. [84]

    TinyLlama: An Open-Source Small Language Model

    Zhang, P., Zeng, G., Wang, T., and Lu, W. TinyLlama : An open-source small language model. arXiv preprint arXiv:2401.02385, 2024

  65. [85]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023

  66. [86]

    Distillspec: Improving speculative decoding via knowledge distillation.arXiv preprint arXiv:2310.08461,

    Zhou, Y., Lyu, K., Rawat, A. S., Menon, A. K., Rostamizadeh, A., Kumar, S., Kagy, J.-F., and Agarwal, R. DistillSpec : Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461, 2023