EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Chao Zhang; Fangyun Wei; Hongyang Zhang; Yuhui Li

arxiv: 2503.01840 · v3 · pith:P42XLZOBnew · submitted 2025-03-03 · 💻 cs.CL

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang

Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3

keywords speculative sampling · inference acceleration · large language models · draft model · token prediction · multi-layer fusion · training-time test

The pith

By switching to direct token prediction and multi-layer feature fusion, EAGLE-3 enables draft models to improve with increased training data for faster LLM inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern large language models generate text one token at a time, making them slow, and speculative sampling uses a smaller draft model to propose multiple tokens for the main model to verify in parallel. Earlier methods like EAGLE predicted internal features rather than actual tokens and depended only on the top layer of the target model, which capped how much extra training data could help the draft model. EAGLE-3 switches to predicting tokens directly and fuses features from multiple layers of the target model through a training-time test process. This change removes the prior limits on data scaling, raising the rate at which proposed tokens are accepted and producing larger speedups. A reader would care because the approach keeps inference quality high while cutting the time and compute needed to run capable models.
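The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration with greedy, exact-match verification, not EAGLE-3's implementation: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, and a real system would verify all drafted positions in one batched forward pass of the target model rather than one call per position.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interface: a model is a function from a token prefix to its greedy next token.
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """One draft-then-verify cycle; returns the tokens appended this cycle."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal (per-position calls here, for clarity).
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target's next token is a bonus.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also return how many verification cycles were used."""
    out = list(prefix)
    cycles = 0
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(out, draft, target, k))
        cycles += 1
    return out[: len(prefix) + n_new], cycles


# Toy models: the target counts upward mod 10; the draft is wrong only after a 4.
def toy_target(seq: List[Token]) -> Token:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[Token]) -> Token:
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

With `k=3`, `generate([0], toy_draft, toy_target, 3, 12)` returns the same 12 tokens as decoding with `toy_target` alone, but uses 4 verification cycles instead of 12 sequential target steps. The more often the draft agrees with the target, the fewer cycles are needed.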

Core claim

EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data, achieving a speedup ratio of up to 6.5x, about 1.4x better than EAGLE-2.

What carries the argument

Training-time test that performs multi-layer feature fusion to support direct token prediction in the draft model.
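As a rough picture of multi-layer feature fusion: the summary above does not say which layers are fused or how, so the sketch below assumes three hidden states (low, middle, high) that are concatenated and passed through a single linear projection back to the model width. `fuse_layers` and its arguments are illustrative names, and the weights would be learned during draft-model training.

```python
from typing import List

Vector = List[float]


def fuse_layers(low: Vector, mid: Vector, high: Vector,
                weight: List[Vector]) -> Vector:
    """Concatenate three hidden states (length d each) and project 3d -> d.

    Assumption: concatenation plus one linear map; the source only says
    that features from multiple layers are fused.
    """
    concat = low + mid + high  # list concatenation, length 3 * d
    for row in weight:
        if len(row) != len(concat):
            raise ValueError("each weight row must have length 3 * d")
    # Plain matrix-vector product: one output value per weight row.
    return [sum(w * x for w, x in zip(row, concat)) for row in weight]
```

The draft model would then consume the fused vector in place of the single top-layer feature that earlier EAGLE versions reused.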

If this is right

Inference runs up to 6.5 times faster than standard autoregressive decoding on chat and reasoning tasks.
The draft model delivers roughly 1.4 times the speedup of EAGLE-2.
Throughput rises 1.38 times in frameworks such as SGLang when batch size reaches 64.
Gains appear consistently across both chat-oriented and reasoning-oriented target models on five separate benchmarks.
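The link between acceptance rate and speedup behind the figures above can be made concrete with the standard back-of-envelope model for chain drafting. It assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per cycle, and that one draft step costs a fraction `c` of a target forward pass. This is a simplification (real systems may draft trees rather than chains, and acceptances are not independent), and the function names are illustrative.

```python
def expected_tokens_per_cycle(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verify cycle: 1 + alpha + ... + alpha**gamma."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, gamma: int, c: float) -> float:
    """Tokens per cycle divided by cycle cost, in units of one target forward pass."""
    cycle_cost = gamma * c + 1.0  # gamma draft steps plus one target verification
    return expected_tokens_per_cycle(alpha, gamma) / cycle_cost
```

Under this model, raising the acceptance rate is the main lever: for fixed `gamma` and `c`, `modeled_speedup` increases monotonically with `alpha`, which is why a draft model that keeps improving with more training data translates into larger speedups.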

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same direct-prediction shift could be tested in other speculative-sampling methods that currently rely on feature matching.
If multi-layer fusion proves stable, it may allow smaller target models to be paired with stronger drafts without losing end-to-end quality.
Pairing the technique with quantization or other compression methods could produce compounded reductions in latency and memory.

Load-bearing premise

Direct token prediction combined with multi-layer feature fusion will remove prior constraints on scaling training data without introducing new accuracy or stability problems in the draft model.

What would settle it

Train an EAGLE-3 draft model on substantially more data than the EAGLE-2 baseline and check whether the measured inference speedup ratio fails to increase; if acceptance rates stay flat or drop, the claim does not hold.
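One way to run the check described above is to fit the trend of measured speedup against the logarithm of training-data size and test whether it is positive. The helper names and the `min_slope` threshold are assumptions made for illustration; the inputs must be measurements from actual training runs.

```python
import math
from typing import Sequence


def log_scaling_slope(data_sizes: Sequence[float],
                      speedups: Sequence[float]) -> float:
    """Least-squares slope of speedup against log2(training-data size)."""
    if len(data_sizes) != len(speedups) or len(data_sizes) < 2:
        raise ValueError("need at least two paired measurements")
    xs = [math.log2(size) for size in data_sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(speedups) / len(speedups)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("data sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, speedups))
    return sxy / sxx


def scaling_claim_holds(data_sizes: Sequence[float],
                        speedups: Sequence[float],
                        min_slope: float = 0.0) -> bool:
    """True if speedup grows with data faster than min_slope per doubling."""
    return log_scaling_slope(data_sizes, speedups) > min_slope
```

The slope reads as "speedup gained per doubling of training data". A flat or negative slope for EAGLE-3 would count against the scaling claim, and comparing it with the slope for the EAGLE-2 baseline would show whether the two methods really scale differently.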

Original abstract

The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EAGLE-3 switches to direct token prediction and adds training-time multi-layer fusion, which produces the claimed speedups over EAGLE-2, but the story about now scaling better with more training data rests on unshown comparisons.

The main takeaway is that EAGLE-3 drops feature-level prediction for direct token prediction and uses a training-time test to fuse features across layers instead of relying only on the top layer. This produces the reported gains: up to 6.5x speedup overall and roughly 1.4x better than EAGLE-2 on both chat and reasoning models across five tasks, plus a 1.38x throughput lift in SGLang at batch size 64. The code release is a practical plus for anyone who wants to reproduce or extend it. The experiments cover enough model types and tasks to show the changes are not just tuned to one narrow setting.

The central claim is that these changes remove the old constraints so the draft model can now take full advantage of larger training sets. That part is harder to judge because the paper does not include scaling curves that plot performance against training-data volume for EAGLE versus EAGLE-3, nor ablations that hold the fusion fixed while changing only the prediction target. Without those, the speedups could come from the fusion step or hyper-parameter shifts rather than the removal of the feature-prediction bottleneck. The results also give no error bars, no details on exact data splits, and limited baseline descriptions, so the robustness of the reported improvement is hard to assess.

This paper is aimed at people who build or deploy speculative decoding systems and care about concrete latency or throughput numbers. A reader working on inference optimization would get usable ideas and numbers from it. The changes are concrete enough and the empirical results positive enough that it deserves a serious referee, even if the reviewers will likely request the missing scaling plots and statistical details.

Referee Report

2 major / 1 minor

Summary. The paper introduces EAGLE-3 as an extension of prior EAGLE speculative sampling for LLM inference acceleration. It replaces feature prediction with direct token prediction and introduces multi-layer feature fusion via a 'training-time test' technique, claiming this removes prior constraints and allows the draft model to fully benefit from scaling training data. Experiments on chat and reasoning models across five tasks report speedups up to 6.5x (1.4x over EAGLE-2) and 1.38x throughput improvement in SGLang at batch size 64, with code released.

Significance. If the core claims hold, the work offers a practical advance in speculative decoding by addressing data-scaling limitations in draft models, with potential impact on efficient LLM deployment. Strengths include empirical evaluation across multiple model types and tasks plus public code release, which supports reproducibility and follow-up work.

major comments (2)

Abstract: the central claim that abandoning feature prediction and adding multi-layer fusion 'enable the draft model to fully benefit from scaling up training data' lacks supporting evidence; no scaling curves, performance-vs-data-volume plots, or ablations isolating direct token prediction (while holding fusion fixed) are presented, so observed gains could stem from hyperparameter changes or fusion alone rather than removal of the feature-prediction bottleneck.
Experiments section: reported speedups (6.5x, 1.4x over EAGLE-2) and throughput numbers provide no details on exact baselines, statistical significance, error bars, data splits, or variance across runs, preventing full verification of the performance claims.

minor comments (1)

Abstract: the phrase 'training-time test' is used without a concise definition or pointer to its implementation details; add a brief description or section reference for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and commit to revising the manuscript to strengthen the evidence and clarity of our claims.

Referee: Abstract: the central claim that abandoning feature prediction and adding multi-layer fusion 'enable the draft model to fully benefit from scaling up training data' lacks supporting evidence; no scaling curves, performance-vs-data-volume plots, or ablations isolating direct token prediction (while holding fusion fixed) are presented, so observed gains could stem from hyperparameter changes or fusion alone rather than removal of the feature-prediction bottleneck.

Authors: We acknowledge that the current manuscript does not present explicit scaling curves, performance-vs-data-volume plots, or ablations that isolate direct token prediction while holding multi-layer fusion fixed. Our central claim rests on the empirical observation that prior EAGLE variants (relying on feature prediction) exhibit limited gains from increased training data, whereas EAGLE-3 shows substantial improvements over EAGLE-2. To address the concern that gains may arise from other factors, we will add dedicated ablations and scaling plots in the revised version. These additions will directly test the contribution of abandoning feature prediction.

revision: yes

Referee: Experiments section: reported speedups (6.5x, 1.4x over EAGLE-2) and throughput numbers provide no details on exact baselines, statistical significance, error bars, data splits, or variance across runs, preventing full verification of the performance claims.

Authors: We agree that the experimental section requires more precise reporting to enable verification. In the revision we will explicitly list the exact baseline implementations and versions (including EAGLE-2), report statistical significance tests, include error bars derived from multiple independent runs, detail the data splits used for training and evaluation, and quantify variance across runs. The public code release already supports reproducibility, but the text will be updated to include these details.

revision: yes
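The variance reporting promised in the response above could look like the following sketch, which summarizes per-run speedups as mean, sample standard deviation, and standard error of the mean. `summarize_runs` is a hypothetical helper, not part of the EAGLE codebase, and it assumes the runs are independent.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(speedups: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    if len(speedups) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(speedups)
    std = statistics.stdev(speedups)       # sample (n - 1) standard deviation
    stderr = std / math.sqrt(len(speedups))
    return mean, std, stderr
```

Reporting each headline number as mean plus or minus standard error, together with the number of runs, would let a reader judge whether a gap such as the 1.4x improvement over EAGLE-2 exceeds run-to-run noise.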

Circularity Check

0 steps flagged

No significant circularity; empirical speedups are measured outcomes, not reductions to fitted inputs or self-citations.

The paper's central claims rest on experimental measurements of speedup (up to 6.5x and 1.4x over EAGLE-2) across five tasks after introducing direct token prediction and training-time test fusion. No equations are presented that define the reported ratios in terms of internally fitted parameters, and no uniqueness theorems or ansatzes are imported via self-citation to force the architecture. The scaling-data benefit is asserted from observed performance differences rather than derived by construction from the method's own definitions. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the established premise that speculative sampling can accelerate autoregressive generation and on standard supervised training of a draft model; no new physical constants or invented entities are introduced.

free parameters (1)

draft model training hyperparameters
Learning rate and batch size are set for the draft model's training run, and the layer-fusion weights are fitted during it.

axioms (1)

domain assumption: Speculative sampling with a draft model reduces wall-clock latency while preserving the output distribution
Invoked in the opening paragraph as the foundation for all EAGLE variants.
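The distribution-preservation half of this assumption can be checked exactly for the standard speculative sampling rule: accept a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalized residual max(0, p - q). The review does not spell out which acceptance rule EAGLE-3 uses, so treating it as this standard rule is an assumption, and `output_distribution` is an illustrative name.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of one token emitted by speculative sampling.

    p: the target model's next-token distribution.
    q: the draft model's next-token distribution.
    """
    # Probability that token x is drafted and accepted:
    # q(x) * min(1, p(x) / q(x)) = min(p(x), q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)  # probability that the draft is rejected

    # On rejection, resample from the residual max(0, p - q), normalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    norm = sum(residual)
    if norm == 0.0:  # p equals q, so nothing is ever rejected
        return accept
    return [a + reject_mass * r / norm for a, r in zip(accept, residual)]
```

For any valid pair of distributions the result equals `p` up to floating-point error, because min(p, q) + max(0, p - q) = p term by term. The draft model therefore affects only speed, never the distribution of the generated text.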

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

HierarchyEmergence hierarchy_emergence_forces_phi

Relation between the paper passage and the cited Recognition theorem: unclear

EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via training-time test, significantly enhancing performance and enabling the draft model to fully benefit from scaling up training data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SSV: Sparse Speculative Verification for Efficient LLM Inference
cs.OS 2026-05 unverdicted novelty 7.0

SpecSA is a sparse speculative-verification framework that integrates speculative decoding and dynamic sparse attention to achieve up to 3.49x end-to-end throughput and 6.86x kernel speedups on H100 GPUs for long-cont...
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
cs.CL 2026-05 unverdicted novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
cs.RO 2026-05 unverdicted novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
cs.LG 2026-05 unverdicted novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
Test-Time Speculation
cs.CL 2026-05 unverdicted novelty 7.0

Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
cs.CL 2026-05 unverdicted novelty 7.0

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware ...
An Empirical Study of Speculative Decoding on Software Engineering Tasks
cs.SE 2026-04 unverdicted novelty 7.0

Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
cs.LG 2026-04 unverdicted novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
cs.IT 2026-04 unverdicted novelty 7.0

WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
cs.CV 2026-03 unverdicted novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
cs.RO 2026-03 unverdicted novelty 7.0

KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
cs.CV 2025-11 conditional novelty 7.0

VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
cs.CL 2026-05 unverdicted novelty 6.0

PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
cs.LG 2026-05 unverdicted novelty 6.0

Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
cs.LG 2026-05 unverdicted novelty 6.0

Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
cs.LG 2026-05 unverdicted novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
Test-Time Speculation
cs.CL 2026-05 unverdicted novelty 6.0

TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
cs.CL 2026-05 unverdicted novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
cs.CL 2026-05 unverdicted novelty 6.0

SpecBlock achieves 8-19% higher speedup than EAGLE-3 in LLM speculative decoding by using repeated block expansions with hidden-state inheritance, a dynamic rank head, and a valid-prefix training mask.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding
cs.CV 2026-05 unverdicted novelty 6.0

CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
cs.LG 2026-04 unverdicted novelty 6.0

SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
cs.CL 2026-04 unverdicted novelty 6.0

RACER unifies retrieval of exact matching patterns with logit-driven cues to produce better speculative drafts, achieving more than 2x speedup over autoregressive decoding and outperforming prior training-free specula...
ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
cs.LG 2026-04 unverdicted novelty 6.0

ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
SMART: When is it Actually Worth Expanding a Speculative Tree?
cs.DC 2026-04 unverdicted novelty 6.0

SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
Multi-Token Prediction via Self-Distillation
cs.CL 2026-02 unverdicted novelty 6.0

Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
cs.LG 2026-01 conditional novelty 6.0

ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
cs.LG 2026-01 unverdicted novelty 6.0

ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
cs.DC 2025-11 unverdicted novelty 6.0

Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt simil...
SSV: Sparse Speculative Verification for Efficient LLM Inference
cs.OS 2026-05 unverdicted novelty 5.0

SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVI...
Lever: Speculative LLM Inference on Smartphones
cs.LG 2026-05 unverdicted novelty 5.0

Lever optimizes the drafting, verification, and execution stages of speculative decoding for flash-backed LLM inference on smartphones, reporting 2.93x average latency reduction over baseline flash-offloaded inference.
Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework
cs.CR 2026-01 unverdicted novelty 5.0

CyberOps-Bots is a hierarchical LLM-empowered multi-agent RL framework that reports 68.5% higher network availability and 34.7% better jumpstart performance in new scenarios without retraining on real cloud datasets.
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
cs.CL 2025-07 unverdicted novelty 5.0

LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup an...
Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
cs.AI 2026-05 unverdicted novelty 4.0

Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.