EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Pith reviewed 2026-05-15 18:48 UTC · model grok-4.3
The pith
By switching to direct token prediction and multi-layer feature fusion, EAGLE-3 enables draft models to improve with increased training data for faster LLM inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. According to the paper, these changes improve performance and let the draft model benefit fully from scaling up training data, yielding a speedup ratio of up to 6.5x, about 1.4x better than EAGLE-2.
What carries the argument
The training-time test technique, which the abstract credits with enabling multi-layer feature fusion and direct token prediction in the draft model.
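A minimal sketch of what multi-layer feature fusion could look like for a single token position. The page does not say which layers are fused or how, so the choice of three feature vectors (`low`, `mid`, `high`), the concatenate-then-project design, and every name below are assumptions for illustration, not the authors' implementation.

```python
import random


def fuse_features(low, mid, high, weight, bias):
    """Concatenate three per-token feature vectors and project to hidden size.

    Assumption: fusion is concatenation followed by one linear layer.
    `weight` has one row per output dimension, each of length 3 * hidden.
    """
    concat = low + mid + high  # list concatenation, length 3 * hidden
    if any(len(row) != len(concat) for row in weight):
        raise ValueError("each weight row must have length 3 * hidden")
    return [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]


# Toy inputs with random values; a real draft model would read these
# vectors from hidden states of the target model.
hidden = 4
rng = random.Random(0)
low = [rng.uniform(-1, 1) for _ in range(hidden)]
mid = [rng.uniform(-1, 1) for _ in range(hidden)]
high = [rng.uniform(-1, 1) for _ in range(hidden)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(3 * hidden)] for _ in range(hidden)]
bias = [0.0] * hidden

fused = fuse_features(low, mid, high, weight, bias)
```

The fused vector has the same width as a single layer's features, so the rest of a draft model would not need to change shape to consume it.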
If this is right
- Inference runs up to 6.5 times faster than standard autoregressive decoding, for both chat and reasoning models.
- EAGLE-3 delivers roughly 1.4 times the speedup of EAGLE-2.
- Throughput in the SGLang framework improves 1.38 times at a batch size of 64.
- Gains appear consistently across both chat-oriented and reasoning-oriented target models on five separate benchmarks.
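The figures in the list above are ratios, and a short sketch makes their relationship explicit. The helper names are hypothetical, and the derived EAGLE-2 figure is only a back-of-envelope value: it assumes the 6.5x peak and the 1.4x relative gain refer to the same model and task, which the abstract does not state.

```python
def speedup_ratio(baseline_seconds, accelerated_seconds):
    """Speedup ratio: baseline wall-clock time divided by accelerated time."""
    return baseline_seconds / accelerated_seconds


def implied_reference_speedup(new_speedup, relative_gain):
    """Speedup of the older method implied by a new speedup and a relative gain."""
    return new_speedup / relative_gain


# 6.5x and 1.4x are the figures quoted in the abstract.
# Assumption: both describe the same setting, so this is only a rough estimate.
eagle2_implied = implied_reference_speedup(6.5, 1.4)  # roughly 4.6
```

If the two quoted numbers come from different settings, the implied value is not meaningful, which is one reason the referee report asks for exact baselines.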
Where Pith is reading between the lines
- The same direct-prediction shift could be tested in other speculative-sampling methods that currently rely on feature matching.
- If multi-layer fusion proves stable, it may allow smaller target models to be paired with stronger drafts without losing end-to-end quality.
- Pairing the technique with quantization or other compression methods could produce compounded reductions in latency and memory.
Load-bearing premise
Direct token prediction combined with multi-layer feature fusion will remove prior constraints on scaling training data without introducing new accuracy or stability problems in the draft model.
What would settle it
Train an EAGLE-3 draft model on substantially more data than the EAGLE-2 baseline and check whether the measured inference speedup ratio fails to increase; if acceptance rates stay flat or drop, the claim does not hold.
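The check described above can be sketched as a small function over (training-set size, measured speedup) pairs. The function name, the `min_gain` tolerance, and the sample values are all hypothetical; the numbers are placeholders chosen to show both outcomes and are not measurements from the paper.

```python
def speedup_scales_with_data(measurements, min_gain=0.0):
    """Return True if speedup rises by more than min_gain at every data increase.

    measurements: iterable of (training_examples, speedup) pairs.
    """
    ordered = sorted(measurements)  # sort by training-set size
    speedups = [speedup for _, speedup in ordered]
    return all(later - earlier > min_gain
               for earlier, later in zip(speedups, speedups[1:]))


# Placeholder values for illustration only; not results from the paper.
flat_curve = [(1, 3.0), (2, 3.0), (4, 3.05), (8, 3.0)]
rising_curve = [(1, 3.0), (2, 3.6), (4, 4.3), (8, 5.1)]

claim_fails = not speedup_scales_with_data(flat_curve)
claim_holds = speedup_scales_with_data(rising_curve)
```

In practice `min_gain` would be set from run-to-run variance, so that noise alone cannot count as a scaling gain.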
Original abstract
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EAGLE-3 as an extension of prior EAGLE speculative sampling for LLM inference acceleration. It replaces feature prediction with direct token prediction and introduces multi-layer feature fusion via a 'training-time test' technique, claiming this removes prior constraints and allows the draft model to fully benefit from scaling training data. Experiments on chat and reasoning models across five tasks report speedups up to 6.5x (1.4x over EAGLE-2) and 1.38x throughput improvement in SGLang at batch size 64, with code released.
Significance. If the core claims hold, the work offers a practical advance in speculative decoding by addressing data-scaling limitations in draft models, with potential impact on efficient LLM deployment. Strengths include empirical evaluation across multiple model types and tasks plus public code release, which supports reproducibility and follow-up work.
major comments (2)
- Abstract: the central claim that abandoning feature prediction and adding multi-layer fusion 'enable the draft model to fully benefit from scaling up training data' lacks supporting evidence; no scaling curves, performance-vs-data-volume plots, or ablations isolating direct token prediction (while holding fusion fixed) are presented, so observed gains could stem from hyperparameter changes or fusion alone rather than removal of the feature-prediction bottleneck.
- Experiments section: reported speedups (6.5x, 1.4x over EAGLE-2) and throughput numbers provide no details on exact baselines, statistical significance, error bars, data splits, or variance across runs, preventing full verification of the performance claims.
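As one way to picture the reporting requested in the second major comment, the sketch below summarizes speedups from repeated runs with a mean, a sample standard deviation, and an approximate 95% interval. The function name and the run values are hypothetical placeholders, and the normal-approximation interval is an assumption; with only a handful of runs a t-based interval would be more appropriate.

```python
import math
import statistics


def summarize_runs(speedups, z=1.96):
    """Mean, sample standard deviation, and approximate 95% interval.

    Assumption: z = 1.96 uses a normal approximation to the sampling
    distribution of the mean.
    """
    n = len(speedups)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(speedups)
    std = statistics.stdev(speedups)  # sample standard deviation (n - 1)
    half_width = z * std / math.sqrt(n)
    return {"n": n, "mean": mean, "std": std,
            "ci95": (mean - half_width, mean + half_width)}


# Placeholder per-run speedups; not measurements from the paper.
runs = [4.0, 4.2, 3.8, 4.1, 3.9]
summary = summarize_runs(runs)
```

Reporting the interval next to each headline ratio would show whether a gap such as 1.4x over a baseline exceeds run-to-run noise.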
minor comments (1)
- Abstract: the phrase 'training-time test' is used without a concise definition or pointer to its implementation details; add a brief description or section reference for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and commit to revising the manuscript to strengthen the evidence and clarity of our claims.
Point-by-point responses
- Referee: Abstract: the central claim that abandoning feature prediction and adding multi-layer fusion 'enable the draft model to fully benefit from scaling up training data' lacks supporting evidence; no scaling curves, performance-vs-data-volume plots, or ablations isolating direct token prediction (while holding fusion fixed) are presented, so observed gains could stem from hyperparameter changes or fusion alone rather than removal of the feature-prediction bottleneck.
  Authors: We acknowledge that the current manuscript does not present explicit scaling curves, performance-vs-data-volume plots, or ablations that isolate direct token prediction while holding multi-layer fusion fixed. Our central claim rests on the empirical observation that prior EAGLE variants (relying on feature prediction) exhibit limited gains from increased training data, whereas EAGLE-3 shows substantial improvements over EAGLE-2. To address the concern that gains may arise from other factors, we will add dedicated ablations and scaling plots in the revised version. These additions will directly test the contribution of abandoning feature prediction.
  revision: yes
- Referee: Experiments section: reported speedups (6.5x, 1.4x over EAGLE-2) and throughput numbers provide no details on exact baselines, statistical significance, error bars, data splits, or variance across runs, preventing full verification of the performance claims.
  Authors: We agree that the experimental section requires more precise reporting to enable verification. In the revision we will explicitly list the exact baseline implementations and versions (including EAGLE-2), report statistical significance tests, include error bars derived from multiple independent runs, detail the data splits used for training and evaluation, and quantify variance across runs. The public code release already supports reproducibility, but the text will be updated to include these details.
  revision: yes
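The ablation promised in the first response can be laid out as a two-factor grid, sketched below. The factor names and helper functions are hypothetical, and the grid is an assumption about how such an ablation might be organized; some cells (for example feature prediction combined with fused inputs) may not correspond to a configuration the authors consider meaningful.

```python
from itertools import product

# Two design factors named in the referee comment (labels are hypothetical).
PREDICTION_TARGETS = ("feature", "token")
FEATURE_SOURCES = ("top_layer", "multi_layer_fusion")


def ablation_grid():
    """Enumerate every combination of prediction target and feature source."""
    return [
        {"prediction_target": target, "feature_source": source}
        for target, source in product(PREDICTION_TARGETS, FEATURE_SOURCES)
    ]


def isolates_one_factor(config_a, config_b):
    """True if two configurations differ in exactly one factor."""
    return sum(config_a[key] != config_b[key] for key in config_a) == 1


grid = ablation_grid()
# Comparisons that change one factor while holding the other fixed.
single_factor_pairs = [
    (a, b)
    for i, a in enumerate(grid)
    for b in grid[i + 1:]
    if isolates_one_factor(a, b)
]
```

Only the single-factor pairs can attribute a gain to direct token prediction or to fusion alone, which is the separation the referee asks for.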
Circularity Check
No significant circularity; empirical speedups are measured outcomes, not reductions to fitted inputs or self-citations.
Full rationale
The paper's central claims rest on experimental measurements of speedup (up to 6.5x and 1.4x over EAGLE-2) across five tasks after introducing direct token prediction and training-time test fusion. No equations are presented that define the reported ratios in terms of internally fitted parameters, and no uniqueness theorems or ansatzes are imported via self-citation to force the architecture. The scaling-data benefit is asserted from observed performance differences rather than derived by construction from the method's own definitions. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- draft model training hyperparameters
axioms (1)
- domain assumption: Speculative sampling with a draft model reduces wall-clock latency while preserving the output distribution.
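The distribution-preserving property in the axiom above comes from the standard speculative-sampling acceptance rule, sketched here for a single drafted token. This is the generic chain-style rule, not code from the EAGLE-3 repository (EAGLE-style methods verify a tree of drafts), and the toy distributions `p` and `q` are made up for illustration.

```python
import random


def sample(dist, rng):
    """Draw one token from a dict mapping token -> probability."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]


def verify_draft_token(token, p_target, q_draft, rng):
    """Accept a drafted token with probability min(1, p/q), else resample.

    `token` must have been sampled from q_draft, so q_draft[token] > 0.
    On rejection, the replacement is drawn from the normalized residual
    max(0, p - q), which makes the final output follow p_target.
    """
    accept_prob = min(1.0, p_target.get(token, 0.0) / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = {t: max(0.0, p - q_draft.get(t, 0.0)) for t, p in p_target.items()}
    total = sum(residual.values())  # positive whenever a rejection can occur
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights)[0], False


# Toy target (p) and draft (q) distributions; illustrative values only.
p = {"a": 0.6, "b": 0.3, "c": 0.1}
q = {"a": 0.3, "b": 0.5, "c": 0.2}

rng = random.Random(0)
trials = 20000
counts = {t: 0 for t in p}
for _ in range(trials):
    drafted = sample(q, rng)
    output, accepted = verify_draft_token(drafted, p, q, rng)
    counts[output] += 1
freqs = {t: counts[t] / trials for t in p}
```

The empirical frequencies in `freqs` should track `p` rather than `q`, which is the sense in which the draft model changes latency but not the output distribution.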
Lean theorems connected to this paper
- HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: EAGLE-3 abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via training-time test, significantly enhancing performance and enabling the draft model to fully benefit from scaling up training data
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 34 Pith papers
- SSV: Sparse Speculative Verification for Efficient LLM Inference
  SpecSA is a sparse speculative-verification framework that integrates speculative decoding and dynamic sparse attention to achieve up to 3.49x end-to-end throughput and 6.86x kernel speedups on H100 GPUs for long-cont...
- PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding
  PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy ...
- Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
  A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
- SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
  SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
- Test-Time Speculation
  Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
  BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
- SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
  SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware ...
- An Empirical Study of Speculative Decoding on Software Engineering Tasks
  Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
- NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
  NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
- WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference
  WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% ac...
- Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
  Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
- KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
  KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
- VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping
  VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
- Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
  PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
  Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
  Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
- Enabling Performant and Flexible Model-Internal Observability for LLM Inference
  DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
- Test-Time Speculation
  TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
- PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
  PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
- SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
  SpecBlock achieves 8-19% higher speedup than EAGLE-3 in LLM speculative decoding by using repeated block expansions with hidden-state inheritance, a dynamic rank head, and a valid-prefix training mask.
- CASCADE: Context-Aware Relaxation for Speculative Image Decoding
  CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to...
- Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving
  SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
- RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding
  RACER unifies retrieval of exact matching patterns with logit-driven cues to produce better speculative drafts, achieving more than 2x speedup over autoregressive decoding and outperforming prior training-free specula...
- ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving
  ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.
- SMART: When is it Actually Worth Expanding a Speculative Tree?
  SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
- Multi-Token Prediction via Self-Distillation
  Self-distillation turns pretrained autoregressive LMs into multi-token predictors that decode over 3x faster with under 5% accuracy drop on GSM8K.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
- ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
  ETS performs training-free RL alignment for language models by energy-guided test-time scaling with Monte Carlo energy estimation and importance sampling acceleration.
- Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning
  Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt simil...
- SSV: Sparse Speculative Verification for Efficient LLM Inference
  SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVI...
- Lever: Speculative LLM Inference on Smartphones
  Lever optimizes the drafting, verification, and execution stages of speculative decoding for flash-backed LLM inference on smartphones, reporting 2.93x average latency reduction over baseline flash-offloaded inference.
- Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework
  CyberOps-Bots is a hierarchical LLM-empowered multi-agent RL framework that reports 68.5% higher network availability and 34.7% better jumpstart performance in new scenarios without retraining on real cloud datasets.
- LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
  LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup an...
- Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
  Workload-aware optimizations for LLM serving in AML and fraud detection yield substantial gains in throughput, latency, and GPU utilization on synthetic compliance prompts.