hub Canonical reference

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang · 2024 · cs.LG · arXiv 2401.15077

Canonical reference. 80% of citing Pith papers cite this work as background.

49 Pith papers citing it

Background 80% of classified citations

open full Pith review browse 49 citing papers arXiv PDF

abstract

Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, while maintaining the distribution of the generated text.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 method 2

citation-polarity summary

background 8 use method 2

representative citing papers

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

Certified Speculative Execution for Untrusted AI Agents

cs.CR · 2026-06-30 · unverdicted · novelty 7.0

CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.

Cost-Aware Diffusion Draft Trees for Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.

Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs

cs.RO · 2026-05-13 · unverdicted · novelty 7.0

A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

cs.IT · 2026-04-20 · unverdicted · novelty 7.0

WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.

Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

DMax: Aggressive Parallel Decoding for dLLMs

cs.LG · 2026-04-09 · conditional · novelty 7.0 · 2 refs

DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

cs.RO · 2026-03-02 · unverdicted · novelty 7.0

KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.

A Markov Categorical Framework for Language Modeling

cs.LG · 2025-07-25 · unverdicted · novelty 7.0

A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.

Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning

cs.CL · 2026-06-29 · unverdicted · novelty 6.0

PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.

Depth Exploration for LLM Decoding

cs.LG · 2026-06-28 · unverdicted · novelty 6.0

DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.

Draft-OPD: On-Policy Distillation for Speculative Draft Models

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.

Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

cs.CL · 2026-05-22 · unverdicted · novelty 6.0 · 2 refs

Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.

VeriCache: Turning Lossy KV Cache into Lossless LLM Inference

cs.AR · 2026-05-17 · unverdicted · novelty 6.0

VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.

Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference

cs.NI · 2026-05-17 · unverdicted · novelty 6.0

Formulates speculation control in distributed LLM inference as optimal stopping, proves delay-monotone thresholds, gives UCB-SpecStop with regret bounds, and reports up to 22% latency reduction on a Jetson-RTX testbed.

An Interpretable Latency Model for Speculative Decoding in LLM Serving

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.

Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.

citing papers explorer

Showing 49 of 49 citing papers.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding cs.CL · 2026-05-13 · unverdicted · none · ref 10 · 2 links · internal anchor
Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
Certified Speculative Execution for Untrusted AI Agents cs.CR · 2026-06-30 · unverdicted · none · ref 12 · internal anchor
CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.
Cost-Aware Diffusion Draft Trees for Speculative Decoding cs.CL · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.
Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture cs.AI · 2026-05-29 · unverdicted · none · ref 86 · internal anchor
Proposes the Intelligent Computing Architecture (ICA) as a six-layer framework with dual probabilistic-deterministic planes and three Amdahl-style heuristics to unify design of LLM-based systems.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 41 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs cs.RO · 2026-05-13 · unverdicted · none · ref 12 · internal anchor
A new speculative inference system speeds up diffusion VLAs to 19.1 ms average latency (3.04x faster) on LIBERO by replacing most full 58 ms inferences with 7.8 ms draft rounds while preserving task performance.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding cs.LG · 2026-05-11 · unverdicted · none · ref 7 · internal anchor
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning cs.LG · 2026-05-09 · unverdicted · none · ref 12 · internal anchor
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.
NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization cs.LG · 2026-04-20 · unverdicted · none · ref 29 · internal anchor
NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference cs.IT · 2026-04-20 · unverdicted · none · ref 26 · internal anchor
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding cs.CV · 2026-04-14 · unverdicted · none · ref 26 · internal anchor
Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
DMax: Aggressive Parallel Decoding for dLLMs cs.LG · 2026-04-09 · conditional · none · ref 43 · 2 links · internal anchor
DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models cs.RO · 2026-03-02 · unverdicted · none · ref 15 · internal anchor
KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
A Markov Categorical Framework for Language Modeling cs.LG · 2025-07-25 · unverdicted · none · ref 18 · internal anchor
A Markov category framework for language models provides an information-theoretic rationale for speculative decoding and shows that a quadratic surrogate to negative log-likelihood induces generalized CCA alignment in linear-softmax heads after normalization.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning cs.CL · 2026-06-29 · unverdicted · none · ref 22 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
Depth Exploration for LLM Decoding cs.LG · 2026-06-28 · unverdicted · none · ref 31 · internal anchor
DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation cs.LG · 2026-05-30 · unverdicted · none · ref 76 · internal anchor
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
Draft-OPD: On-Policy Distillation for Speculative Draft Models cs.CL · 2026-05-28 · unverdicted · none · ref 17 · internal anchor
Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.
Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving cs.CL · 2026-05-22 · unverdicted · none · ref 8 · 2 links · internal anchor
Fast-dDrive is a block-diffusion VLA that reports SOTA accuracy on WOD-E2E and nuScenes driving benchmarks together with 12x throughput over autoregressive baselines via section scaffolds and test-time averaging.
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference cs.AR · 2026-05-17 · unverdicted · none · ref 31 · internal anchor
VeriCache turns lossy KV cache compression into lossless LLM inference by drafting with compressed cache and verifying drafts with full cache, achieving up to 4x throughput with identical outputs.
Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference cs.NI · 2026-05-17 · unverdicted · none · ref 19 · internal anchor
Formulates speculation control in distributed LLM inference as optimal stopping, proves delay-monotone thresholds, gives UCB-SpecStop with regret bounds, and reports up to 22% latency reduction on a Jetson-RTX testbed.
An Interpretable Latency Model for Speculative Decoding in LLM Serving cs.LG · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
The paper presents an interpretable latency model for speculative decoding that infers effective batch size via Little's Law and decomposes demand to predict and explain performance across serving loads, validated on vLLM measurements.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing cs.CL · 2026-05-14 · unverdicted · none · ref 5 · internal anchor
PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding cs.CL · 2026-05-09 · unverdicted · none · ref 19 · internal anchor
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 24 · internal anchor
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
CoVSpec: Efficient Device-Edge Co-Inference for Vision-Language Models via Speculative Decoding cs.AI · 2026-05-04 · unverdicted · none · ref 7 · internal anchor
CoVSpec achieves up to 2.21x higher throughput and over 96% lower communication overhead for device-edge VLM inference via training-free visual token reduction, adaptive drafting, and decoupled parallel verification-correction in speculative decoding.
Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding cs.CL · 2026-05-01 · unverdicted · none · ref 3 · internal anchor
EVICT adaptively truncates draft trees in MoE speculative decoding by combining drafter signals with profiled costs to retain only cost-effective prefixes, delivering up to 2.35x speedup over autoregressive decoding.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving cs.LG · 2026-04-29 · unverdicted · none · ref 33 · internal anchor
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference cs.AR · 2026-04-28 · unverdicted · none · ref 18 · internal anchor
NVLLM offloads FFN computations to integrated 3D NAND flash with page-level access and keeps attention in DRAM, delivering 16.7x-37.9x speedups over GPU out-of-core baselines for models up to 30B parameters.
SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration cs.CL · 2026-04-14 · unverdicted · none · ref 3 · internal anchor
SpecBound achieves up to 2.33x wall-time speedup in LLM inference via adaptive bounded self-speculation and layer-wise confidence calibration while preserving exact output equivalence.
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA cs.RO · 2026-04-03 · unverdicted · none · ref 21 · internal anchor
SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.
GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts cs.AI · 2026-01-08 · unverdicted · none · ref 12 · internal anchor
GlimpRouter uses the entropy of the first token in each reasoning step to decide whether to invoke a large model, yielding 10.7% higher accuracy and 25.9% lower latency than a standalone large model on AIME25.
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning cs.DC · 2025-11-18 · unverdicted · none · ref 20 · internal anchor
Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models cs.CV · 2024-02-19 · unverdicted · none · ref 72 · internal anchor
DriveVLM adds vision-language models with scene description, analysis, and hierarchical planning modules to autonomous driving, paired with a hybrid DriveVLM-Dual system tested on nuScenes and SUP-AD datasets and deployed on a production vehicle.
BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding cs.CL · 2026-06-30 · unverdicted · none · ref 24 · internal anchor
BlockPilot is an instance-adaptive policy that predicts optimal block size from the prefilling representation for diffusion speculative decoding, reporting 5.92 acceptance length and 4.20x speedup on Qwen3-4B.
EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction cs.CL · 2026-06-25 · unverdicted · none · ref 8 · internal anchor
EntMTP is a training-free entropy-guided scheduler for multi-token prediction that dynamically selects from task-specific Pareto-optimal trees to accelerate LLM inference by up to 1.36x on benchmarks without quality loss.
RTP-LLM: High-Performance Alibaba LLM Inference Engine cs.OS · 2026-05-28 · unverdicted · none · ref 33 · internal anchor
RTP-LLM is a new LLM inference engine achieving 4.7x-6.3x model loading speedup and 1.12x-2.52x throughput gains over vLLM and SGLang via disaggregated phases, multi-tier KV cache, and modular optimizations in production at Alibaba.
SSV: Sparse Speculative Verification for Efficient LLM Inference cs.OS · 2026-05-19 · unverdicted · none · ref 23 · internal anchor
SSV presents a sparse speculative-verification framework that resolves mismatches between speculative decoding and dynamic sparse attention to deliver up to 3.49x end-to-end throughput and 6.86x kernel speedups on NVIDIA H100 GPUs.
Multi-Token Residual Prediction cs.LG · 2026-05-12 · unverdicted · none · ref 21 · 2 links · internal anchor
MRP predicts logit residuals between adjacent denoising steps in DLMs from backbone hidden states to support efficient multi-token denoising, yielding up to 1.4x lossless speedup or 22.6-point accuracy gains on code and math tasks.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding cs.AR · 2026-05-10 · unverdicted · none · ref 9 · internal anchor
A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios cs.DC · 2026-03-10 · unverdicted · none · ref 11 · internal anchor
ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.
ConFu: Contemplate the Future for Better Speculative Sampling cs.CL · 2026-03-09 · unverdicted · none · ref 12 · internal anchor
ConFu boosts speculative decoding acceptance rates 8-20% over EAGLE-3 by letting draft models use contemplate tokens and MoE to anticipate future generation direction.
LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation cs.CL · 2025-07-02 · unverdicted · none · ref 19 · internal anchor
LogitSpec accelerates retrieval-based speculative decoding by speculating the next-next token from the last logit and retrieving relevant references for both next and next-next tokens, reporting up to 2.61x speedup and 3.28 mean accepted tokens.
CSD: Content-aware Speculative Decoding for Efficient Image Generation cs.CV · 2026-06-26 · unverdicted · none · ref 6 · internal anchor
CSD adds content-aware entropy relaxation and a distribution alignment filter to speculative decoding, raising acceptance rates in low-detail image areas while keeping output aligned with the target model.
Reference-Augmented Learning for Precise Tracking Policy of Tendon-Driven Continuum Robots cs.RO · 2026-04-28 · unverdicted · none · ref 18 · internal anchor
A reference-augmented offline learning framework for 6-DOF tracking control of tendon-driven continuum robots achieves 50.9% lower average position error than non-augmented baselines.
Unlocking the Edge deployment and ondevice acceleration of multi-LoRA enabled one-for-all foundational LLM cs.DC · 2026-04-20 · unverdicted · none · ref 15 · internal anchor
A framework combines multi-LoRA runtime switching, multi-stream stylistic decoding, and Dynamic Self-Speculative Decoding with INT4 quantization to achieve 4-6x memory and latency gains for on-device inference of a one-for-all foundational LLM on Qualcomm chipsets.
SMART: When is it Actually Worth Expanding a Speculative Tree? cs.DC · 2026-04-09 · unreviewed · ref 20 · internal anchor
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting cs.CV · 2026-03-30 · unreviewed · ref 15 · internal anchor

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer