DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference · 2026 · arXiv 2601.19278

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 3

citation-polarity summary: background 3

representative citing papers

Cost-Aware Diffusion Draft Trees for Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

cs.CL · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.

Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Domino decouples causal dependency modeling from autoregressive draft execution via a parallel backbone plus lightweight causal head and a base-anchored training curriculum, reporting up to 5.49x speedup.

Draft-OPD: On-Policy Distillation for Speculative Draft Models

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Draft-OPD applies on-policy distillation via target-assisted generation and error replay to train speculative draft models, yielding over 5x lossless acceleration and gains over EAGLE-3 and DFlash.

FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.

Enabling Performant and Flexible Model-Internal Observability for LLM Inference

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.

PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

Accelerating Speculative Decoding with Block Diffusion Draft Trees

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

DDTree builds a draft tree from a block diffusion drafter using a best-first heap on its output probabilities and verifies the tree in one target-model pass via an ancestor-only attention mask, increasing average accepted tokens per round.
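
The two mechanisms named in the DDTree summary, best-first tree construction and an ancestor-only attention mask, can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the drafter returns an independent probability table per draft position, that a node's score is the product of probabilities along its path, and that only the `top_k` tokens per position are expanded. The names `build_draft_tree`, `ancestor_mask`, `probs`, `budget`, and `top_k` are hypothetical.

```python
import heapq
import math


def build_draft_tree(probs, budget, top_k=2):
    """Grow a draft tree best-first.

    probs[d] maps token -> probability at draft position d
    (an assumed, factorized drafter output). Returns a list of nodes;
    a node's index in the list is its id, and parent == -1 means root-level.
    """
    nodes = []
    heap = []  # entries: (-log_score, tie_breaker, token, parent, depth)
    tie = 0

    def push_children(parent, depth, base_logp):
        nonlocal tie
        if depth >= len(probs):
            return
        best = sorted(probs[depth].items(), key=lambda kv: -kv[1])[:top_k]
        for token, p in best:
            if p > 0:
                entry = (-(base_logp + math.log(p)), tie, token, parent, depth)
                heapq.heappush(heap, entry)
                tie += 1

    push_children(-1, 0, 0.0)
    while heap and len(nodes) < budget:
        neg_logp, _, token, parent, depth = heapq.heappop(heap)
        nodes.append({"token": token, "parent": parent,
                      "depth": depth, "logp": -neg_logp})
        # The newly added node becomes the parent of candidates one level deeper.
        push_children(len(nodes) - 1, depth + 1, -neg_logp)
    return nodes


def ancestor_mask(nodes):
    """mask[i][j] is True iff node j is node i itself or one of its ancestors."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = nodes[j]["parent"]
    return mask


if __name__ == "__main__":
    toy_probs = [{"a": 0.6, "b": 0.4}, {"c": 0.7, "d": 0.3}]
    tree = build_draft_tree(toy_probs, budget=4)
    for node in tree:
        print(node["token"], node["parent"], node["depth"])
    for row in ancestor_mask(tree):
        print(row)
```

With such a mask, each drafted token attends only to its own path, so all branches could in principle be scored in a single target-model forward pass. How DDTree handles the prompt prefix, position ids, and acceptance is not covered by this sketch.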

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

D-PACE derives per-position weights from a surrogate of expected accepted draft length to shift training focus toward currently limiting positions, yielding measured gains in wall-clock speedup and emitted length across benchmarks.
