pith. machine review for the scientific record.

arxiv: 2601.05524 · v3 · submitted 2026-01-09 · 💻 cs.CL

Recognition: unknown

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Authors on Pith: no claims yet
classification 💻 cs.CL
keywords double retrieval, speculative, draft model, speedup, decoding, extensive
0 comments
read the original abstract

Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stalls caused by mid-sequence rejection of early token errors. To address these limitations, we introduce DOUBLE (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, the draft model executes iterative retrieval speculation to break the theoretical speedup limit, while the target model performs authoritative retrieval to generate multi-token guidance, alleviating rejections without rollback. DOUBLE is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedups of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the advanced EAGLE-3 method, which requires extensive model training.
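The abstract leans on the mechanics of speculative decoding, so a minimal sketch may help readers outside the area. Below is the vanilla SD loop that PSD and DOUBLE build on, not the paper's method: a cheap draft model proposes a block of tokens, the target model verifies them, an accepted prefix is kept, and the target supplies one corrective token at the first rejection. Every function here (draft_next, target_accepts, target_next) and the fixed acceptance probability are hypothetical stand-ins; a real verifier accepts a drafted token x with probability min(1, p_target(x)/p_draft(x)), which is what keeps the scheme lossless.

```python
# Minimal sketch of vanilla speculative decoding (the SD baseline the
# abstract contrasts with). Toy models over a 5-symbol vocabulary stand
# in for real LLMs; nothing here is the paper's implementation.
import random

random.seed(0)
VOCAB = list("abcde")
GAMMA = 4          # draft block length
ACCEPT_RATE = 0.8  # toy stand-in for min(1, p_target / p_draft)

def draft_next(prefix):
    """Hypothetical cheap draft model: propose one token."""
    return random.choice(VOCAB)

def target_accepts(prefix, token):
    """Hypothetical verification: does the target keep this draft token?"""
    return random.random() < ACCEPT_RATE

def target_next(prefix):
    """Hypothetical target model: one authoritative token."""
    return random.choice(VOCAB)

def speculative_decode(prompt, max_len=24):
    out = list(prompt)
    while len(out) < max_len:
        # 1) Draft a block of GAMMA tokens cheaply; the draft model
        #    conditions on its own speculation as it goes.
        block = []
        for _ in range(GAMMA):
            block.append(draft_next(out + block))
        # 2) One target pass verifies the block; stop at first rejection.
        for token in block:
            if not target_accepts(out, token):
                break
            out.append(token)
        # 3) The target always contributes one token (the correction on
        #    rejection, or a bonus token on full acceptance), so each
        #    target pass yields at least one token of progress.
        out.append(target_next(out))
    return "".join(out[:max_len])

print(speculative_decode("ab"))
```

The speedup ceiling the abstract mentions also falls out of this loop: accepted tokens can never arrive faster than the draft model emits them, so even with drafting and verification perfectly overlapped, as in PSD, the speedup is capped near the draft-to-target speed ratio. DOUBLE's claim, per the abstract, is that iterative retrieval speculation on the draft side breaks that cap while the target's multi-token retrieval guidance avoids rollback on rejection; neither mechanism is shown in this sketch.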

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable...

  2. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 conditional novelty 6.0

    SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.

  3. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

  4. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance rates in speculative decoding but delivers only marginal end-to-end speedups because shallow drafters cannot accurately estimate target queries and receive sparse gr...

  5. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

  6. FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.

  7. ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

    cs.DC 2026-03 unverdicted novelty 5.0

    ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.