pith. machine review for the scientific record.

arxiv: 2601.05524 · v3 · submitted 2026-01-09 · 💻 cs.CL

Recognition: unknown

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Authors on Pith: no claims yet
classification 💻 cs.CL
keywords double retrieval, speculative, draft model, speedup, decoding, extensive
0 comments
read the original abstract

Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stalls caused by mid-sequence rejection of early token errors. To address these limitations, we introduce DOUBLE (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, the draft model executes iterative retrieval speculation to break the theoretical speedup limit, while the target model performs authoritative retrieval to generate multi-token guidance, alleviating rejections without rollback. DOUBLE is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedups of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the advanced EAGLE-3 method, which requires extensive model training.
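The abstract leans on the mechanics of speculative decoding, so a minimal sketch may help readers outside the area. Below is the vanilla SD loop that PSD and DOUBLE build on, not the paper's method: a cheap draft model proposes a block of tokens, the target model verifies them, an accepted prefix is kept, and the target supplies one corrective token at the first rejection. Every function here (draft_next, target_accepts, target_next) and the fixed acceptance probability are hypothetical stand-ins; a real verifier accepts a drafted token x with probability min(1, p_target(x)/p_draft(x)), which is what keeps the scheme lossless.

```python
# Minimal sketch of vanilla speculative decoding (the SD baseline the
# abstract contrasts with). Toy models over a 5-symbol vocabulary stand
# in for real LLMs; nothing here is the paper's implementation.
import random

random.seed(0)
VOCAB = list("abcde")
GAMMA = 4          # draft block length
ACCEPT_RATE = 0.8  # toy stand-in for min(1, p_target / p_draft)

def draft_next(prefix):
    """Hypothetical cheap draft model: propose one token."""
    return random.choice(VOCAB)

def target_accepts(prefix, token):
    """Hypothetical verification: does the target keep this draft token?"""
    return random.random() < ACCEPT_RATE

def target_next(prefix):
    """Hypothetical target model: one authoritative token."""
    return random.choice(VOCAB)

def speculative_decode(prompt, max_len=24):
    out = list(prompt)
    while len(out) < max_len:
        # 1) Draft a block of GAMMA tokens cheaply; the draft model
        #    conditions on its own speculation as it goes.
        block = []
        for _ in range(GAMMA):
            block.append(draft_next(out + block))
        # 2) One target pass verifies the block; stop at first rejection.
        for token in block:
            if not target_accepts(out, token):
                break
            out.append(token)
        # 3) The target always contributes one token (the correction on
        #    rejection, or a bonus token on full acceptance), so each
        #    target pass yields at least one token of progress.
        out.append(target_next(out))
    return "".join(out[:max_len])

print(speculative_decode("ab"))
```

The speedup ceiling the abstract mentions also falls out of this loop: accepted tokens can never arrive faster than the draft model emits them, so even with drafting and verification perfectly overlapped, as in PSD, the speedup is capped near the draft-to-target speed ratio. DOUBLE's claim, per the abstract, is that iterative retrieval speculation on the draft side breaks that cap while the target's multi-token retrieval guidance avoids rollback on rejection; neither mechanism is shown in this sketch.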

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indistinguishable...

  2. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 conditional novelty 6.0

    SPECTRE delivers up to 2.28x speedup on large-model LLM inference by turning idle tail-model services into remote speculative drafters using hybrid parallel decoding and priority scheduling.

  3. SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

    cs.DC 2026-05 unverdicted novelty 6.0

    SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.

  4. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance rates in speculative decoding but delivers only marginal end-to-end speedups because shallow drafters cannot accurately estimate target queries and receive sparse gr...

  5. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

  6. FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

    cs.AI 2026-04 unverdicted novelty 6.0

    FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.

  7. ECHO: Elastic Speculative Decoding with Sparse Gating for High-Concurrency Scenarios

    cs.DC 2026-03 unverdicted novelty 5.0

    ECHO uses sparse gating and elastic budget pivoting in a super-tree structure to achieve up to 5.35x speedup for LLM inference under high concurrency.