LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

2025 · cs.CL · arXiv 2502.17421

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that depend heavily on this capability. Speculative decoding (SD) is a promising lossless acceleration technique, in contrast to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands of draft models due to a large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, which achieves up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.
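
To make the "lossless" property mentioned in the abstract concrete, the following is a minimal sketch of a generic greedy draft-then-verify loop. It is not the LongSpec implementation: the two toy "models" (`target_next`, `draft_next`), the vocabulary size, and the draft length `k` are hypothetical stand-ins, and verification is shown as a sequential loop where a real system would use one batched forward pass of the target model (LongSpec additionally uses tree attention, which is not shown here).

```python
from typing import Callable, List

# A "model" here is just a function from a token prefix to the next greedy token.
Model = Callable[[List[int]], int]

VOCAB = 50  # hypothetical vocabulary size for the toy models


def target_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the large target model (deterministic toy rule)."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the small draft model.

    It agrees with the target except when the prefix length is a multiple of 4,
    so some drafted tokens get rejected.
    """
    tok = target_next(prefix)
    return (tok + 1) % VOCAB if len(prefix) % 4 == 0 else tok


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference output."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) Draft k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) Verify against the target. Every token that ends up in `accepted`
        #    equals the target's own greedy choice at that position.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # All k drafts were accepted: the target yields one extra token.
            accepted.append(target(seq + accepted))

        seq.extend(accepted)
    return seq[:goal]
```

Because every emitted token is the target model's own greedy choice, the output matches plain greedy decoding with the target exactly; the draft model only affects how many tokens are committed per verification step, which is where any speedup would come from.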

representative citing papers

See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

LVSpec introduces the first training-free loosely speculative decoding framework for Video-LLMs that identifies sparse visual-relevant tokens for strict verification while tolerating position shifts for semantic fillers, delivering 2.7-2.9x speedup with over 99.8% performance retention.

Test-Time Speculation

cs.CL · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

cs.CL · 2026-04-29 · unverdicted · novelty 6.0 · 2 refs

KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

citing papers explorer

See the Forest for the Trees: Loosely Speculative Decoding via Visual-Semantic Guidance for Efficient Inference of Video LLMs

cs.CL · 2026-04-07 · unverdicted · none · ref 7 · internal anchor

Test-Time Speculation

cs.CL · 2026-05-10 · unverdicted · none · ref 17 · 2 links · internal anchor

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

cs.CL · 2026-04-29 · unverdicted · none · ref 15 · 2 links · internal anchor
