Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

Jianshu Li; Jing Huang; Pan Zhou; Xiandong Zou

arxiv: 2602.05774 · v4 · pith:WH5NJYAS · submitted 2026-02-05 · cs.LG · cs.AI · math.PR


classification: cs.LG · cs.AI · math.PR

keywords: decoding, draft, acceptance, speculative, variational, while, inference, latent


Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws Monte Carlo samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
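The variational formulation named in the abstract can be illustrated with the generic evidence lower bound for a latent-variable model. This is a sketch under assumed notation, not the paper's own derivation: x denotes the context, z a latent draft path, A = 1 the event that the target model accepts the draft, p_\theta(z | x) the draft model's path distribution, and q(z) an arbitrary proposal distribution. None of these symbols appear in the abstract, and the exact objective in the paper (including its path-level utility, ARW, and CAR terms) may differ.

```latex
% Generic ELBO on the marginal acceptance probability (assumed notation).
\begin{align}
\log p_\theta(A = 1 \mid x)
  &= \log \sum_{z} p_\theta(z \mid x)\, p(A = 1 \mid z, x) \\
  &= \log \mathbb{E}_{q(z)}\!\left[
       \frac{p_\theta(z \mid x)\, p(A = 1 \mid z, x)}{q(z)} \right] \\
  &\ge \mathbb{E}_{q(z)}\!\left[ \log p(A = 1 \mid z, x) \right]
       - \mathrm{KL}\!\left( q(z) \,\|\, p_\theta(z \mid x) \right)
\end{align}
% The inequality is Jensen's; it is tight when
% q(z) = p_\theta(z \mid A = 1, x) \propto p_\theta(z \mid x)\, p(A = 1 \mid z, x).
```

In an Expectation-Maximization reading of this bound, an E-step would approximate the tight posterior q(z) with samples of accepted paths, and an M-step would update \theta to raise the bound on those samples; how the paper's oracle-filtered posterior and weighted likelihood realize these steps is not specified in the abstract.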


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters
cs.AI · 2026-07 · unverdicted · novelty 6.0

Accept-Until-Fail training improves average accepted block length in speculative decoding from 2.40 to 2.61 by limiting cross-entropy support to the drafter's first predicted failure point.