arxiv: 2602.05774 · v4 · pith:WH5NJYASnew · submitted 2026-02-05 · 💻 cs.LG · cs.AI · math.PR

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

classification 💻 cs.LG cs.AI math.PR
keywords decoding, draft, acceptance, speculative, variational, while, inference, latent

Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws Monte Carlo samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
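
The "acceptance length" that the abstract refers to can be illustrated with the standard speculative sampling rule, in which the target model accepts a drafted token with probability min(1, p/q) and verification stops at the first rejection. The sketch below is not the VSD training procedure and is not taken from the paper; the function names `accepted_length` and `expected_accepted_length` and the example probabilities are hypothetical, and the probabilities are assumed to be given for one fixed draft path.

```python
import random


def accepted_length(q_probs, p_probs, rng):
    """Simulate verification of one draft path (illustrative sketch only).

    q_probs[i]: draft-model probability of the i-th drafted token (assumed > 0)
    p_probs[i]: target-model probability of the same token
    Each token is accepted with probability min(1, p/q); verification
    stops at the first rejection and returns the number of accepted tokens.
    """
    n = 0
    for q, p in zip(q_probs, p_probs):
        if rng.random() < min(1.0, p / q):
            n += 1
        else:
            break
    return n


def expected_accepted_length(q_probs, p_probs):
    """Expected number of accepted tokens for this fixed path.

    Uses E[N] = sum_k P(N >= k), where P(N >= k) is the product of the
    first k per-token acceptance probabilities.
    """
    total, prefix = 0.0, 1.0
    for q, p in zip(q_probs, p_probs):
        prefix *= min(1.0, p / q)  # probability that tokens 1..k all pass
        total += prefix
    return total


if __name__ == "__main__":
    # Hypothetical per-token probabilities for a three-token draft path.
    q = [0.5, 0.5, 0.4]
    p = [0.5, 0.25, 0.4]
    print(expected_accepted_length(q, p))
    print(accepted_length(q, p, random.Random(0)))
```

Under this rule a single poorly matched token early in the path caps the contribution of every later token, which is one way to see why a sequence-level acceptance objective can differ from a per-token likelihood objective.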

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters

    cs.AI · 2026-07 · unverdicted · novelty 6.0

    Accept-Until-Fail training improves average accepted block length in speculative decoding from 2.40 to 2.61 by limiting cross-entropy support to the drafter's first predicted failure point.