arxiv: 2601.05149 · v2 · pith:PK6U6AEVnew · submitted 2026-01-08 · 💻 cs.CV

Multi-Scale Local Speculative Decoding for Image Generation

classification 💻 cs.CV
keywords decoding, image, speculative, local, resampling, parallel, rejection, acceleration

Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with an up-sampling step to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. When integrated with parallel decoding resampling, MuLo-SD achieves substantial speedups -- up to $\mathbf{5\times}$ -- outperforming both speculative decoding and parallel decoding baselines in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity. Project page is available at https://qualcomm-ai-research.github.io/mulo-sd-webpage/ .
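
The draft, up-sample, verify, and local-resample loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the nearest-neighbour up-sampling, the per-token acceptance test min(1, p/q) borrowed from standard speculative decoding, and the square neighbourhood of a given radius are all assumptions made for illustration. The abstract does not specify any of these details.

```python
import random

def upsample_nearest(tokens, factor):
    """Assumed up-sampler: repeat each low-resolution token into a factor x factor block."""
    height, width = len(tokens) * factor, len(tokens[0]) * factor
    return [[tokens[r // factor][c // factor] for c in range(width)]
            for r in range(height)]

def verify(p_target, q_draft, rng):
    """Assumed acceptance test: accept each token with probability min(1, p/q).

    p_target and q_draft hold, per position, the probability that the target
    and draft models assign to the drafted token (q is assumed to be > 0).
    """
    return [[rng.random() < min(1.0, p / q) for p, q in zip(p_row, q_row)]
            for p_row, q_row in zip(p_target, q_draft)]

def expand_rejections(accepted, radius):
    """Mark every position within `radius` of a rejected token for resampling."""
    height, width = len(accepted), len(accepted[0])
    resample = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if accepted[r][c]:
                continue
            for rr in range(max(0, r - radius), min(height, r + radius + 1)):
                for cc in range(max(0, c - radius), min(width, c + radius + 1)):
                    resample[rr][cc] = True
    return resample

# Toy usage: a 2x2 draft grid up-sampled to 4x4, then verified in one pass.
rng = random.Random(0)
candidates = upsample_nearest([[3, 7], [1, 4]], factor=2)
q = [[0.5] * 4 for _ in range(4)]   # hypothetical draft probabilities
p = [[0.5] * 4 for _ in range(4)]   # hypothetical target probabilities
p[1][1] = 0.0                       # force one rejection at position (1, 1)
accepted = verify(p, q, rng)
resample_mask = expand_rejections(accepted, radius=1)
```

In this sketch only the positions flagged in `resample_mask` would be regenerated by the target model, whereas raster-scan speculative decoding would discard every token after the first rejection. How the neighbourhood is chosen and expanded in MuLo-SD itself is a detail of the paper, not of this example.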

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Knowledge Distillation for Visual Autoregressive Models

    cs.CV 2026-06 unverdicted novelty 6.0

    VarKD is a distillation framework for visual AR models that uses student samples and selective teacher supervision to reduce token ambiguity, outperforming prior baselines on ImageNet.

  2. SSD: Spatially Speculative Decoding Accelerates Autoregressive Image Generation

    cs.CV 2026-06 unverdicted novelty 5.0

    SSD predicts multiple spatially adjacent tokens at once in autoregressive image models, claiming up to 13.3x inference speedup on DPG-Bench and GenEval with maintained fidelity.