VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

arxiv: 2511.13587 · v3 · submitted 2025-11-17 · 💻 cs.CV · cs.AI

Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, Zhi Wang

Pith reviewed 2026-05-17 21:30 UTC · model grok-4.3

keywords speculative decoding · visual autoregressive generation · inference acceleration · verification skipping · feature caching · image generation · token-level reuse

The pith

VVS cuts the number of target-model forward passes in visual autoregressive generation by a factor of 2.8, relative to vanilla autoregressive decoding, by skipping redundant verification steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a speculative decoding approach tailored to visual autoregressive image generation models that normally predict tokens sequentially and incur high latency. It shows that many verification steps after drafting can be skipped because visual tokens are interchangeable and drafting features often remain reusable. The method adds a token selector that decides which steps need no verification, caches and reuses features at the token level, and schedules the skipped steps at fine granularity. These changes cut the number of full target-model evaluations while keeping generated image quality close to standard decoding. The result is faster inference with a better speed-to-quality balance than earlier speculative decoding setups.

Core claim

Verification redundancy and stale feature reusability in the drafting stage of speculative decoding permit partial verification skipping without meaningful quality loss. The VVS framework realizes this by combining a verification-free token selector with dynamic truncation, token-level feature caching and reuse, and fine-grained skipped-step scheduling, thereby reducing the number of target-model forward passes by a factor of 2.8 relative to vanilla autoregressive decoding while preserving competitive generation quality.
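To make the pass-counting argument concrete, here is a minimal toy sketch of a draft-then-verify loop in which some iterations skip verification. Everything in it is an assumption made for illustration: the function name `count_target_passes`, the fixed `accept_len` and `skip_len`, and the periodic schedule `skip_every` are hypothetical and stand in for, rather than reproduce, the paper's selector, cache, and scheduler.

```python
def count_target_passes(seq_len, accept_len, skip_len=0, skip_every=0):
    """Count target-model forward passes needed to emit `seq_len` tokens.

    Toy model (an assumption, not the paper's algorithm):
      - a verified iteration costs one target pass and commits `accept_len` tokens;
      - a verification-free iteration costs no target pass and commits `skip_len` tokens;
      - every `skip_every`-th iteration is verification-free (0 disables skipping).
    """
    if accept_len < 1:
        raise ValueError("accept_len must be at least 1")
    tokens, passes, iteration = 0, 0, 0
    while tokens < seq_len:
        iteration += 1
        skip = skip_every > 0 and skip_len > 0 and iteration % skip_every == 0
        if skip:
            tokens += skip_len      # drafted tokens committed without the target model
        else:
            passes += 1             # one target forward pass verifies the draft
            tokens += accept_len
    return passes


# Illustrative settings only; these are not measurements from the paper.
vanilla = count_target_passes(1024, accept_len=1)       # plain AR: one token per pass
standard_sd = count_target_passes(1024, accept_len=2)   # every iteration is verified
with_skips = count_target_passes(1024, accept_len=2, skip_len=2, skip_every=2)
print(vanilla, standard_sd, with_skips)
```

In this toy model the saving comes entirely from iterations that commit tokens without a target pass, which is the mechanism the core claim describes; how many such iterations are safe is exactly what the selector and scheduler have to decide.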

What carries the argument

The VVS framework that integrates verification-free token selection with dynamic truncation, token-level feature caching, and skipped-step scheduling to enable partial verification skipping during speculative decoding.
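As a rough illustration of what token-level feature caching and reuse could look like, the sketch below stores one feature per token position together with the step at which it was written, and refuses to return entries older than a staleness limit. The class name `TokenFeatureCache`, the `max_staleness` limit, and the list-of-floats feature type are all assumptions for this example; the page does not describe the paper's actual cache layout.

```python
class TokenFeatureCache:
    """Per-token feature store with a staleness limit (illustrative only)."""

    def __init__(self, max_staleness):
        self.max_staleness = max_staleness
        self._store = {}  # position -> (step_written, feature)

    def put(self, position, feature, step):
        """Record the feature computed for `position` at decoding step `step`."""
        self._store[position] = (step, feature)

    def get(self, position, current_step):
        """Return the cached feature, or None if it is missing or too stale."""
        entry = self._store.get(position)
        if entry is None:
            return None
        step_written, feature = entry
        if current_step - step_written > self.max_staleness:
            return None
        return feature


cache = TokenFeatureCache(max_staleness=2)
cache.put(position=5, feature=[0.1, 0.2], step=3)  # written at a verified step
fresh = cache.get(5, current_step=4)               # staleness 1: reusable
expired = cache.get(5, current_step=6)             # staleness 3: must be recomputed
```

Keying the cache by token position rather than by step matches the idea that reused features may originate from several different earlier steps.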

If this is right

The number of target model forward passes drops by a factor of 2.8 compared with vanilla autoregressive decoding.
Image generation quality remains competitive with conventional speculative decoding.
The speed-quality trade-off improves over existing speculative decoding methods for visual autoregressive models.
The overall speculative decoding paradigm gains a new direction through selective verification skipping.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same skipping logic could extend to autoregressive generation of video or audio sequences where tokens also show high interchangeability.
Combining VVS with other latency-reduction techniques such as early exiting or quantization might yield further gains.
The feature-reuse idea may help in non-visual domains that already use draft-then-verify pipelines.
Empirical tests on larger visual autoregressive models would show whether the 2.8x reduction scales.

Load-bearing premise

Visual tokens are interchangeable enough and drafting-stage redundancy plus feature reuse are reliable enough that skipping selected verification steps leaves generation quality intact.

What would settle it

Generate images on a standard benchmark with VVS and measure either no reduction in target forward passes or a clear increase in FID or other quality metrics relative to both vanilla autoregressive decoding and standard speculative decoding.
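The test above can be written down as a small decision rule. This is a sketch under stated assumptions: the function name `vvs_claim_refuted` and the `fid_tolerance` parameter are hypothetical, and "a clear increase in FID" is not quantified in the text, so the tolerance must be chosen by whoever runs the check. Lower FID is better.

```python
def vvs_claim_refuted(passes_vanilla, passes_vvs,
                      fid_vanilla, fid_sd, fid_vvs, fid_tolerance):
    """Return True if the measurements contradict the claim as phrased above.

    Refutation means either no reduction in target forward passes, or an FID
    that exceeds BOTH baselines by more than `fid_tolerance` (an assumed cutoff).
    """
    no_reduction = passes_vvs >= passes_vanilla
    worse_than_both = (fid_vvs - fid_vanilla > fid_tolerance
                       and fid_vvs - fid_sd > fid_tolerance)
    return no_reduction or worse_than_both


# Placeholder inputs for illustration; they are not results from the paper.
example = vvs_claim_refuted(1000, 400, 10.0, 10.2, 10.3, fid_tolerance=1.0)
```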

Figures

Figures reproduced from arXiv: 2511.13587 by Chen Tang, Haotian Dong, Rongwei Lu, Shu-Tao Xia, Ye Li, Zhi Wang.

**Figure 1.** Overview of the VVS framework. VVS explicitly reduces the number of target-model forward passes by bypassing part of the verification stages, thereby cutting inference latency during SD. Dt denotes the draft stage at iteration t; Vt denotes the verification stage at iteration t.

**Figure 2.** Similarity of the drafted candidate token tree. (a) Visual …

**Figure 4.** Mean accept length comparison under feature blending

**Figure 5.** (a) Inference pipeline of our SD framework VVS, which supports partial verification skipping. (b) Token-level feature caching and reuse mechanism. Since the number of tokens accepted at different iterations varies and truncation in Sec. 4.2 is applied, the cached features to be reused could come from multiple steps. Tokens accepted without verification proceed to the target model at the next verification s…

**Figure 6.** Qualitative comparison of results between the acceptance-relax-based SD framework (upper) and ours (lower).

**Figure 7.** Pareto-front comparison: TPF vs. FID. AR denotes …

**Figure 8.** Accept length dynamically changes during the SD pro…

**Figure 9.** Qualitative comparison of results between the acceptance-relax-based SD framework (upper) and ours (lower) on the Lumina …

Original abstract

Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has been proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction in the number of forward passes, limiting its acceleration potential. Motivated by the interchangeability of visual tokens, we explore verification skipping in the SD process for the first time to explicitly cut the number of target model forward passes, thereby reducing inference latency. By analyzing the characteristics of the drafting stage, we observe that verification redundancy and stale feature reusability are key factors to maintain generation quality while improving speed for verification-free steps. Inspired by these two observations, we propose a novel SD framework VVS to accelerate visual AR model via partial verification skipping, which integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm. Our code is available at https://github.com/HyattDD/VVS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VVS shows a practical way to skip some verification steps in speculative decoding for visual AR models and cut target forwards by 2.8x, but the handling of feature reuse drift over long sequences is the part that still needs tighter evidence.

The core claim is that partial verification skipping, built on observations about visual token interchangeability, lets you drop a lot of target model calls without hurting image quality much. They report a 2.8 times reduction relative to plain AR decoding and better speed-quality balance than standard speculative decoding setups. That number is the thing a practitioner would actually care about if it holds up in real pipelines.

Referee Report

2 major / 2 minor

Summary. The paper introduces VVS, a speculative decoding framework for visual autoregressive image generation models. Motivated by observations of verification redundancy and stale feature reusability during the drafting stage, it enables partial verification skipping to reduce target-model forward passes. The framework integrates three modules: a verification-free token selector using dynamic truncation, token-level feature caching and reuse, and fine-grained skipped-step scheduling. The central empirical claim is a 2.8× reduction in target-model forward passes relative to vanilla AR decoding while preserving competitive generation quality and improving the speed-quality trade-off over standard speculative decoding.

Significance. If the quality-preservation results hold under the proposed skipping strategy, the work could meaningfully advance efficient inference for visual AR models by relaxing the rigid draft-then-verify loop of conventional speculative decoding. The empirical grounding in visual-token interchangeability and the open-sourced code are constructive elements that support further exploration of verification-light SD variants.

major comments (2)

[Method (token-level feature caching and reuse)] The central 2.8× forward-pass reduction rests on the assumption that verification redundancy plus stale feature reuse permit skipping without meaningful quality loss. In the method description of token-level feature caching and reuse, no explicit bound, divergence metric, or ablation is provided on hidden-state drift or logit/perplexity shift as a function of consecutive skip length. This is load-bearing for the quality-preservation claim, especially over long AR sequences where small inconsistencies can compound.
[Experiments (main results table)] Table reporting the main speedup and quality results: the 2.8× figure and competitive quality metrics should include error bars or standard deviations across multiple random seeds and at least two distinct datasets to demonstrate robustness against post-hoc choices of the dynamic truncation threshold.
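The drift analysis requested in the first major comment could be computed along the following lines. This is a minimal pure-Python sketch, assuming features and logits are available as plain lists of floats; the helper names and the record layout passed to `drift_by_skip_length` are hypothetical, and a real evaluation would use tensors and guard against numerical underflow in the softmax.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(logits):
    """Numerically shifted softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p_logits, q_logits):
    """KL(P || Q) in nats between the softmax distributions of two logit vectors."""
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def drift_by_skip_length(records):
    """Average drift per consecutive skip length.

    records: iterable of (skip_length, fresh_feature, cached_feature,
                          fresh_logits, stale_logits)  # assumed layout
    Returns {skip_length: (mean cosine similarity, mean KL divergence)}.
    """
    buckets = {}
    for k, f_fresh, f_cached, l_fresh, l_stale in records:
        buckets.setdefault(k, []).append(
            (cosine_similarity(f_fresh, f_cached), kl_divergence(l_fresh, l_stale)))
    return {k: (sum(c for c, _ in v) / len(v), sum(d for _, d in v) / len(v))
            for k, v in sorted(buckets.items())}
```

Plotting the two returned quantities against skip length would show directly whether inconsistencies compound as more consecutive verification steps are skipped.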

minor comments (2)

[Abstract] The abstract states that VVS 'reveals strong potential to reshape the SD paradigm'; this phrasing is stronger than the concrete contribution and could be revised to 'suggests a promising direction for relaxing verification in SD for visual AR models'.
[Figures] Figure captions and legends should explicitly label all compared baselines (vanilla AR, standard SD, VVS variants) and state the exact quality metrics (FID, CLIP score, etc.) used in each panel.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript while preserving its core contributions.

Referee: [Method (token-level feature caching and reuse)] The central 2.8× forward-pass reduction rests on the assumption that verification redundancy plus stale feature reuse permit skipping without meaningful quality loss. In the method description of token-level feature caching and reuse, no explicit bound, divergence metric, or ablation is provided on hidden-state drift or logit/perplexity shift as a function of consecutive skip length. This is load-bearing for the quality-preservation claim, especially over long AR sequences where small inconsistencies can compound.

Authors: We agree that a more explicit characterization of feature drift would strengthen the methodological justification. The current manuscript supports the quality-preservation claim through end-to-end generation metrics and targeted ablations on the overall VVS framework, but does not include a dedicated per-skip-length analysis of hidden-state or logit divergence. In the revision we will add a new figure and accompanying text that reports cosine similarity of cached features, KL divergence on logits, and perplexity shift as functions of consecutive skip length (up to the maximum used in our scheduling). This addition will directly address concerns about compounding effects in long sequences. revision: yes
Referee: [Experiments (main results table)] Table reporting the main speedup and quality results: the 2.8× figure and competitive quality metrics should include error bars or standard deviations across multiple random seeds and at least two distinct datasets to demonstrate robustness against post-hoc choices of the dynamic truncation threshold.

Authors: We will revise the main results table to report means and standard deviations over at least three random seeds for all metrics. For datasets, primary results are reported on the standard ImageNet benchmark used by prior visual AR work; we will add a second dataset (COCO captions) with corresponding speed and quality numbers, either in the main table or as a dedicated row if space is limited. We will also include a short sensitivity plot for the dynamic truncation threshold to demonstrate that the reported 2.8× speedup and quality remain stable across reasonable threshold choices, thereby addressing post-hoc selection concerns. revision: yes
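Reporting means and standard deviations over seeds, as promised in the second response, needs only the standard library. The sketch below is illustrative: the function names are hypothetical and the input values are placeholders, not numbers from the paper.

```python
import statistics


def summarize_over_seeds(values):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    if len(values) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(values), statistics.stdev(values)


def format_cell(values, digits=2):
    """Format one table cell as 'mean ± std'."""
    mean, std = summarize_over_seeds(values)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Placeholder per-seed values for one metric (three seeds).
cell = format_cell([1.0, 2.0, 3.0])
print(cell)
```

`statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the usual choice when only a handful of seeds is available.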

Circularity Check

0 steps flagged

No circularity: empirical observations and engineering modules form an independent proposal

The paper's central claim rests on two stated empirical observations (verification redundancy and stale feature reusability) drawn from analysis of the drafting stage, which then motivate three engineering modules. These observations are presented as direct findings rather than parameters fitted to the target speed-up result. No equations, uniqueness theorems, or self-citations are invoked to force the 2.8× forward-pass reduction; the reduction is reported as a measured outcome on visual AR tasks. The derivation chain is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on a domain assumption about visual-token interchangeability plus a small number of tunable selection and scheduling parameters whose values are not derived from first principles.

free parameters (1)

dynamic truncation threshold
Controls which drafted tokens are treated as verification-free in the selector module.
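One way such a threshold could act is sketched below: walk the drafted tokens in order and stop at the first one whose draft confidence falls below the threshold, treating only the preceding prefix as verification-free. This rule, the function name, and the use of per-token confidences are assumptions for illustration; the page does not state the selector's actual criterion.

```python
def select_verification_free_prefix(confidences, threshold):
    """Return how many leading drafted tokens to accept without verification.

    Hypothetical rule: truncate at the first token whose draft confidence
    is below `threshold`; everything before it is kept.
    """
    kept = 0
    for c in confidences:
        if c < threshold:
            break
        kept += 1
    return kept


# Illustrative confidences; the fourth token is never reached after truncation.
n_kept = select_verification_free_prefix([0.9, 0.8, 0.3, 0.95], threshold=0.5)
```

Under this rule a higher threshold keeps fewer tokens per skipped step, which trades speed for safety and is why a sensitivity analysis over the threshold is informative.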

axioms (1)

domain assumption: Visual tokens are interchangeable enough that selected verification steps can be skipped without quality degradation
Explicitly invoked to justify verification skipping in the drafting stage.

pith-pipeline@v0.9.0 · 5561 in / 1198 out tokens · 37304 ms · 2026-05-17T21:30:47.039739+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

verification redundancy and stale feature reusability are key factors... candidate token sequences exhibit similarity >0.7... similarity between features of adjunct tokens is 0.68

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

VVS reduces the number of target model forward passes by 2.8×... three complementary modules: verification-free token selector with dynamic truncation, token-level feature caching and reuse, fine-grained skipped step scheduling

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 3 internal anchors

[1]

Judge Decoding: Faster Speculative Sampling Requires Going Be- yond Model Alignment

Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Sch ¨onfeld, Ali Thabet, and Jonas K Kohler. Judge Decoding: Faster Speculative Sampling Requires Going Be- yond Model Alignment. InThe Thirteenth International Conference on Learning Representations, 2025. 8

work page 2025
[2]

Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020. 2

work page 1901
[3]

Lee, Deming Chen, and Tri Dao

Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Sim- ple LLM Inference Acceleration Framework with Multiple Decoding Heads. InForty-first International Conference on Machine Learning. arXiv, 2024. 1, 4, 8

work page 2024
[4]

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean- Baptiste Lespiau, Laurent Sifre, and John Jumper. Acceler- ating large language model decoding with speculative sam- pling.arXiv preprint arXiv:2302.01318, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Collaborative decoding makes visual auto-regressive modeling efficient

Zigeng Chen, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Collaborative decoding makes visual auto-regressive modeling efficient. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23334–23344,

work page
[6]

arXiv preprint arXiv:2407.06135

Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal mod- els for interleaved image-text generation.arXiv preprint arXiv:2407.06135, 2024. 1

work page arXiv 2024
[7]

Deepseek-v3 technical report, 2025

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, et al. Deepseek-v3 technical report, 2025. 1

work page 2025
[8]

Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference.arXiv preprint arXiv:2307.02628, 2023

Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference.arXiv preprint arXiv:2307.02628, 2023. 1

work page arXiv 2023
[9]

Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding.arXiv preprint arXiv:2402.02057, 2024

work page arXiv 2024
[10]

Zipar: Parallel Au- toregressive Image Generation through Spatial Locality

Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Parallel Au- toregressive Image Generation through Spatial Locality. In Forty-second International Conference on Machine Learn- ing, 2025. 1

work page 2025
[11]

CLIPScore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Pro- cessing, pages 7514–7528, Online and Punta Cana, Domini- can Republic, 2021. Association for Computational Linguis- tics. 6

work page 2021
[12]

Gans trained by a two time-scale update rule converge to a local nash equilib- rium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilib- rium. InProceedings of the 31st International Conference on Neural Information Processing Systems, page 6629–6640, Red Hook, NY , USA, 2017. Curran Associates Inc. 6

work page 2017
[13]

Yang, Yeonsung Jung, Ji- hun Yun, Souvik Kundu, Sung-Yub Kim, and Eunho Yang

Doohyuk Jang, Sihwan Park, J. Yang, Yeonsung Jung, Ji- hun Yun, Souvik Kundu, Sung-Yub Kim, and Eunho Yang. Lantern: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding. InInternational Conference on Learning Representations. arXiv, 2024. 1, 3, 6, 8

work page 2024
[14]

Improved precision and recall met- ric for assessing generative models

Tuomas Kynk ¨a¨anniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall met- ric for assessing generative models. InNeural Information Processing Systems, 2019. 6

work page 2019
[15]

Fast In- ference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast In- ference from Transformers via Speculative Decoding. InIn- ternational Conference on Machine Learning, pages 19274– 19286, 2022. 3

work page 2022
[16]

Eagle: Speculative Sampling Requires Rethinking Feature Uncertainty

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative Sampling Requires Rethinking Feature Uncertainty. InForty-first International Conference on Ma- chine Learning. arXiv, 2024. 4, 8

work page 2024
[17]

Eagle-2: Faster Inference of Language Models with Dy- namic Draft Trees

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster Inference of Language Models with Dy- namic Draft Trees. InConference on Empirical Methods in Natural Language Processing, pages 7421–7432. arXiv,

work page
[18]

Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration, 2025

Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Ji- ajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, and Wenwu Zhu. Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration, 2025. 8

work page 2025
[19]

Prance: Joint token-optimization and structural channel-pruning for adap- tive vit inference.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025

Ye Li, Chen Tang, Yuan Meng, Jiajun Fan, Zenghao Chai, Xinzhu Ma, Zhi Wang, and Wenwu Zhu. Prance: Joint token-optimization and structural channel-pruning for adap- tive vit inference.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025. 8

work page 2025
[20]

EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test

Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test.arXiv.org, abs/2503.01840, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C

Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll ´ar, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. InEuropean Conference on Computer Vision, 2014. 6

work page 2014
[22]

Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimodal gener- ative pretraining, 2024a

Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Genera- tion with Multimodal Generative Pretraining.arXiv.org, abs/2408.02657, 2024. 1, 2

work page arXiv 2024
[23]

Specinfer: Accel- erating large language model serving with tree-based specu- lative inference and verification

Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accel- erating large language model serving with tree-based specu- lative inference and verification. InProceedings of the 29th ACM International Conference on Architectural Support for Programming ...

work page 2024
[24]

Grouped speculative decoding for autoregressive im- 9 age generation

Junhyuk So, Juncheol Shin, Hyunho Kook, and Eunhyeok Park. Grouped speculative decoding for autoregressive im- 9 age generation. InInternational Conference on Computer Vision, 2025. 1, 6, 8

work page 2025
[25]

Block- wise parallel decoding for deep autoregressive models.Ad- vances in Neural Information Processing Systems, 31, 2018

Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Block- wise parallel decoding for deep autoregressive models.Ad- vances in Neural Information Processing Systems, 31, 2018. 1

work page 2018
[26]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv.org, abs/2406.06525, 2024. 1, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[27]

Spectr: Fast spec- ulative decoding via optimal transport.Advances in Neural Information Processing Systems, 36:30222–30242, 2023

Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast spec- ulative decoding via optimal transport.Advances in Neural Information Processing Systems, 36:30222–30242, 2023. 8

work page 2023
[28]

Mixed-precision neural network quantization via learned layer-wise importance

Chen Tang, Kai Ouyang, Zhi Wang, Yifei Zhu, Wen Ji, Yaowei Wang, and Wenwu Zhu. Mixed-precision neural network quantization via learned layer-wise importance. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI, page 259–275, Berlin, Heidelberg, 2022. Springer-Verlag. 8

work page 2022
[29]

Chameleon: Mixed-modal early-fusion foundation models, 2025

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2025. 1

work page 2025
[30]

Accelerating Auto-regressive Text-to-Image Generation with Training- free Speculative Jacobi Decoding

Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Accelerating Auto-regressive Text-to-Image Generation with Training- free Speculative Jacobi Decoding. InInternational Confer- ence on Learning Representations. arXiv, 2024. 6, 8

work page 2024
[31]

Neural discrete representation learning,

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning,

work page
[32]

Emu3: Next-token prediction is all you need, 2024

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need, 2024. 1

work page 2024
[33]

Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023. 6

work page 2023
[34]

Speculative decoding: Exploiting spec- ulative execution for accelerating seq2seq generation

Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Exploiting spec- ulative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925, Singapore, 2023. Associa- tion for Computational Linguistics. 8

work page 2023
[35]

Unlock- ing efficiency in large language model inference: A com- prehensive survey of speculative decoding

Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlock- ing efficiency in large language model inference: A com- prehensive survey of speculative decoding. InFindings of the Association for Computational Linguistics: ACL 2024, pages 7655–7671, Bangkok, Thailand, 2024. Association for Computational...

work page 2024
[36]

Lumina-mGPT 2.0: Stand-Alone AutoRegressive Im- age Modeling.arXiv.org, abs/2507.17801, 2025

Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Ren- rui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yuting Qiao, and Peng Gao. Lumina-mGPT 2.0: Stand-Alone AutoRegressive Im- age Modeling.arXiv.org, abs/2507.17801...

work page arXiv 2025
[37]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, et al. Qwen3 technical report, 2025. 1

work page 2025
[38]

Vector-quantized image modeling with improved vqgan, 2022

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2022. 2

work page 2022
[39]

Faster Speculative De- coding via Effective Draft Decoder with Pruned Candidate Tree

Huanran Zheng and Xiaoling Wang. Faster Speculative De- coding via Effective Draft Decoder with Pruned Candidate Tree. InAnnual Meeting of the Association for Computa- tional Linguistics, pages 9856–9868, 2025. 1 10 VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping Supplementary Material

work page 2025
[40]

Implement Details We present the pseudocode of the VVS framework in Algo- rithm 2 to further illustrate our design. Algorithm 2VVS with Partial Verification Skipping Require:℘: text prompt;M T : target model;M D: drafter model;L: max length of generated sequence;V last: whether last step was verified Ensure:Generated token sequenceSfor decoding to image 1...

work page
[41]

4, the results demonstrate that after substituting the verification results, both the acceleration performance and generation quality of SD remain highly stable

Supplementary Results of Drafting Stage Analysis In Tab. 4, the results demonstrate that after substituting the verification results, both the acceleration performance and generation quality of SD remain highly stable. Tab. 5 and Tab. 6 further illustrate the impact of leveraging features various staleness for drafting, highlighting their reusability

work page
[42]

Prompts used in Qualitative Experiment • A vast desert landscape under a starry sky, with a single tent illuminated by a warm campfire. Table 4. Verification redundancy experiment.rrepresents the pro- portion of verified results that are replaced for all iterations. We re- place the verified tokens of the target model with the same number of tokens from t...

work page
[43]

Additional Experiments on Generalization We further validated our VVS framework on the Lumina- mGPT model. Fig. 9 offers a visual demonstration of the resulting image quality, using the same prompts as in Sec. 3. We observe that under the same relaxation thresh- oldδ= 0.2, VVS markedly cuts the target model’s forward passes while preserving generation fid...


[1] Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali Thabet, and Jonas K Kohler. Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment. In The Thirteenth International Conference on Learning Representations, 2025.

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[3] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. In Forty-first International Conference on Machine Learning. arXiv, 2024.

[4] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.

[5] Zigeng Chen, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Collaborative decoding makes visual auto-regressive modeling efficient. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23334–23344,

[6] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024.

[7] DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, et al. Deepseek-v3 technical report, 2025.

[8] Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. Skipdecode: Autoregressive skip decoding with batching and caching for efficient llm inference. arXiv preprint arXiv:2307.02628, 2023.

[9] Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang. Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2402.02057, 2024.

[10] Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, and Bohan Zhuang. Zipar: Parallel Autoregressive Image Generation through Spatial Locality. In Forty-second International Conference on Machine Learning, 2025.

[11] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.

[12] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the 31st International Conference on Neural Information Processing Systems, page 6629–6640, Red Hook, NY, USA, 2017. Curran Associates Inc.

[13] Doohyuk Jang, Sihwan Park, J. Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, and Eunho Yang. Lantern: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding. In International Conference on Learning Representations. arXiv, 2024.

[14] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Neural Information Processing Systems, 2019.

[15] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast Inference from Transformers via Speculative Decoding. In International Conference on Machine Learning, pages 19274–19286, 2022.

[16] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative Sampling Requires Rethinking Feature Uncertainty. In Forty-first International Conference on Machine Learning. arXiv, 2024.

[17] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster Inference of Language Models with Dynamic Draft Trees. In Conference on Empirical Methods in Natural Language Processing, pages 7421–7432. arXiv,

[18] Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, and Wenwu Zhu. Sp-vla: A joint model scheduling and token pruning approach for vla model acceleration, 2025.

[19] Ye Li, Chen Tang, Yuan Meng, Jiajun Fan, Zenghao Chai, Xinzhu Ma, Zhi Wang, and Wenwu Zhu. Prance: Joint token-optimization and structural channel-pruning for adaptive vit inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–17, 2025.

[20] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test. arXiv.org, abs/2503.01840, 2025.

[21] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.

[22] Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. arXiv.org, abs/2408.02657, 2024.

[23] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, et al. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming ...

[24] Junhyuk So, Juncheol Shin, Hyunho Kook, and Eunhyeok Park. Grouped speculative decoding for autoregressive image generation. In International Conference on Computer Vision, 2025.

[25] Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.

[26] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv.org, abs/2406.06525, 2024.

[27] Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, and Felix Yu. Spectr: Fast speculative decoding via optimal transport. Advances in Neural Information Processing Systems, 36:30222–30242, 2023.

[28] Chen Tang, Kai Ouyang, Zhi Wang, Yifei Zhu, Wen Ji, Yaowei Wang, and Wenwu Zhu. Mixed-precision neural network quantization via learned layer-wise importance. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XI, page 259–275, Berlin, Heidelberg, 2022. Springer-Verlag.

[29] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2025.

[30] Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding. In International Conference on Learning Representations. arXiv, 2024.

[31] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning,

[32] Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, and Zhongyuan Wang. Emu3: Next-token prediction is all you need, 2024.

[33] Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis, 2023.

[34] Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925, Singapore, 2023. Association for Computational Linguistics.

[35] Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7655–7671, Bangkok, Thailand, 2024. Association for Computational...

[36] Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yuting Qiao, and Peng Gao. Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling. arXiv.org, abs/2507.17801...

[37] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, et al. Qwen3 technical report, 2025.

[38] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2022.
