PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation

Bingxuan Dai; Haodong Lei; Hongsong Wang; Pan Zhou

arxiv: 2606.10492 · v1 · pith:5ACJ3YW5new · submitted 2026-06-09 · 💻 cs.CV

Pith reviewed 2026-06-27 13:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords speculative decoding · autoregressive text-to-image · inference acceleration · Jacobi decoding · parallel draft paths · relaxed verification · token acceptance

The pith

Parallel-path draft trees and cross-sequence relaxed verification accelerate autoregressive text-to-image generation by roughly four times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PathSpec to address slow inference in autoregressive text-to-image models caused by long token sequences. It replaces single-chain draft sequences with a multi-sequence draft tree structure called PathExplore that widens the token search space, then adds PathRelax to perform cross-path verification that accepts tokens by exploiting semantic similarities between parallel drafts. Experiments on Parti-Prompts, MSCOCO2017, and T2ICompBench report speedups of 4.14x, 3.95x, and 4.18x respectively while image quality remains comparable to the unaccelerated baseline. PathExplore alone beats certain prior relaxed-sampling accelerators, and PathRelax integrates additively with other relaxation methods. A reader would care because these gains target the core bottleneck of sequential token prediction in high-resolution generation and point toward practical real-time use.

Core claim

PathSpec replaces chain-structured drafts in speculative Jacobi decoding with a parallel-path tree (PathExplore) that expands the candidate space and cross-path relaxed verification (PathRelax) that accepts tokens across sequences when semantic similarity holds, producing the measured speedups on the three benchmarks without reported quality loss.

What carries the argument

PathSpec framework built from PathExplore's multi-sequence draft tree that widens token search and PathRelax's cross-path relaxed verification that exploits semantic similarities to raise acceptance rates.
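The difference between a single chain and several parallel paths can be sketched with a toy example. This is a minimal illustration, not the paper's algorithm: it assumes exact-match (greedy) verification, and it uses one fixed `target` sequence as a stand-in for the target model's outputs, whereas in real tree verification the prediction at each node depends on that node's prefix. The function names and token ids are hypothetical.

```python
from typing import Sequence


def accepted_prefix_len(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target; stop at the first miss."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def best_path(paths: Sequence[Sequence[int]], target: Sequence[int]) -> tuple[int, int]:
    """Return (index, accepted length) of the path with the longest accepted prefix."""
    lengths = [accepted_prefix_len(p, target) for p in paths]
    idx = max(range(len(paths)), key=lengths.__getitem__)
    return idx, lengths[idx]


# Hypothetical token ids, chosen only for illustration.
target = [7, 3, 9, 4]
chain = [7, 5, 9, 4]  # the miss at position 1 discards the later matches
paths = [chain, [7, 3, 2, 4], [1, 3, 9, 4]]

print(accepted_prefix_len(chain, target))  # 1
print(best_path(paths, target))            # (1, 2)
```

In the chain case, the two correct tokens after the early miss are wasted; with extra paths, a different candidate can carry the accepted prefix further in the same step.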

If this is right

Multi-sequence trees raise token acceptance length per step compared with single-chain drafts.
PathExplore without relaxation already exceeds the speedup of some prior relaxed methods such as GSD and LANTERN.
PathRelax combines with existing relaxation techniques to produce additive further gains.
The resulting acceleration moves autoregressive text-to-image generation closer to real-time use on the evaluated datasets.
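The first point in the list above can be made concrete with a back-of-the-envelope model. The sketch below assumes, purely for illustration, that every draft token is accepted independently with the same probability `p` and that the draft paths are independent of each other; real draft trees share prefixes and have position-dependent acceptance, so the numbers are indicative only. The names `p`, `depth`, and `num_paths` are hypothetical and not taken from the paper.

```python
def expected_accept_len(p: float, depth: int, num_paths: int = 1) -> float:
    """Expected longest accepted prefix over independent draft paths.

    Assumption: each draft token is accepted independently with probability p,
    so one path survives to depth k with probability p**k, and at least one of
    num_paths independent paths survives with probability 1 - (1 - p**k)**num_paths.
    The expectation is the sum of these survival probabilities over k.
    """
    return sum(1.0 - (1.0 - p ** k) ** num_paths for k in range(1, depth + 1))


print(round(expected_accept_len(0.5, 4, num_paths=1), 4))  # 0.9375
print(round(expected_accept_len(0.5, 4, num_paths=4), 4))  # 2.2624
```

Under these assumptions, adding paths raises the expected accepted length per step, but with diminishing returns, which is consistent with the need to balance path count against the memory and forward-pass cost of a wider tree.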

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same parallel-tree plus cross-verification pattern could be tested on autoregressive models for video or audio to check whether semantic similarity across paths transfers.
If the acceptance-rate benefit scales with sequence length, the method would become increasingly valuable for higher-resolution images that require even longer token strings.
Hardware implementations that execute the parallel drafts concurrently might multiply the reported software speedups.

Load-bearing premise

Semantic similarities across parallel draft sequences can be exploited via cross-path relaxed verification to increase token acceptance rates without degrading final image quality.

What would settle it

Run PathRelax on the Parti-Prompts dataset and measure both wall-clock speedup and image quality metrics such as FID against the autoregressive baseline; the claim fails if the speedup drops below 3x or quality metrics degrade measurably.
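The test above can be written down as a small check. This is a sketch under stated assumptions: the 3x speedup floor comes from the text, but the FID tolerance `fid_tolerance` is a hypothetical placeholder, since the page does not say what counts as measurable degradation, and the numbers in the usage lines are invented inputs, not results from the paper.

```python
def claim_survives(
    baseline_s: float,
    accelerated_s: float,
    fid_baseline: float,
    fid_accelerated: float,
    min_speedup: float = 3.0,    # floor stated in the text
    fid_tolerance: float = 0.5,  # assumption: allowed FID increase (lower FID is better)
) -> bool:
    """Return True if the speedup and quality conditions both hold."""
    speedup = baseline_s / accelerated_s
    quality_ok = (fid_accelerated - fid_baseline) <= fid_tolerance
    return speedup >= min_speedup and quality_ok


# Hypothetical measurements, for illustration only.
print(claim_survives(100.0, 24.0, 30.0, 30.2))  # True
print(claim_survives(100.0, 40.0, 30.0, 30.0))  # False
```

A stricter version would compare quality with a statistical test over repeated runs rather than a fixed tolerance.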

Figures

Figures reproduced from arXiv: 2606.10492 by Bingxuan Dai, Haodong Lei, Hongsong Wang, Pan Zhou.

**Figure 1.** Time cost comparison of PathSpec and other methods. PathSpec achieves at least a 20% reduction in image generation time across various models.

**Figure 2.** Overview of the proposed PathSpec. Part A illustrates the target model's decoding process over the tree-structured draft tokens. Part B illustrates the Cross-Path Relaxed Verification process applied to the Candidate Token Tree and the Draft Token Tree. At initialization, draft tokens are randomly sampled from the discrete codebook to populate the tree structur…

**Figure 3.** Comparison of the cost (time and memory) of PathExplore for the Lumina-GPT-7B model. (a) Memory usage and single-forward-pass time across different W values. (b) Memory usage and single-forward-pass time across different |T| values.

**Figure 4.** Comparison of acceleration under different hyperparameters. (a) τ and SR as functions of the maximum number of nodes per level B. (b) τ and SR as functions of the depth W. (c) IS and SR as functions of the number of draft token sequences λ. The model used for testing is Lumina-GPT-7B-768 [26].

**Figure 5.** Visualization of image quality. For the prompt "an eagle", generated images across different settings with λ = 0.01 are provided.

**Figure 6.** Comparison of acceleration and image quality under different hyperparameters. (a) CLIP-score [13] and SR as functions of the number of draft token sequences λ. (b) HPSv2 [45] and SR as functions of the number of draft token sequences λ. The model used for testing is Lumina-GPT-7B-768 [26].

**Figure 7.** Visualization of image quality. The model used in this experiment is Emu3 [7]. Below each generated image, the time required to generate it is displayed (excluding the decoder's decoding time). On the far right of the images, the number of iteration steps of the autoregressive (AR) transformer model is shown.

**Figure 8.** Differences between image-token and text-token generation, showing that image tokens at the same position share similar probabilities and leave some freedom in sampling. (a) The probability distribution of text token generation at a certain moment. (b) The probability distribution of image token generation at a certain moment. By comparing (a) and (b), it can be observed that the generation distributio…

Original abstract

The growing need for high-resolution image generation in autoregressive text-to-image models has resulted in extended token sequences, significantly increasing computational costs and inference times. However, existing state-of-the-art methods for accelerating autoregressive text-to-image models rely on chain-structured draft token sequences, leading to inefficient draft token search and limited acceptance lengths. To address this, we propose parallel-path cross-relaxed speculative Jacobi decoding (PathSpec), a novel framework that enhances efficiency through a multi-sequence draft tree structure. Our parallel-path speculative Jacobi decoding (PathExplore) expands the token search space, achieving a higher speedup ratio without sacrificing image quality. Additionally, we introduce cross-path relaxed verification (PathRelax) that exploits semantic similarities across sequences to further boost token acceptance rates. Evaluated on the Parti-Prompts, MSCOCO2017, and T2ICompBench datasets, our method achieves a speedup ratio of 4.14×, 3.95×, and 4.18×, respectively. Remarkably, PathExplore, without any relaxed sampling, outperforms relaxed sampling methods in the speedup ratio, such as GSD and LANTERN. Moreover, PathRelax's relaxation mechanism can be seamlessly integrated with other relaxation techniques, enabling further acceleration and providing an efficient solution for real-time text-to-image generation. Our code is available at https://github.com/Haodong-Lei-Ray/PathSpec.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PathSpec's parallel draft tree plus cross-path relaxation is a concrete step beyond chain speculative decoding for T2I, but the quality-preservation claim rests on thin evidence in the abstract.

The main point is that this paper takes speculative Jacobi decoding for autoregressive text-to-image models and replaces the usual single chain of drafts with a tree of parallel paths, then adds a relaxed verification step that accepts tokens based on semantic similarity across paths instead of exact matches. They report speedups of roughly 4x on Parti-Prompts, MSCOCO2017, and T2ICompBench while claiming image quality stays the same, and they note that their non-relaxed PathExplore already beats some earlier relaxed methods.

What is actually new is the multi-sequence draft tree structure combined with the cross-path relaxation rule. The abstract positions this against standard chain-based speculative decoding and shows that PathRelax can be stacked on top of other relaxation techniques. Releasing code on GitHub is a plus for anyone who wants to check the implementation.

The work is straightforward on the engineering side: longer token sequences in high-resolution generation are a real bottleneck, and expanding the draft search space while relaxing verification is a logical response. PathExplore alone outperforming GSD and LANTERN is worth noting if the numbers hold.

The soft spot is the evidence for quality preservation. The abstract asserts no sacrifice in image quality but gives no FID, CLIP scores, human preference data, or ablations that isolate what the cross-path relaxation does to the output distribution. Possible distribution shift from the relaxed acceptance criterion is the part that needs checking in the full paper; without those controls the central claim is hard to evaluate. A minor issue is the lack of error bars or statistical detail in the summary.

This is for people working on inference speed for autoregressive vision models. A reader who follows speculative decoding literature will see a clear incremental idea with testable claims. The paper shows enough concrete thinking and engagement with prior methods to merit referee time, even if the current write-up leaves the quality results under-specified.

Referee Report

2 major / 1 minor

Summary. The paper introduces PathSpec, a framework for accelerating autoregressive text-to-image generation via parallel-path speculative Jacobi decoding. It consists of PathExplore, which uses a multi-sequence draft tree to expand the token search space, and PathRelax, which applies cross-path relaxed verification by exploiting semantic similarities across draft sequences to increase acceptance rates. The method is claimed to achieve speedups of 4.14× on Parti-Prompts, 3.95× on MSCOCO2017, and 4.18× on T2ICompBench without sacrificing image quality, with PathExplore outperforming prior relaxed methods such as GSD and LANTERN; PathRelax is also presented as integrable with other relaxation techniques. Code is released at a GitHub repository.

Significance. If the empirical claims hold with rigorous validation, the work could meaningfully advance real-time high-resolution text-to-image generation by demonstrating that parallel draft paths combined with cross-sequence relaxation can deliver substantial speedups while preserving output quality. The open-source code is a positive factor for reproducibility and extension.

major comments (2)

[Abstract / Experimental claims] The central claim that PathRelax preserves image quality (i.e., does not induce distribution shift in accepted tokens) is load-bearing for the reported 4× speedups, yet the abstract supplies no quantitative fidelity metrics (FID, CLIP score, human preference) or error bars on any of the three datasets, nor any ablation isolating the effect of the cross-path relaxation.
[Method (PathRelax description)] No formal definition or analysis is given for the relaxed verification criterion in PathRelax (how semantic similarity is quantified and thresholded across paths) or proof that it leaves the marginal distribution of generated tokens unchanged relative to strict Jacobi matching; this is required to substantiate the 'without sacrificing image quality' assertion.

minor comments (1)

[Abstract] The abstract states that PathExplore 'without any relaxed sampling' outperforms relaxed methods, but does not define what 'relaxed sampling' refers to in this context or how it differs from the PathRelax mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the experimental evidence and methodological details while committing to revisions where appropriate.

Referee: [Abstract / Experimental claims] The central claim that PathRelax preserves image quality (i.e., does not induce distribution shift in accepted tokens) is load-bearing for the reported 4× speedups, yet the abstract supplies no quantitative fidelity metrics (FID, CLIP score, human preference) or error bars on any of the three datasets, nor any ablation isolating the effect of the cross-path relaxation.

Authors: We agree the abstract would benefit from explicit metrics. The full manuscript (Section 4.3, Tables 2-3, and Figure 5) reports FID, CLIP scores, and human preference results with error bars across all three datasets, confirming no statistically significant quality degradation relative to the baseline. We will revise the abstract to include these quantitative results and add a dedicated ablation isolating the cross-path relaxation contribution.

Revision: yes

Referee: [Method (PathRelax description)] No formal definition or analysis is given for the relaxed verification criterion in PathRelax (how semantic similarity is quantified and thresholded across paths) or proof that it leaves the marginal distribution of generated tokens unchanged relative to strict Jacobi matching; this is required to substantiate the 'without sacrificing image quality' assertion.

Authors: Section 3.2 defines the criterion via cosine similarity of CLIP embeddings with a per-path adaptive threshold derived from sequence divergence. While we provide no formal proof of marginal distribution invariance (the approach is heuristic), the empirical results demonstrate equivalent quality metrics. We will expand Section 3 with a precise mathematical formulation, pseudocode, and additional distribution-shift analysis in the revision.

Revision: partial

Circularity Check

0 steps flagged

No circularity; empirical speedups rest on measured token acceptance rates, not definitions or self-citations

The paper introduces PathSpec with PathExplore (parallel draft tree) and PathRelax (cross-path relaxed verification) for speculative Jacobi decoding in autoregressive T2I models. Speedup ratios (4.14× etc.) are reported as direct empirical outcomes on Parti-Prompts, MSCOCO2017 and T2ICompBench; the abstract and description contain no equations, fitted parameters renamed as predictions, or load-bearing self-citations. The central claim that relaxation increases acceptance without quality loss is presented as an experimental result rather than a derivation that reduces to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are explicitly mentioned or identifiable.

Reference graph

Works this paper leans on

53 extracted references · 2 canonical work pages · 2 internal anchors

[1] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. Computer Science, 2(3):8, 2023.
[2] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In International Conference on Machine Learning. JMLR.org, 2024.
[3] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE/CVF International Conference on Computer Vision, pages 9630–9640, 2021.
[4] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023.
[5] Junsong Chen, Jincheng YU, Chongjian GE, Lewei Yao, Enze Xie, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International Conference on Learning Representations, 2024.
[6] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation, 2024.
[7] Euntae Choi, Vincent Carpentier, Seunghun Shin, and Sungjoo Yoo. Baking relightable nerf for real-time direct/indirect illumination rendering, 2024.
[8] Valentin De Bortoli, Alexandre Galashov, Arthur Gretton, and Arnaud Doucet. Accelerated diffusion models via speculative sampling. In Proceedings of International Conference on Machine Learning. JMLR.org, 2025.
[9] Yijie Ding, Jiacheng Li, Julian McAuley, and Yupeng Hou. Inductive generative recommendation via retrieval-based speculation. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 14675–14683, 2026.
[10] Haotian Dong, Ye Li, Rongwei Lu, Chen Tang, Shu-Tao Xia, and Zhi Wang. Vvs: Accelerating speculative decoding for visual autoregressive generation via partial verification skipping, 2026.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[12] Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, and Christopher Lott. On speculative decoding for multimodal large language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 8285–8289, 2024.
[13] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: a reference-free evaluation metric for image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
[14] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Proceedings of the International Conference on Neural Information Processing Systems, pages 6629–6640. Curran Associates Inc., 2017.
[15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020. Curran Associates Inc.
[16] Haiduo Huang, Fuwei Yang, Zhenhua Liu, Xuanwu Yin, Dong Li, Pengju Ren, and Emad Barsoum. Specvlm: Fast speculative decoding in vision-language models, 2025.
[17] Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench++: An enhanced and comprehensive benchmark for compositional text-to-image generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3563–3579, 2025.
[18] Mingxiao Huo, Jiayi Zhang, Hewei Wang, Jinfeng Xu, Zheyu Chen, Huilin Tai, and Yijun Chen. Spec-llava: Accelerating vision-language models with dynamic tree-based speculative decoding, 2025.
[19] Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sungyub Kim, and Eunho Yang. Lantern: Accelerating visual autoregressive models with relaxed speculative decoding. In International Conference on Learning Representations, 2025.
[20] Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, and Xinghao Chen. Sjd-pac: Accelerating speculative jacobi decoding via proactive drafting and adaptive continuation, 2026.
[21] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.
[22] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[23] Xingyao Li, Fengzhuo Zhang, Cunxiao Du, and Hui Ji. Annealed relaxation of speculative decoding for faster autoregressive image generation, 2026.
[24] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Conference on Empirical Methods in Natural Language Processing, pages 7421–7432, Miami, Florida, USA, 2024. Association for Computational Linguistics.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
[26] Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining, 2025.
[27] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. In International Conference on Learning Representations, 2022.
[28] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the ACM I..., 2024.
[29] Sihwan Park, Doohyuk Jang, Sung-Yub Kim, Souvik Kundu, and Eunho Yang. LANTERN++: Enhanced relaxed speculative decoding with static tree drafting for visual auto-regressive models. In Workshop on Scalable Optimization for Efficient and Adaptive Foundation Models, 2025.
[30] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, Los Alamitos, CA, USA, 2022. IEEE Computer Society.
[31] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, pages 36479–..., 2022.
[32] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Lit, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the International Co..., 2022.
[33] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242. Curran Associates Inc., 2016.
[34] Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, and Emanuele Rodola. Accelerating transformer inference for translation via parallel decoding. In Annual Meeting Of The Association For Computational Linguistics, pages 12336–12355, 2023.
[35] Junhyuk So, Juncheol Shin, Hyunho Kook, and Eunhyeok Park. Grouped speculative decoding for autoregressive image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15375–15384, 2025.
[36] Junhyuk So, Hyunho Kook, Chaeyeon Jang, and Eunhyeok Park. Speculative coupled decoding for training-free lossless acceleration of autoregressive visual generation, 2026.
[37] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation, 2024.
[38] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models, 2025.
[39] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin..., 2024.
[40] Yao Teng, Zhihuan Jiang, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Sjd++: Improved speculative jacobi decoding for training-free acceleration of discrete auto-regressive text-to-image generation,
[41]

Accelerating auto- regressive text-to-image generation with training-free specu- lative jacobi decoding

Yao Teng, Han Shi, Xian Liu, Xuefei Ning, Guohao Dai, Yu Wang, Zhenguo Li, and Xihui Liu. Accelerating auto- regressive text-to-image generation with training-free specu- lative jacobi decoding. InInternational Conference on Learn- ing Representations, 2025. 2, 5, 6, 7, 8

2025
[42]

Speculative jacobi-denoising decoding for accelerating autoregressive text-to-image generation, 2025

Yao Teng, Fuyun Wang, Xian Liu, Zhekai Chen, Han Shi, Yu Wang, Zhenguo Li, Weiyang Liu, Difan Zou, and Xihui Liu. Speculative jacobi-denoising decoding for accelerating autoregressive text-to-image generation, 2025. 3

2025
[43]

Neural discrete representation learning

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. InProceedings of International Conference on Neural Information Processing Systems, page 6309–6318, Red Hook, NY , USA, 2017. Curran Associates Inc. 2

2017
[44]

Specprune-vla: Accelerating vision-language- action models via action-aware self-speculative pruning, 2025

Hanzhen Wang, Jiaming Xu, Jiayi Pan, Yongkang Zhou, and Guohao Dai. Specprune-vla: Accelerating vision-language- action models via action-aware self-speculative pruning, 2025. 2

2025
[45]

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis

Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341,

[46]

Lumina-mgpt 2.0: Stand-alone autoregressive image modeling, 2025

Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yu Qiao, and Peng Gao. Lumina-mgpt 2.0: Stand-alone autoregressive image modeling, 2025.

2025
[47]

Scaling autoregressive models for content-rich text-to-image generation, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022.

2022
[48]

Learning harmonized representations for speculative sampling, 2024

Lefan Zhang, Xiaodan Wang, Yanhua Huang, and Ruiwen Xu. Learning harmonized representations for speculative sampling, 2024.

2024
[49]

Lookahead: An inference acceleration framework for large language model with lossless generation accuracy

Yao Zhao, Zhitian Xie, Chen Liang, Chenyi Zhuang, and Jinjie Gu. Lookahead: An inference acceleration framework for large language model with lossless generation accuracy. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 6344–6355. Association for Computing Machinery, 2024.

2024
[50]

Simple multi-dataset detection

Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Simple multi-dataset detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7571–7580, 2022.

2022

A. Proof of the Lossless Guarantees of PathExplore

Theorem 1. The token sequence accepted by the Parallel-Path Speculative Jacobi Decoding (PathExplore) sat...
Acceptance Probability. According to the PathExplore method (and standard speculative decoding), a candidate token $x$ from the draft tree is accepted with probability

$$p(r \text{ is true} \mid x, J^{(j)}, J^{(j-1)}) = \min\left\{1, \frac{p(x \mid J^{(j)})}{p(x \mid J^{(j-1)})}\right\}, \tag{9}$$

where $r$ is the boolean variable representing acceptance. The joint probability of a token $x$ being sampled by the draft strategy...
Rejection and Resampling Probability. If the token is rejected, we must account for the probability mass that was not covered by the acceptance step. The probability of rejection for the draft distribution is

$$p(r \text{ is false} \mid J^{(j)}, J^{(j-1)}) = 1 - \sum_{x'} p(r \text{ is true}, x' \mid J^{(j)}, J^{(j-1)}) = \sum_{x'} p(x' \mid J^{(j)}) - \sum_{x'} \min\{p(x' \mid J^{(j)}), p(x' \mid J^{(j-1)})\} = \sum_{x'} \max\{0, p(x' \mid J^{(j)} \dots$$
Total Probability. We verify that the sum of probabilities from both cases recovers the target distribution $p(x \mid J^{(j)})$. Using the identity $a = \min(a, b) + \max(0, a - b)$, we have

$$p(x \mid J^{(j)}) = \min\{p(x \mid J^{(j)}), p(x \mid J^{(j-1)})\} + \max\{0, p(x \mid J^{(j)}) - p(x \mid J^{(j-1)})\} = p(r \text{ is true}, x \mid J^{(j)}, J^{(j-1)}) + p(r \text{ is false}, x \mid J^{(j)}, J^{(j-1)}). \tag{14}$$

According to Eq. 14, the conditional dist...
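
The identity in Eq. 14 can be checked numerically. The following is a minimal Python sketch, assuming NumPy and two hypothetical categorical distributions: `p_target` stands in for $p(x \mid J^{(j)})$ and `q_draft` for $p(x \mid J^{(j-1)})$. The function names are illustrative and do not come from the paper. The residual distribution is the normalized $\max\{0, p - q\}$ of standard speculative sampling, which is assumed here because the resampling step is truncated in this text.

```python
import numpy as np


def accept_prob(p_target, q_draft, x):
    """Acceptance probability of Eq. 9: min(1, p(x|J^(j)) / p(x|J^(j-1)))."""
    return min(1.0, p_target[x] / q_draft[x])


def residual_distribution(p_target, q_draft):
    """Normalized max(0, p - q): assumed resampling distribution after a rejection."""
    residual = np.maximum(0.0, p_target - q_draft)
    return residual / residual.sum()


def output_distribution(p_target, q_draft):
    """Exact distribution of the emitted token (accepted mass + resampled mass)."""
    # Draft x ~ q and accept with min(1, p/q): joint mass is min(p(x), q(x)).
    accepted = np.minimum(p_target, q_draft)
    reject_mass = 1.0 - accepted.sum()
    if reject_mass <= 1e-12:
        # p and q coincide, so nothing is ever rejected.
        return accepted
    # On rejection, resample from the residual distribution.
    return accepted + reject_mass * residual_distribution(p_target, q_draft)


if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.2])  # hypothetical target distribution
    q = np.array([0.2, 0.5, 0.3])  # hypothetical draft distribution
    print(output_distribution(p, q))
```

For any pair of distributions over the same vocabulary, `output_distribution` should match `p_target` up to floating-point error, which is the content of the lossless guarantee for a single token.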
