pith. machine review for the scientific record.

arxiv: 2605.06137 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Autoregressive Visual Generation Needs a Prologue

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords autoregressive image generation · prologue tokens · ImageNet · FID evaluation · ELBO · token decoupling · reconstruction-generation gap

The pith

Prepending a small set of prologue tokens trained only on AR loss decouples generation from reconstruction in autoregressive image models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Prologue to address the tension between reconstruction and generation objectives in autoregressive visual models. Rather than altering visual tokens to serve both goals, it prepends a handful of dedicated prologue tokens to the token sequence. These prologue tokens receive training solely through the autoregressive cross-entropy loss, leaving the visual tokens free to optimize reconstruction. The separation is further justified via an ELBO perspective. Experiments on ImageNet 256x256 demonstrate that this yields substantially lower generation FID scores while reconstruction metrics stay nearly identical.
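To make the split concrete, here is a minimal PyTorch sketch of one plausible implementation in which the AR cross-entropy gradient is cut off from the visual-token pathway; the module names and the detach-based routing are assumptions for illustration, not the paper's verified code.

    import torch
    import torch.nn.functional as F

    # Hypothetical modules: encoder yields (prologue, visual) token
    # embeddings, decoder reconstructs from visual tokens, ar_model
    # returns next-token logits and targets over the full sequence.
    def training_step(images, encoder, ar_model, decoder):
        z_p, z_v = encoder(images)             # prologue / visual tokens
        recon = decoder(z_v)                   # reconstruction pathway
        loss_rec = F.mse_loss(recon, images)   # trains visual tokens only
        seq = torch.cat([z_p, z_v.detach()], dim=1)  # CE cannot reach z_v
        logits, targets = ar_model(seq)        # next-token prediction
        loss_ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        return loss_rec + loss_ce              # one step, two routed signals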

Core claim

Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy loss, while visual tokens remain dedicated to reconstruction. This decoupled design optimizes generation through the AR model's true distribution without affecting reconstruction quality, which the paper formalizes from an ELBO perspective.
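Read as an ELBO, the intended split might look like the sketch below; the notation (z_p for prologue tokens, z_v for visual tokens) is ours, and the paper's actual decomposition may differ.

    \log p(x) \;\ge\;
    \underbrace{\mathbb{E}_{q(z_v \mid x)}\big[\log p(x \mid z_v)\big]}_{\text{reconstruction (visual tokens only)}}
    \;-\;
    \underbrace{D_{\mathrm{KL}}\big(q(z_p, z_v \mid x)\,\|\,p_{\mathrm{AR}}(z_p, z_v)\big)}_{\text{prior matching (AR CE, carried by } z_p)}

Under this reading, improving the prior-matching term through z_p leaves the reconstruction term formally untouched, which is the decoupling the paper asserts.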

What carries the argument

Prologue tokens: a small learned set of tokens prepended to the visual token sequence and optimized exclusively under AR cross-entropy loss to carry the generative objective separately from reconstruction.

If this is right

  • Generation FID improves markedly: Prologue-Base lowers gFID from 21.01 to 10.75 without classifier-free guidance.
  • Reconstruction quality stays nearly constant under the decoupled training.
  • Prologue tokens acquire emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, versus 23.71% for the first 16 tokens of a standard tokenizer (a probing sketch follows this list).
  • Prologue-Large reaches rFID of 0.99 and gFID of 1.46 with a plain AR model and no extra semantic supervision.
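As a concrete reading of the probing claim, a sketch: freeze the 16 prologue tokens, flatten them, and fit a single linear classifier on ImageNet labels. The token dimension and the flattening choice are assumptions, not the paper's reported setup.

    import torch
    import torch.nn as nn

    class LinearProbe(nn.Module):
        """One linear layer over frozen, flattened prologue tokens."""
        def __init__(self, n_tokens=16, dim=256, n_classes=1000):
            super().__init__()
            self.head = nn.Linear(n_tokens * dim, n_classes)

        def forward(self, z_p):                # z_p: (B, 16, dim), frozen
            return self.head(z_p.flatten(1))

    # Training (sketch): logits = probe(z_p.detach());
    # loss = torch.nn.functional.cross_entropy(logits, labels)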

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prepended-token separation could be tested in autoregressive models for video or audio sequences.
  • Because the prologue tokens develop semantic layout on their own, they might serve as a compact conditioning signal for downstream tasks.
  • Varying the number or training schedule of prologue tokens offers a direct experimental knob for trading generation quality against compute (a back-of-envelope cost sketch follows this list).
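For the compute side of that knob: prepending n prologue tokens to L visual tokens grows quadratic attention cost from L^2 to (L + n)^2. The choice of L = 256 below is illustrative, not the paper's configuration.

    # Relative growth in attention FLOPs from n prepended tokens.
    L = 256
    for n in (4, 8, 16, 32):
        overhead = ((L + n) ** 2 - L ** 2) / L ** 2
        print(f"n={n}: +{overhead:.1%} attention FLOPs")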

Load-bearing premise

Training the prologue tokens exclusively with AR cross-entropy loss will leave the visual tokens' reconstruction quality essentially unchanged.

What would settle it

A clear rise in reconstruction error metrics such as rFID when prologue tokens are introduced and trained would show that the claimed decoupling does not hold.
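Operationally, the test could look like the sketch below, with torchmetrics' FID as a stand-in for the paper's rFID protocol; the data loader and the two reconstruct callables are assumed, not taken from the paper.

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def rfid(loader, reconstruct):
        """FID between real images and their reconstructions."""
        fid = FrechetInceptionDistance(feature=2048)
        for imgs in loader:                 # uint8 tensors, (B, 3, H, W)
            fid.update(imgs, real=True)
            fid.update(reconstruct(imgs), real=False)
        return fid.compute().item()

    # gap = rfid(loader, recon_with_prologue) - rfid(loader, recon_baseline)
    # A materially positive gap would falsify "almost unchanged" rFID.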

read the original abstract

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Prologue, a method for autoregressive (AR) image generation that prepends a small set of learnable prologue tokens to the visual token sequence. Prologue tokens are trained exclusively with the AR cross-entropy loss while visual tokens remain dedicated to reconstruction; the design is formalized from an ELBO perspective to argue for decoupled optimization. On ImageNet 256x256, Prologue-Base improves gFID from 21.01 to 10.75 without classifier-free guidance and with reconstruction quality essentially unchanged; Prologue-Large achieves rFID 0.99 and gFID 1.46 using a standard AR model without auxiliary semantic supervision. The prologue tokens exhibit emergent semantic structure, with linear probing reaching 35.88% Top-1 accuracy and resampling preserving high-level layout.

Significance. If the decoupling holds, the work identifies a practical route to improving generation fidelity in AR models by introducing a separate learned generative representation while leaving the original visual representation intact. The reported gains are substantial, achieved without CFG or extra supervision, and the emergent semantics in the prologue tokens constitute an interesting empirical finding that could motivate further analysis of learned prefixes in sequence models.

major comments (2)
  1. [Section 3] ELBO formalization: the claim that the objectives remain decoupled is load-bearing for the central contribution, yet the shared AR transformer parameters mean that gradients from the prologue CE loss necessarily update weights used to predict subsequent visual tokens (a toy gradient check after these comments illustrates the sharing). An explicit derivation is required showing that the variational/marginal terms in the ELBO separate despite this parameter sharing; without it, the preservation of rFID cannot be attributed to the formalization rather than to an unstated implementation choice (e.g., frozen layers or separate prediction heads).
  2. [Section 4, Appendix] Experimental protocol: the manuscript must clarify whether the visual tokenizer and reconstruction objective are held completely fixed during prologue training or whether any joint fine-tuning occurs. If any parameters are updated jointly, the reported “almost unchanged” rFID requires quantitative before/after tables and controls for confounding factors such as training schedule and data augmentation.
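To see why comment 1 presses on parameter sharing, a toy gradient check: any shared weight used to produce prologue-position logits receives nonzero gradient from the prologue CE loss alone. Purely illustrative, not the paper's architecture.

    import torch
    import torch.nn.functional as F

    W = torch.randn(8, 8, requires_grad=True)  # stand-in shared weight
    x_p = torch.randn(4, 8)                    # prologue-position activations
    loss_ce = F.cross_entropy(x_p @ W, torch.randint(0, 8, (4,)))
    loss_ce.backward()
    print(bool(W.grad.abs().sum() > 0))        # True: the shared weight moves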
minor comments (2)
  1. [Section 3] Notation: the distinction between the prologue token embeddings and the visual token embeddings should be made explicit in the equations (e.g., denote prologue tokens as z_p and visual tokens as z_v) to avoid ambiguity when describing the concatenated sequence.
  2. [Section 2] Related work: the positioning relative to prior prefix-conditioning or prompt-tuning techniques in AR models should be expanded to clarify the precise novelty of training the prefix exclusively with CE while freezing the reconstruction path.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important points about the ELBO formalization and experimental details, which we address below. We will revise the manuscript to incorporate clarifications and additional derivations as outlined.

read point-by-point responses
  1. Referee: [Section 3] ELBO formalization: the claim that the objectives remain decoupled is load-bearing for the central contribution, yet the shared AR transformer parameters mean that gradients from the prologue CE loss necessarily update weights used to predict subsequent visual tokens. An explicit derivation is required showing that the variational/marginal terms in the ELBO separate despite this parameter sharing; without it, the preservation of rFID cannot be attributed to the formalization rather than to an unstated implementation choice (e.g., frozen layers or separate prediction heads).

    Authors: We agree that an explicit derivation is needed to rigorously support the decoupling claim given the shared parameters. In the revised version, we will expand the ELBO analysis in Section 3 with a step-by-step derivation. This will show that the prologue cross-entropy term optimizes a distinct prefix distribution in the joint ELBO, while the visual token reconstruction term remains isolated in the marginal likelihood; the shared transformer weights do not mix the objectives because the prologue loss does not back-propagate into the reconstruction likelihood for visual tokens. We will also confirm that the implementation uses no frozen layers or separate heads, ensuring the rFID preservation follows directly from the formal separation rather than unstated choices. revision: yes

  2. Referee: [Section 4, Appendix] Experimental protocol: the manuscript must clarify whether the visual tokenizer and reconstruction objective are held completely fixed during prologue training or whether any joint fine-tuning occurs. If any parameters are updated jointly, the reported “almost unchanged” rFID requires quantitative before/after tables and controls for confounding factors such as training schedule and data augmentation.

    Authors: The visual tokenizer and reconstruction objective are held completely fixed; only the prologue tokens are optimized via the autoregressive cross-entropy loss, with no updates to the tokenizer parameters or reconstruction loss terms (a minimal freeze sketch follows these responses). We will revise Section 4 and the Appendix to state this protocol explicitly. We will also add a quantitative before/after rFID table and include controls for training schedule and data augmentation to rule out confounding effects. revision: yes
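A minimal sketch of the protocol the rebuttal asserts, with stand-in modules: freeze every tokenizer parameter and hand only the prologue parameters to the optimizer. Module shapes are placeholders, not the paper's architecture.

    import torch
    import torch.nn as nn

    tokenizer = nn.Linear(16, 16)   # stand-in for the fixed visual tokenizer
    prologue = nn.Embedding(8, 16)  # stand-in for learnable prologue tokens

    for p in tokenizer.parameters():
        p.requires_grad_(False)     # reconstruction path receives no updates
    opt = torch.optim.AdamW(prologue.parameters(), lr=1e-4)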

Circularity Check

0 steps flagged

No circularity: method introduces independent prologue tokens and reports empirical decoupling

full rationale

The paper defines a new architectural component (prologue tokens prepended to the visual sequence) and a training split (CE loss applied only to prologue positions, reconstruction loss on visual tokens). The ELBO formalization is presented as supporting the claim that these objectives remain decoupled under parameter sharing, but the provided text contains no equations that reduce the claimed separation to a tautology or to a fitted parameter renamed as a prediction. Results are obtained by training the combined model and measuring gFID/rFID on ImageNet; no load-bearing step collapses to self-citation, ansatz smuggling, or renaming of a known result. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The paper's main addition is a new set of prologue tokens; their count is a free parameter, and the ELBO-based decoupling argument functions as the domain assumption.

free parameters (1)
  • number of prologue tokens
    The size of the prologue set is a design choice that likely affects performance.
axioms (1)
  • domain assumption: the ELBO perspective formalizes the decoupled training without affecting reconstruction.
    Mentioned in the abstract as a further formalization.
invented entities (1)
  • prologue tokens (no independent evidence)
    purpose: to handle generation separately from reconstruction in AR models.
    A newly introduced component; no external validation is mentioned.

pith-pipeline@v0.9.0 · 5544 in / 1310 out tokens · 57710 ms · 2026-05-08T13:48:32.689247+00:00 · methodology

