pith. machine review for the scientific record.

arxiv: 2605.06137 · v1 · submitted 2026-05-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Autoregressive Visual Generation Needs a Prologue

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 13:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords autoregressive image generation · prologue tokens · ImageNet · FID evaluation · ELBO · token decoupling · reconstruction-generation gap

The pith

Prepending a small set of prologue tokens trained only on AR loss decouples generation from reconstruction in autoregressive image models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Prologue to address the tension between reconstruction and generation objectives in autoregressive visual models. Rather than altering visual tokens to serve both goals, it prepends a handful of dedicated prologue tokens to the token sequence. These prologue tokens receive training solely through the autoregressive cross-entropy loss, leaving the visual tokens free to optimize reconstruction. The separation is further justified via an ELBO perspective. Experiments on ImageNet 256x256 demonstrate that this yields substantially lower generation FID scores while reconstruction metrics stay nearly identical.
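To make the split concrete, here is a minimal PyTorch sketch of one plausible implementation in which the AR cross-entropy gradient is cut off from the visual-token pathway; the module names and the detach-based routing are assumptions for illustration, not the paper's verified code.

    import torch
    import torch.nn.functional as F

    # Hypothetical modules: encoder yields (prologue, visual) token
    # embeddings, decoder reconstructs from visual tokens, ar_model
    # returns next-token logits and targets over the full sequence.
    def training_step(images, encoder, ar_model, decoder):
        z_p, z_v = encoder(images)             # prologue / visual tokens
        recon = decoder(z_v)                   # reconstruction pathway
        loss_rec = F.mse_loss(recon, images)   # trains visual tokens only
        seq = torch.cat([z_p, z_v.detach()], dim=1)  # CE cannot reach z_v
        logits, targets = ar_model(seq)        # next-token prediction
        loss_ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
        return loss_rec + loss_ce              # one step, two routed signals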

Core claim

Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy loss, while visual tokens remain dedicated to reconstruction. This decoupled design optimizes generation through the AR model's true distribution without affecting reconstruction quality, which the paper formalizes from an ELBO perspective.
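Read as an ELBO, the intended split might look like the sketch below; the notation (z_p for prologue tokens, z_v for visual tokens) is ours, and the paper's actual decomposition may differ.

    \log p(x) \;\ge\;
    \underbrace{\mathbb{E}_{q(z_v \mid x)}\big[\log p(x \mid z_v)\big]}_{\text{reconstruction (visual tokens only)}}
    \;-\;
    \underbrace{D_{\mathrm{KL}}\big(q(z_p, z_v \mid x)\,\|\,p_{\mathrm{AR}}(z_p, z_v)\big)}_{\text{prior matching (AR CE, carried by } z_p)}

Under this reading, improving the prior-matching term through z_p leaves the reconstruction term formally untouched, which is the decoupling the paper asserts.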

What carries the argument

Prologue tokens: a small learned set of tokens prepended to the visual token sequence and optimized exclusively under AR cross-entropy loss to carry the generative objective separately from reconstruction.

If this is right

  • Generation FID improves markedly: Prologue-Base lowers gFID from 21.01 to 10.75 without classifier-free guidance.
  • Reconstruction quality stays nearly constant under the decoupled training.
  • Prologue tokens acquire emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, versus 23.71% for the first 16 tokens of a standard tokenizer (a probing sketch follows this list).
  • Prologue-Large reaches rFID of 0.99 and gFID of 1.46 with a plain AR model and no extra semantic supervision.
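As a concrete reading of the probing claim, a sketch: freeze the 16 prologue tokens, flatten them, and fit a single linear classifier on ImageNet labels. The token dimension and the flattening choice are assumptions, not the paper's reported setup.

    import torch
    import torch.nn as nn

    class LinearProbe(nn.Module):
        """One linear layer over frozen, flattened prologue tokens."""
        def __init__(self, n_tokens=16, dim=256, n_classes=1000):
            super().__init__()
            self.head = nn.Linear(n_tokens * dim, n_classes)

        def forward(self, z_p):                # z_p: (B, 16, dim), frozen
            return self.head(z_p.flatten(1))

    # Training (sketch): logits = probe(z_p.detach());
    # loss = torch.nn.functional.cross_entropy(logits, labels)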

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prepended-token separation could be tested in autoregressive models for video or audio sequences.
  • Because the prologue tokens develop semantic layout on their own, they might serve as a compact conditioning signal for downstream tasks.
  • Varying the number or training schedule of prologue tokens offers a direct experimental knob for trading generation quality against compute (a back-of-envelope cost sketch follows this list).
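For the compute side of that knob: prepending n prologue tokens to L visual tokens grows quadratic attention cost from L^2 to (L + n)^2. The choice of L = 256 below is illustrative, not the paper's configuration.

    # Relative growth in attention FLOPs from n prepended tokens.
    L = 256
    for n in (4, 8, 16, 32):
        overhead = ((L + n) ** 2 - L ** 2) / L ** 2
        print(f"n={n}: +{overhead:.1%} attention FLOPs")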

Load-bearing premise

Training the prologue tokens exclusively with AR cross-entropy loss will leave the visual tokens' reconstruction quality essentially unchanged.

What would settle it

A clear rise in reconstruction error metrics such as rFID when prologue tokens are introduced and trained would show that the claimed decoupling does not hold.
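Operationally, the test could look like the sketch below, with torchmetrics' FID as a stand-in for the paper's rFID protocol; the data loader and the two reconstruct callables are assumed, not taken from the paper.

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def rfid(loader, reconstruct):
        """FID between real images and their reconstructions."""
        fid = FrechetInceptionDistance(feature=2048)
        for imgs in loader:                 # uint8 tensors, (B, 3, H, W)
            fid.update(imgs, real=True)
            fid.update(reconstruct(imgs), real=False)
        return fid.compute().item()

    # gap = rfid(loader, recon_with_prologue) - rfid(loader, recon_baseline)
    # A materially positive gap would falsify "almost unchanged" rFID.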

read the original abstract

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated authors' rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Prologue, a method for autoregressive (AR) image generation that prepends a small set of learnable prologue tokens to the visual token sequence. Prologue tokens are trained exclusively with the AR cross-entropy loss while visual tokens remain dedicated to reconstruction; the design is formalized from an ELBO perspective to argue for decoupled optimization. On ImageNet 256x256, Prologue-Base improves gFID from 21.01 to 10.75 without classifier-free guidance and with reconstruction quality essentially unchanged; Prologue-Large achieves rFID 0.99 and gFID 1.46 using a standard AR model without auxiliary semantic supervision. The prologue tokens exhibit emergent semantic structure, with linear probing reaching 35.88% Top-1 accuracy and resampling preserving high-level layout.

Significance. If the decoupling holds, the work identifies a practical route to improving generation fidelity in AR models by introducing a separate learned generative representation while leaving the original visual representation intact. The reported gains are substantial, achieved without CFG or extra supervision, and the emergent semantics in the prologue tokens constitute an interesting empirical finding that could motivate further analysis of learned prefixes in sequence models.

major comments (2)
  1. [Section 3] ELBO formalization: the claim that the objectives remain decoupled is load-bearing for the central contribution, yet the shared AR transformer parameters mean that gradients from the prologue CE loss necessarily update weights used to predict subsequent visual tokens (a toy gradient check after these comments illustrates the sharing). An explicit derivation is required showing that the variational/marginal terms in the ELBO separate despite this parameter sharing; without it, the preservation of rFID cannot be attributed to the formalization rather than to an unstated implementation choice (e.g., frozen layers or separate prediction heads).
  2. [Section 4, Appendix] Experimental protocol: the manuscript must clarify whether the visual tokenizer and reconstruction objective are held completely fixed during prologue training or whether any joint fine-tuning occurs. If any parameters are updated jointly, the reported “almost unchanged” rFID requires quantitative before/after tables and controls for confounding factors such as training schedule and data augmentation.
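To see why comment 1 presses on parameter sharing, a toy gradient check: any shared weight used to produce prologue-position logits receives nonzero gradient from the prologue CE loss alone. Purely illustrative, not the paper's architecture.

    import torch
    import torch.nn.functional as F

    W = torch.randn(8, 8, requires_grad=True)  # stand-in shared weight
    x_p = torch.randn(4, 8)                    # prologue-position activations
    loss_ce = F.cross_entropy(x_p @ W, torch.randint(0, 8, (4,)))
    loss_ce.backward()
    print(bool(W.grad.abs().sum() > 0))        # True: the shared weight moves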
minor comments (2)
  1. [Section 3] Notation: the distinction between the prologue token embeddings and the visual token embeddings should be made explicit in the equations (e.g., denote prologue tokens as z_p and visual tokens as z_v) to avoid ambiguity when describing the concatenated sequence.
  2. [Section 2] Related work: the positioning relative to prior prefix-conditioning or prompt-tuning techniques in AR models should be expanded to clarify the precise novelty of training the prefix exclusively with CE while freezing the reconstruction path.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important points about the ELBO formalization and experimental details, which we address below. We will revise the manuscript to incorporate clarifications and additional derivations as outlined.

read point-by-point responses
  1. Referee: [Section 3] ELBO formalization: the claim that the objectives remain decoupled is load-bearing for the central contribution, yet the shared AR transformer parameters mean that gradients from the prologue CE loss necessarily update weights used to predict subsequent visual tokens. An explicit derivation is required showing that the variational/marginal terms in the ELBO separate despite this parameter sharing; without it, the preservation of rFID cannot be attributed to the formalization rather than to an unstated implementation choice (e.g., frozen layers or separate prediction heads).

    Authors: We agree that an explicit derivation is needed to rigorously support the decoupling claim given the shared parameters. In the revised version, we will expand the ELBO analysis in Section 3 with a step-by-step derivation. This will show that the prologue cross-entropy term optimizes a distinct prefix distribution in the joint ELBO, while the visual token reconstruction term remains isolated in the marginal likelihood; the shared transformer weights do not mix the objectives because the prologue loss does not back-propagate into the reconstruction likelihood for visual tokens. We will also confirm that the implementation uses no frozen layers or separate heads, ensuring the rFID preservation follows directly from the formal separation rather than unstated choices. revision: yes

  2. Referee: [Section 4, Appendix] Experimental protocol: the manuscript must clarify whether the visual tokenizer and reconstruction objective are held completely fixed during prologue training or whether any joint fine-tuning occurs. If any parameters are updated jointly, the reported “almost unchanged” rFID requires quantitative before/after tables and controls for confounding factors such as training schedule and data augmentation.

    Authors: The visual tokenizer and reconstruction objective are held completely fixed; only the prologue tokens are optimized via the autoregressive cross-entropy loss, with no updates to the tokenizer parameters or reconstruction loss terms (a minimal freeze sketch follows these responses). We will revise Section 4 and the Appendix to state this protocol explicitly. We will also add a quantitative before/after rFID table and include controls for training schedule and data augmentation to rule out confounding effects. revision: yes
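A minimal sketch of the protocol the rebuttal asserts, with stand-in modules: freeze every tokenizer parameter and hand only the prologue parameters to the optimizer. Module shapes are placeholders, not the paper's architecture.

    import torch
    import torch.nn as nn

    tokenizer = nn.Linear(16, 16)   # stand-in for the fixed visual tokenizer
    prologue = nn.Embedding(8, 16)  # stand-in for learnable prologue tokens

    for p in tokenizer.parameters():
        p.requires_grad_(False)     # reconstruction path receives no updates
    opt = torch.optim.AdamW(prologue.parameters(), lr=1e-4)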

Circularity Check

0 steps flagged

No circularity: method introduces independent prologue tokens and reports empirical decoupling

full rationale

The paper defines a new architectural component (prologue tokens prepended to the visual sequence) and a training split (CE loss applied only to prologue positions, reconstruction loss on visual tokens). The ELBO formalization is presented as supporting the claim that these objectives remain decoupled under parameter sharing, but the provided text contains no equations that reduce the claimed separation to a tautology or to a fitted parameter renamed as a prediction. Results are obtained by training the combined model and measuring gFID/rFID on ImageNet; no load-bearing step collapses to self-citation, ansatz smuggling, or renaming of a known result. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The paper's main addition is a new set of prologue tokens; their count is a free parameter, and the ELBO-based decoupling argument functions as the domain assumption.

free parameters (1)
  • number of prologue tokens
    The size of the prologue set is a design choice that likely affects performance.
axioms (1)
  • domain assumption: the ELBO perspective formalizes the decoupled training without affecting reconstruction.
    Mentioned in the abstract as a further formalization.
invented entities (1)
  • prologue tokens (no independent evidence)
    purpose: to handle generation separately from reconstruction in AR models.
    A newly introduced component; no external validation is mentioned.

pith-pipeline@v0.9.0 · 5544 in / 1310 out tokens · 57710 ms · 2026-05-08T13:48:32.689247+00:00 · methodology

