Autoregressive Visual Generation Needs a Prologue
Pith reviewed 2026-05-08 13:48 UTC · model grok-4.3
The pith
Prepending a small set of prologue tokens trained only on AR loss decouples generation from reconstruction in autoregressive image models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy loss, while visual tokens remain dedicated to reconstruction. This decoupled design optimizes generation through the AR model's true distribution without affecting reconstruction quality, which the paper formalizes from an ELBO perspective.
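The decoupled objective described above can be sketched in a few lines of PyTorch. Everything here (the function name, tensor shapes, and the plain MSE stand-in for the reconstruction term) is illustrative, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def decoupled_losses(logits, targets, n_prologue, recon, images):
    """Sketch of the decoupled objective.

    logits  : (B, T, V) AR predictions over the full sequence
              (prologue tokens followed by visual tokens).
    targets : (B, T) token ids for the same sequence.
    The AR cross-entropy is applied ONLY to the prologue positions;
    the visual tokens are trained through reconstruction instead.
    """
    B, T, V = logits.shape
    # CE restricted to the first n_prologue positions (the prologue).
    ar_loss = F.cross_entropy(
        logits[:, :n_prologue].reshape(-1, V),
        targets[:, :n_prologue].reshape(-1),
    )
    # Visual tokens keep their usual reconstruction objective
    # (MSE here as a placeholder for the paper's actual loss).
    recon_loss = F.mse_loss(recon, images)
    return ar_loss, recon_loss
```

The key point is simply that the cross-entropy term is masked to the prologue positions, so its gradient never enters the reconstruction path for the visual tokens.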
What carries the argument
Prologue tokens: a small learned set of tokens prepended to the visual token sequence and optimized exclusively under AR cross-entropy loss to carry the generative objective separately from reconstruction.
If this is right
- Generation FID improves markedly: Prologue-Base lowers gFID from 21.01 to 10.75 without classifier-free guidance.
- Reconstruction quality stays nearly constant under the decoupled training.
- Prologue tokens acquire emergent semantic structure, shown by linear probing accuracy rising to 35.88% Top-1 (vs. 23.71% for the first 16 tokens of a standard tokenizer).
- Prologue-Large reaches rFID of 0.99 and gFID of 1.46 with a plain AR model and no extra semantic supervision.
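The linear-probing claim follows a standard protocol: freeze the tokenizer, take the 16 prologue-token embeddings per image, and fit a linear classifier on class labels. A toy sketch of that protocol with synthetic features (the real experiment uses ImageNet labels and actual prologue embeddings; the numbers below carry no meaning):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for "16 token embeddings per image".
rng = np.random.default_rng(0)
n, n_tokens, dim, n_classes = 600, 16, 8, 5
labels = rng.integers(0, n_classes, n)
# Make features weakly class-dependent so the probe has signal.
feats = rng.normal(size=(n, n_tokens, dim)) + labels[:, None, None] * 0.3

# Concatenate the token embeddings into one feature vector per image.
X = feats.reshape(n, -1)

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)  # Top-1 probe accuracy
```

The probe itself carries no gradients back into the model; it only measures how linearly decodable class information already is in the frozen embeddings.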
Where Pith is reading between the lines
- The same prepended-token separation could be tested in autoregressive models for video or audio sequences.
- Because the prologue tokens develop semantic layout on their own, they might serve as a compact conditioning signal for downstream tasks.
- Varying the number or training schedule of prologue tokens offers a direct experimental knob for trading generation quality against compute.
Load-bearing premise
Training the prologue tokens exclusively with AR cross-entropy loss will leave the visual tokens' reconstruction quality essentially unchanged.
What would settle it
A clear rise in reconstruction error metrics such as rFID when prologue tokens are introduced and trained would show that the claimed decoupling does not hold.
Original abstract
In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Prologue, a method for autoregressive (AR) image generation that prepends a small set of learnable prologue tokens to the visual token sequence. Prologue tokens are trained exclusively with the AR cross-entropy loss while visual tokens remain dedicated to reconstruction; the design is formalized from an ELBO perspective to argue for decoupled optimization. On ImageNet 256x256, Prologue-Base improves gFID from 21.01 to 10.75 without classifier-free guidance and with reconstruction quality essentially unchanged; Prologue-Large achieves rFID 0.99 and gFID 1.46 using a standard AR model without auxiliary semantic supervision. The prologue tokens exhibit emergent semantic structure, with linear probing reaching 35.88% Top-1 accuracy and resampling preserving high-level layout.
Significance. If the decoupling holds, the work identifies a practical route to improving generation fidelity in AR models by introducing a separate learned generative representation while leaving the original visual representation intact. The reported gains are substantial, achieved without CFG or extra supervision, and the emergent semantics in the prologue tokens constitute an interesting empirical finding that could motivate further analysis of learned prefixes in sequence models.
Major comments (2)
- [Section 3] ELBO formalization (Section 3): the claim that the objectives remain decoupled is load-bearing for the central contribution, yet the shared AR transformer parameters mean that gradients from the prologue CE loss necessarily update weights used to predict subsequent visual tokens. An explicit derivation is required showing that the variational/marginal terms in the ELBO separate despite this parameter sharing; without it, the preservation of rFID cannot be attributed to the formalization rather than an unstated implementation choice (e.g., frozen layers or separate prediction heads).
- [Section 4] Experimental protocol (Section 4 and Appendix): the manuscript must clarify whether the visual tokenizer and reconstruction objective are held completely fixed during prologue training or whether any joint fine-tuning occurs. If any parameters are updated jointly, the reported “almost unchanged” rFID requires quantitative before/after tables and controls for confounding factors such as training schedule or data augmentation.
Minor comments (2)
- [Section 3] Notation: the distinction between the prologue token embeddings and the visual token embeddings should be made explicit in the equations (e.g., denote prologue tokens as z_p and visual tokens as z_v) to avoid ambiguity when describing the concatenated sequence.
- [Section 2] Related work: the positioning relative to prior prefix-conditioning or prompt-tuning techniques in AR models should be expanded to clarify the precise novelty of training the prefix exclusively with CE while freezing the reconstruction path.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The comments highlight important points about the ELBO formalization and experimental details, which we address below. We will revise the manuscript to incorporate clarifications and additional derivations as outlined.
Point-by-point responses
Referee: [Section 3] ELBO formalization (Section 3): the claim that the objectives remain decoupled is load-bearing for the central contribution, yet the shared AR transformer parameters mean that gradients from the prologue CE loss necessarily update weights used to predict subsequent visual tokens. An explicit derivation is required showing that the variational/marginal terms in the ELBO separate despite this parameter sharing; without it, the preservation of rFID cannot be attributed to the formalization rather than an unstated implementation choice (e.g., frozen layers or separate prediction heads).
Authors: We agree that an explicit derivation is needed to rigorously support the decoupling claim given the shared parameters. In the revised version, we will expand the ELBO analysis in Section 3 with a step-by-step derivation. This will show that the prologue cross-entropy term optimizes a distinct prefix distribution in the joint ELBO, while the visual token reconstruction term remains isolated in the marginal likelihood; the shared transformer weights do not mix the objectives because the prologue loss does not back-propagate into the reconstruction likelihood for visual tokens. We will also confirm that the implementation uses no frozen layers or separate heads, ensuring the rFID preservation follows directly from the formal separation rather than unstated choices.
Revision: yes
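One way the claimed separation can be written schematically (the notation $z_p$, $z_v$ and the exact loss forms are an illustration here, not necessarily the paper's):

```latex
% Schematic only: z_p = K prologue tokens, z_v = visual tokens, x = image.
\begin{align*}
  \mathcal{L}_{\mathrm{rec}}(x)
      &= \bigl\lVert x - \hat{x}(z_v) \bigr\rVert^{2}
      &&\text{(visual tokens: reconstruction only)}\\
  \mathcal{L}_{\mathrm{CE}}(z_p)
      &= -\sum_{k=1}^{K} \log p_{\theta}\bigl(z_{p,k} \mid z_{p,<k}\bigr)
      &&\text{(prologue tokens: AR cross-entropy only)}
\end{align*}
% Decoupling claim: \partial\mathcal{L}_{\mathrm{CE}}/\partial z_v = 0 by
% construction, so rFID, which depends only on \hat{x}(z_v), is unaffected.
% The referee's objection is that the shared parameters \theta may still
% couple the two terms; the promised derivation has to address exactly that.
```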
Referee: [Section 4] Experimental protocol (Section 4 and Appendix): the manuscript must clarify whether the visual tokenizer and reconstruction objective are held completely fixed during prologue training or whether any joint fine-tuning occurs. If any parameters are updated jointly, the reported “almost unchanged” rFID requires quantitative before/after tables and controls for confounding factors such as training schedule or data augmentation.
Authors: The visual tokenizer and reconstruction objective are held completely fixed; only the prologue tokens are optimized via the autoregressive cross-entropy loss, with no updates to the tokenizer parameters or reconstruction loss terms. We will revise Section 4 and the Appendix to state this protocol explicitly. We will also add a quantitative before/after rFID table and include controls for training schedule and data augmentation to rule out confounding effects.
Revision: yes
Circularity Check
No circularity: the method introduces independent prologue tokens and reports empirical decoupling.
Full rationale
The paper defines a new architectural component (prologue tokens prepended to the visual sequence) and a training split (CE loss applied only to prologue positions, reconstruction loss on visual tokens). The ELBO formalization is presented as supporting the claim that these objectives remain decoupled under parameter sharing, but the provided text contains no equations that reduce the claimed separation to a tautology or to a fitted parameter renamed as a prediction. Results are obtained by training the combined model and measuring gFID/rFID on ImageNet; no load-bearing step collapses to self-citation, ansatz smuggling, or renaming of a known result. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Free parameters (1)
- Number of prologue tokens
Axioms (1)
- Domain assumption: the ELBO perspective formalizes the decoupled training without affecting reconstruction.
Invented entities (1)
- Prologue tokens (no independent evidence)