IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Bo He; Junke Wang; Lingyu Kong; Yitong Chen; Yixuan Ren; Yu-Gang Jiang; Zijie Diao; Zuxuan Wu

arxiv: 2606.11096 · v1 · pith:UPHSYJ3Nnew · submitted 2026-06-09 · 💻 cs.CV

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Yitong Chen , Zijie Diao , Junke Wang , Lingyu Kong , Yixuan Ren , Bo He , Yu-Gang Jiang , Zuxuan Wu This is my paper

Pith reviewed 2026-06-27 13:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords discrete representation autoencodervision foundation modelsshallow deep feature alignmentimage reconstructionautoregressive image generationquantized tokensrFIDgFID

0 comments

The pith

Aligning quantized tokens with both shallow and deep VFM features produces discrete visual tokens that keep fidelity and semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ideal, a framework for discrete representation autoencoders that jointly aligns quantized image tokens with shallow and deep features from pretrained vision foundation models. Prior approaches use only deep features, which lose fine-grained local details during discretization and yield suboptimal reconstruction. The method exploits the observation that shallow features supply complementary appearance and structure to recover that lost information. This yields stronger discrete tokens for downstream tasks such as autoregressive image generation.

Core claim

By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28; when used for autoregressive image generation it produces a gFID of 1.89.

What carries the argument

The in-depth alignment framework that jointly aligns quantized tokens with shallow and deep VFM features.

Load-bearing premise

Shallow VFM features retain considerably richer local appearance and structural detail that can recover low-level information lost after discretizing deep features alone.

What would settle it

An experiment in which joint shallow-plus-deep alignment produces no rFID improvement over deep-only alignment on ImageNet would falsify the claim.

read the original abstract

Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

IDEAL claims better discrete tokens by jointly aligning to shallow and deep VFM features, with reported gains of 0.61 rFID and 1.89 gFID, but the abstract gives no loss details or ablations so the mechanism remains unverified.

read the letter

The main point on this paper is that IDEAL extends representation autoencoders by aligning quantized tokens to both shallow and deep layers from vision foundation models, arguing that shallow features supply the local detail lost in deep-only discretization. It reports 0.61 rFID on ImageNet reconstruction, 0.28 better than prior work, and 1.89 gFID in autoregressive generation.

The new piece is the explicit joint alignment to shallow features on top of the usual deep ones. The motivation is clear: deep features carry semantics but drop fine structure after quantization, while shallow ones keep appearance and structure. If the full paper shows this alignment actually drives the numbers through controlled experiments, it is a practical step for discrete tokenizers used in generation.

The paper does a reasonable job stating the problem and the complementary property of the features. The benchmark numbers are competitive in the autoregressive setting, which is the relevant downstream task.

The soft spots are real and centered on missing evidence. The abstract supplies no equation for the alignment loss, no description of the quantization operator, and no ablation isolating the shallow term. The stress-test concern holds: without those, it is impossible to confirm that the single codebook plus joint loss lets tokens encode both levels rather than gains coming from other unmentioned changes. The assumption that shallow features retain richer local detail is plausible but needs direct testing.

This is for people working on discrete latent spaces and autoregressive image models in computer vision. Readers tracking VQ-style methods or VFM-based encoders would get the most from the benchmarks.

The work shows honest engagement with a concrete limitation in existing RAEs. It deserves a serious referee to examine the full method, ablations, and experimental controls.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces IDEAL, an in-depth alignment framework for discrete representation autoencoders built on pretrained vision foundation models. It claims that jointly aligning quantized tokens to both shallow and deep VFM features allows the discrete tokens to retain low-level visual detail (lost in deep-only discretization) alongside high-level semantics, yielding 0.61 rFID on ImageNet (0.28 better than prior best) and 1.89 gFID for autoregressive image generation.

Significance. If the empirical gains are robustly attributable to the proposed alignment rather than unstated factors, the work would meaningfully advance discrete latent representations for image generation by exploiting complementary information across VFM depths. The reported metrics represent substantial improvements over prior RAEs.

major comments (2)

[Abstract] Abstract: performance numbers (0.61 rFID, 1.89 gFID) are stated without any experimental protocol, baseline details, ablation studies, or error analysis, so it is impossible to determine whether gains are due to the joint shallow+deep alignment or other factors.
[Abstract] Abstract: the central mechanism—that a single finite codebook can simultaneously encode both high-level semantics from deep features and fine-grained appearance from shallow features via joint alignment—is asserted without any equation for the alignment loss, quantization operator, or ablation isolating the shallow term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The concerns focus on the abstract's conciseness. The full manuscript (Sections 3 and 4, plus supplementary material) provides the experimental protocols, baselines, ablations, equations for the alignment loss and quantization, and analysis isolating the shallow term. We will make targeted revisions to the abstract for better context while preserving its brevity.

read point-by-point responses

Referee: [Abstract] Abstract: performance numbers (0.61 rFID, 1.89 gFID) are stated without any experimental protocol, baseline details, ablation studies, or error analysis, so it is impossible to determine whether gains are due to the joint shallow+deep alignment or other factors.

Authors: The abstract summarizes key outcomes; full details appear in the manuscript. Section 4.1 describes the ImageNet training protocol, rFID/gFID computation, and comparison to prior RAEs (e.g., VQGAN, RQ-VAE). Section 4.3 and Table 3 contain ablations that isolate the shallow alignment contribution (0.15 rFID gain when added to deep-only baseline). Multiple-run error bars are in Table 2. We will revise the abstract to add one sentence noting 'evaluated on ImageNet with standard rFID/gFID metrics and ablations confirming the joint alignment benefit.' revision: partial
Referee: [Abstract] Abstract: the central mechanism—that a single finite codebook can simultaneously encode both high-level semantics from deep features and fine-grained appearance from shallow features via joint alignment—is asserted without any equation for the alignment loss, quantization operator, or ablation isolating the shallow term.

Authors: Abstracts conventionally omit equations. The joint alignment objective (combining L_shallow and L_deep with the quantization operator Q) is defined in Equations (2)–(4) of Section 3.1. The ablation isolating the shallow term is in Table 3 (Section 4.3), demonstrating its necessity for low-level detail preservation. The abstract's description is therefore supported by the technical sections; we do not plan to insert equations into the abstract itself. revision: no

Circularity Check

0 steps flagged

No circularity; empirical method with external benchmarks

full rationale

The paper describes an empirical training procedure that jointly aligns quantized tokens to shallow and deep VFM features, reporting reconstruction and generation metrics (rFID, gFID) on ImageNet. No equations, derivations, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains are invoked to justify uniqueness or load-bearing premises. The contribution rests on observable benchmark improvements rather than any self-referential mathematical step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is empirical; the abstract introduces no new mathematical axioms, free parameters, or invented entities beyond standard components of representation autoencoders and vision foundation models.

pith-pipeline@v0.9.1-grok · 5760 in / 1127 out tokens · 29143 ms · 2026-06-27T13:08:35.054364+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

75 extracted references · 3 canonical work pages

[1]

Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013. URLhttps://arxiv.org/abs/1308.3432

Pith/arXiv arXiv 2013
[2]

Perception encoder: The best visual embeddings are not at the output of the network

DanielBolya,Po-YaoHuang,PeizeSun,JangHyunCho,AndreaMadotto,ChenWei,TengyuMa,JialeZhi,Jathushan Rajasegaran, Hanoona Bangalath, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

2025
[3]

Perception encoder: The best visual embeddings are not at the output of the network

DanielBolya,Po-YaoHuang,PeizeSun,JangHyunCho,AndreaMadotto,ChenWei,TengyuMa,JialeZhi,Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

2025
[4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conferenceon ComputerVision(ICCV), pages 9650–9660, October 2021

2021
[5]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InCVPR, 2022

2022
[6]

Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

arXiv 2025
[7]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InICLR, 2024

2024
[8]

ImageNet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009
[9]

Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025

Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, and Chun Yuan. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025. URLhttps://arxiv.org/abs/2511.23386. 11

arXiv 2025
[10]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

2021
[11]

Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

Pith/arXiv arXiv 2023
[12]

One layer is enough: Adapting pretrained visual encoders for image generation, 2025

Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation, 2025. URLhttps://arxiv.org/abs/2512.07829

arXiv 2025
[13]

Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022

Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022. URLhttps://arxiv.org/abs/ 2204.03638

arXiv 2022
[14]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

2017
[15]

beta-vae: Learning basic visual concepts with a constrained variational framework

IrinaHiggins,LoïcMatthey,ArkaPal,ChristopherP.Burgess,XavierGlorot,MatthewM.Botvinick,ShakirMohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016. URL https://api.semanticscholar.org/ CorpusID:46798026

2016
[16]

Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering, 2019

DrewA.HudsonandChristopherD.Manning. Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering, 2019. URLhttps://arxiv.org/abs/1902.09506

Pith/arXiv arXiv 2019
[17]

Image-to-imagetranslationwithconditionaladversarial networks

PhillipIsola,Jun-YanZhu,TinghuiZhou,andAlexeiAEfros. Image-to-imagetranslationwithconditionaladversarial networks. InCVPR, 2017

2017
[18]

Dino-tok: Adapting dino for visual tokenizers, 2025

Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, and Wei Yin. Dino-tok: Adapting dino for visual tokenizers, 2025. URLhttps://arxiv.org/ abs/2511.20565

arXiv 2025
[19]

Product Quantization for Nearest Neighbor Search,

Herve Jégou, Matthĳs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE Transactionson PatternAnalysisandMachineIntelligence, 33(1):117–128, 2011. doi: 10.1109/TPAMI.2010.57

work page doi:10.1109/tpami.2010.57 2011
[20]

Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. InICLR, 2014

2014
[21]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uĳlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov,MatteoMalloci,AlexanderKolesnikov,TomDuerig,andVittorioFerrari. Theopenimagesdatasetv4: Unified image classification, object detection, and visual relationship detection at scale.International Journal of Computer Vision, 128, 03 2020. doi...

work page doi:10.1007/s11263-020-01316-z 2020
[22]

Improved precision and recall metric for assessing generative models, 2019

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models, 2019. URLhttps://arxiv.org/abs/1904.06991

arXiv 2019
[23]

Autoregressive image generation using residual quantization, 2022

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization, 2022. URLhttps://arxiv.org/abs/2203.01941

arXiv 2022
[24]

Seed-bench: Benchmarkingmultimodal llms with generative comprehension.arXivpreprint arXiv:2307.16125, 2023

BohaoLi,RuiWang,GuangzhiWang,YuyingGe,YixiaoGe,andYingShan. Seed-bench: Benchmarkingmultimodal llms with generative comprehension.arXivpreprint arXiv:2307.16125, 2023

Pith/arXiv arXiv 2023
[25]

Imagefolder: Autoregressive image generation with folded tokens, 2024

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens, 2024. URLhttps://arxiv.org/abs/2410.01756

arXiv 2024
[26]

Evaluating object hallucination in large vision-language models, 2023

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023. URLhttps://arxiv.org/abs/2305.10355

Pith/arXiv arXiv 2023
[27]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps://arxiv.org/abs/ 1711.05101

Pith/arXiv arXiv 2019
[28]

Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprintarXiv:2409.04410, 2024

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprintarXiv:2409.04410, 2024

arXiv 2024
[29]

Unitok: A unified tokenizer for visual generation and understanding, 2025

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding, 2025. URLhttps://arxiv.org/abs/2502.20321. 12

arXiv 2025
[30]

Albergo, Nicholas M

Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eĳnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024. URLhttps: //arxiv.org/abs/2401.08740

arXiv 2024
[31]

Chartqa: A benchmark for question answering about charts with visual and logical reasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InACLFindings, 2022

2022
[32]

Docvqa: Adatasetforvqaondocumentimages

MineshMathew,DimosthenisKaratzas,andCVJawahar. Docvqa: Adatasetforvqaondocumentimages. In WACV, 2021

2021
[33]

Finitescalarquantization: Vq-vaemade simple, 2023

FabianMentzer, DavidMinnen, EirikurAgustsson, andMichaelTschannen. Finitescalarquantization: Vq-vaemade simple, 2023. URLhttps://arxiv.org/abs/2309.15505

Pith/arXiv arXiv 2023
[34]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024

2024
[35]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

2023
[36]

Learning transferable visual models from natural language supervision, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021
[37]

Generating diverse high-fidelity images with vq-vae-2, 2019

Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2, 2019

2019
[38]

Beyond next-token: Next-x prediction for autoregressive visual generation, 2025

Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation, 2025. URLhttps://arxiv.org/abs/2502.20388

arXiv 2025
[39]

High-resolution image synthesis with latent diffusion models, 2022

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URLhttps://arxiv.org/abs/2112.10752

Pith/arXiv arXiv 2022
[40]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. InNeurIPS, 2016

2016
[41]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InECCV, 2022

2022
[42]

Latent diffusion model without variational autoencoder

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. InICLR, 2026

2026
[43]

Dinov3.arXivpreprint arXiv:2508.10104, 2025

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cĳo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXivpreprint arXiv:2508.10104, 2025

Pith/arXiv arXiv 2025
[44]

Towards vqa models that can read, 2019

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019. URLhttps://arxiv.org/abs/1904.08920

Pith/arXiv arXiv 2019
[45]

Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies

Wei Song, Yuran Wang, Zĳia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. InICLR, 2026

2026
[46]

Roformer: Enhanced transformer with rotary position embedding, 2023

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URLhttps://arxiv.org/abs/2104.09864

Pith/arXiv arXiv 2023
[47]

Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprintarXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprintarXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[48]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. InNeurIPS, 2024

2024
[49]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024

2024
[50]

Siglip2: Multilingualvision-languageencoders with improved semantic understanding, localization, and dense features.arXivpreprintarXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy,TalfanEvans,LucasBeyer,YeXia,BasilMustafa,etal. Siglip2: Multilingualvision-languageencoders with improved semantic understanding, localization, and dense features.arXivpreprintarXiv:2502.14786, 2025. 13

Pith/arXiv arXiv 2025
[51]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

2017
[52]

Neural discrete representation learning, 2018

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. URL https://arxiv.org/abs/1711.00937

Pith/arXiv arXiv 2018
[53]

Omnitokenizer: A joint image-video tokenizer for visual generation

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. InNeurIPS, 2024

2024
[54]

Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl.arXiv preprint arXiv:2504.11455, 2025

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl.arXiv preprint arXiv:2504.11455, 2025

arXiv 2025
[55]

Omnigen-ar: Autoregressive any-to-image generation

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. InNeurIPS, 2025

2025
[56]

Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025

Ge Wu, Shen Zhang, Ruĳing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025. URLhttps://arxiv.org/abs/2507.01467

arXiv 2025
[57]

Grok-1.5 vision preview, 2024

xAI Team. Grok-1.5 vision preview, 2024. URLhttps://x.ai/blog/grok-1.5v

2024
[58]

Vision transformer with deformable attention,

Zhuofan Xia, Xuran Pan, Shĳi Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention,
[59]

URLhttps://arxiv.org/abs/2201.00520

arXiv
[60]

Videogpt: Video generation using vq-vae and transformers, 2021

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers, 2021

2021
[61]

Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024

Jingfeng Yao, Wang Cheng, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024. URLhttps://arxiv.org/abs/2410.10356

arXiv 2024
[62]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings ofthe IEEE/CVF ConferenceonComputerVisionand PatternRecognition, 2025

2025
[63]

Vector-quantized image modeling with improved vqgan, 2022

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2022. URL https: //arxiv.org/abs/2110.04627

Pith/arXiv arXiv 2022
[64]

Scaling autoregressive models for content-rich text-to-image generation, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vĳay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789

Pith/arXiv arXiv 2022
[65]

Language model beats diffusion–tokenizer is key to visual generation

Lĳun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. In ICLR, 2024

2024
[66]

An image is worth 32 tokens for reconstruction and generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. InNeurIPS, 2024

2024
[67]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InICLR, 2024

2024
[68]

Sigmoid loss for language image pre-training, 2023

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023

2023
[69]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018
[70]

Spherical leech quantization for visual tokenization and generation, 2025

Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, and Philipp Krähenbühl. Spherical leech quantization for visual tokenization and generation, 2025. URLhttps://arxiv.org/abs/2512.14697

arXiv 2025
[71]

Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025

Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025. URL https://arxiv.org/abs/2507.08441. 14

arXiv 2025
[72]

Diffusion transformers with representation autoen- coders, 2025

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoen- coders, 2025. URLhttps://arxiv.org/abs/2510.11690

Pith/arXiv arXiv 2025
[73]

Fast training of diffusion models with masked transformers, 2024

Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers, 2024. URLhttps://arxiv.org/abs/2306.09305

arXiv 2024
[74]

Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024

Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024. URLhttps://arxiv.org/abs/2406.11837

arXiv 2024
[75]

Addressingrepresentationcollapseinvectorquantized models with one linear layer, 2025

YongxinZhu,BochengLi,YifeiXin,ZhihuaXia,andLinliXu. Addressingrepresentationcollapseinvectorquantized models with one linear layer, 2025. URLhttps://arxiv.org/abs/2411.02038. 15 AIdealImplementation Details A.1 Tokenizer Training Details Overall, our tokenizer training recipe closely follows prior work VFMTok [70]. Since VFMTok uses a VFM with a patch siz...

arXiv 2025

[1] [1]

Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013. URLhttps://arxiv.org/abs/1308.3432

Pith/arXiv arXiv 2013

[2] [2]

Perception encoder: The best visual embeddings are not at the output of the network

DanielBolya,Po-YaoHuang,PeizeSun,JangHyunCho,AndreaMadotto,ChenWei,TengyuMa,JialeZhi,Jathushan Rajasegaran, Hanoona Bangalath, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

2025

[3] [3]

Perception encoder: The best visual embeddings are not at the output of the network

DanielBolya,Po-YaoHuang,PeizeSun,JangHyunCho,AndreaMadotto,ChenWei,TengyuMa,JialeZhi,Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

2025

[4] [4]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conferenceon ComputerVision(ICCV), pages 9650–9660, October 2021

2021

[5] [5]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InCVPR, 2022

2022

[6] [6]

Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

arXiv 2025

[7] [7]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InICLR, 2024

2024

[8] [8]

ImageNet: A large- scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

work page doi:10.1109/cvpr.2009.5206848 2009

[9] [9]

Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025

Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, and Chun Yuan. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025. URLhttps://arxiv.org/abs/2511.23386. 11

arXiv 2025

[10] [10]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

2021

[11] [11]

Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

Pith/arXiv arXiv 2023

[12] [12]

One layer is enough: Adapting pretrained visual encoders for image generation, 2025

Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation, 2025. URLhttps://arxiv.org/abs/2512.07829

arXiv 2025

[13] [13]

Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022

Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022. URLhttps://arxiv.org/abs/ 2204.03638

arXiv 2022

[14] [14]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

2017

[15] [15]

beta-vae: Learning basic visual concepts with a constrained variational framework

IrinaHiggins,LoïcMatthey,ArkaPal,ChristopherP.Burgess,XavierGlorot,MatthewM.Botvinick,ShakirMohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016. URL https://api.semanticscholar.org/ CorpusID:46798026

2016

[16] [16]

Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering, 2019

DrewA.HudsonandChristopherD.Manning. Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering, 2019. URLhttps://arxiv.org/abs/1902.09506

Pith/arXiv arXiv 2019

[17] [17]

Image-to-imagetranslationwithconditionaladversarial networks

PhillipIsola,Jun-YanZhu,TinghuiZhou,andAlexeiAEfros. Image-to-imagetranslationwithconditionaladversarial networks. InCVPR, 2017

2017

[18] [18]

Dino-tok: Adapting dino for visual tokenizers, 2025

Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, and Wei Yin. Dino-tok: Adapting dino for visual tokenizers, 2025. URLhttps://arxiv.org/ abs/2511.20565

arXiv 2025

[19] [19]

Product Quantization for Nearest Neighbor Search,

Herve Jégou, Matthĳs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE Transactionson PatternAnalysisandMachineIntelligence, 33(1):117–128, 2011. doi: 10.1109/TPAMI.2010.57

work page doi:10.1109/tpami.2010.57 2011

[20] [20]

Auto-encoding variational bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. InICLR, 2014

2014

[21] [21]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uĳlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov,MatteoMalloci,AlexanderKolesnikov,TomDuerig,andVittorioFerrari. Theopenimagesdatasetv4: Unified image classification, object detection, and visual relationship detection at scale.International Journal of Computer Vision, 128, 03 2020. doi...

work page doi:10.1007/s11263-020-01316-z 2020

[22] [22]

Improved precision and recall metric for assessing generative models, 2019

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models, 2019. URLhttps://arxiv.org/abs/1904.06991

arXiv 2019

[23] [23]

Autoregressive image generation using residual quantization, 2022

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization, 2022. URLhttps://arxiv.org/abs/2203.01941

arXiv 2022

[24] [24]

Seed-bench: Benchmarkingmultimodal llms with generative comprehension.arXivpreprint arXiv:2307.16125, 2023

BohaoLi,RuiWang,GuangzhiWang,YuyingGe,YixiaoGe,andYingShan. Seed-bench: Benchmarkingmultimodal llms with generative comprehension.arXivpreprint arXiv:2307.16125, 2023

Pith/arXiv arXiv 2023

[25] [25]

Imagefolder: Autoregressive image generation with folded tokens, 2024

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens, 2024. URLhttps://arxiv.org/abs/2410.01756

arXiv 2024

[26] [26]

Evaluating object hallucination in large vision-language models, 2023

Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023. URLhttps://arxiv.org/abs/2305.10355

Pith/arXiv arXiv 2023

[27] [27]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps://arxiv.org/abs/ 1711.05101

Pith/arXiv arXiv 2019

[28] [28]

Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprintarXiv:2409.04410, 2024

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprintarXiv:2409.04410, 2024

arXiv 2024

[29] [29]

Unitok: A unified tokenizer for visual generation and understanding, 2025

Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding, 2025. URLhttps://arxiv.org/abs/2502.20321. 12

arXiv 2025

[30] [30]

Albergo, Nicholas M

Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eĳnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024. URLhttps: //arxiv.org/abs/2401.08740

arXiv 2024

[31] [31]

Chartqa: A benchmark for question answering about charts with visual and logical reasoning

Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InACLFindings, 2022

2022

[32] [32]

Docvqa: Adatasetforvqaondocumentimages

MineshMathew,DimosthenisKaratzas,andCVJawahar. Docvqa: Adatasetforvqaondocumentimages. In WACV, 2021

2021

[33] [33]

Finitescalarquantization: Vq-vaemade simple, 2023

FabianMentzer, DavidMinnen, EirikurAgustsson, andMichaelTschannen. Finitescalarquantization: Vq-vaemade simple, 2023. URLhttps://arxiv.org/abs/2309.15505

Pith/arXiv arXiv 2023

[34] [34]

Dinov2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024

2024

[35] [35]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

2023

[36] [36]

Learning transferable visual models from natural language supervision, 2021

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020

Pith/arXiv arXiv 2021

[37] [37]

Generating diverse high-fidelity images with vq-vae-2, 2019

Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2, 2019

2019

[38] [38]

Beyond next-token: Next-x prediction for autoregressive visual generation, 2025

Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation, 2025. URLhttps://arxiv.org/abs/2502.20388

arXiv 2025

[39] [39]

High-resolution image synthesis with latent diffusion models, 2022

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URLhttps://arxiv.org/abs/2112.10752

Pith/arXiv arXiv 2022

[40] [40]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. InNeurIPS, 2016

2016

[41] [41]

A-okvqa: A benchmark for visual question answering using world knowledge

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InECCV, 2022

2022

[42] [42]

Latent diffusion model without variational autoencoder

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. InICLR, 2026

2026

[43] [43]

Dinov3.arXivpreprint arXiv:2508.10104, 2025

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cĳo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXivpreprint arXiv:2508.10104, 2025

Pith/arXiv arXiv 2025

[44] [44]

Towards vqa models that can read, 2019

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019. URLhttps://arxiv.org/abs/1904.08920

Pith/arXiv arXiv 2019

[45] [45]

Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies

Wei Song, Yuran Wang, Zĳia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. InICLR, 2026

2026

[46] [46]

Roformer: Enhanced transformer with rotary position embedding, 2023

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URLhttps://arxiv.org/abs/2104.09864

Pith/arXiv arXiv 2023

[47] [47]

Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprintarXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprintarXiv:2406.06525, 2024

Pith/arXiv arXiv 2024

[48] [48]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. InNeurIPS, 2024

2024

[49] [49]

Cambrian-1: A fully open, vision-centric exploration of multimodal llms

Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024

2024

[50] [50]

Siglip2: Multilingualvision-languageencoders with improved semantic understanding, localization, and dense features.arXivpreprintarXiv:2502.14786, 2025

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy,TalfanEvans,LucasBeyer,YeXia,BasilMustafa,etal. Siglip2: Multilingualvision-languageencoders with improved semantic understanding, localization, and dense features.arXivpreprintarXiv:2502.14786, 2025. 13

Pith/arXiv arXiv 2025

[51] [51]

Neural discrete representation learning

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

2017

[52] [52]

Neural discrete representation learning, 2018

Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. URL https://arxiv.org/abs/1711.00937

Pith/arXiv arXiv 2018

[53] [53]

Omnitokenizer: A joint image-video tokenizer for visual generation

Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. InNeurIPS, 2024

2024

[54] [54]

Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl.arXiv preprint arXiv:2504.11455, 2025

Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl.arXiv preprint arXiv:2504.11455, 2025

arXiv 2025

[55] [55]

Omnigen-ar: Autoregressive any-to-image generation

Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. InNeurIPS, 2025

2025

[56] [56]

Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025

Ge Wu, Shen Zhang, Ruĳing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025. URLhttps://arxiv.org/abs/2507.01467

arXiv 2025

[57] [57]

Grok-1.5 vision preview, 2024

xAI Team. Grok-1.5 vision preview, 2024. URLhttps://x.ai/blog/grok-1.5v

2024

[58] [58]

Vision transformer with deformable attention,

Zhuofan Xia, Xuran Pan, Shĳi Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention,

[59] [59]

URLhttps://arxiv.org/abs/2201.00520

arXiv

[60] [60]

Videogpt: Video generation using vq-vae and transformers, 2021

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers, 2021

2021

[61] [61]

Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024

Jingfeng Yao, Wang Cheng, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024. URLhttps://arxiv.org/abs/2410.10356

arXiv 2024

[62] [62]

Reconstruction vs

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings ofthe IEEE/CVF ConferenceonComputerVisionand PatternRecognition, 2025

2025

[63] [63]

Vector-quantized image modeling with improved vqgan, 2022

Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2022. URL https: //arxiv.org/abs/2110.04627

Pith/arXiv arXiv 2022

[64] [64]

Scaling autoregressive models for content-rich text-to-image generation, 2022

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vĳay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789

Pith/arXiv arXiv 2022

[65] [65]

Language model beats diffusion–tokenizer is key to visual generation

Lĳun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. In ICLR, 2024

2024

[66] [66]

An image is worth 32 tokens for reconstruction and generation

Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. InNeurIPS, 2024

2024

[67] [67]

Representation alignment for generation: Training diffusion transformers is easier than you think

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InICLR, 2024

2024

[68] [68]

Sigmoid loss for language image pre-training, 2023

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023

2023

[69] [69]

The unreasonable effectiveness of deep features as a perceptual metric

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

2018

[70] [70]

Spherical leech quantization for visual tokenization and generation, 2025

Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, and Philipp Krähenbühl. Spherical leech quantization for visual tokenization and generation, 2025. URLhttps://arxiv.org/abs/2512.14697

arXiv 2025

[71] [71]

Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025

Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025. URL https://arxiv.org/abs/2507.08441. 14

arXiv 2025

[72] [72]

Diffusion transformers with representation autoen- coders, 2025

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoen- coders, 2025. URLhttps://arxiv.org/abs/2510.11690

Pith/arXiv arXiv 2025

[73] [73]

Fast training of diffusion models with masked transformers, 2024

Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers, 2024. URLhttps://arxiv.org/abs/2306.09305

arXiv 2024

[74] [74]

Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024

Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024. URLhttps://arxiv.org/abs/2406.11837

arXiv 2024

[75] [75]

Addressingrepresentationcollapseinvectorquantized models with one linear layer, 2025

YongxinZhu,BochengLi,YifeiXin,ZhihuaXia,andLinliXu. Addressingrepresentationcollapseinvectorquantized models with one linear layer, 2025. URLhttps://arxiv.org/abs/2411.02038. 15 AIdealImplementation Details A.1 Tokenizer Training Details Overall, our tokenizer training recipe closely follows prior work VFMTok [70]. Since VFMTok uses a VFM with a patch siz...

arXiv 2025