pith. sign in

arxiv: 2606.11096 · v1 · pith:UPHSYJ3Nnew · submitted 2026-06-09 · 💻 cs.CV

IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder

Pith reviewed 2026-06-27 13:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords discrete representation autoencodervision foundation modelsshallow deep feature alignmentimage reconstructionautoregressive image generationquantized tokensrFIDgFID
0
0 comments X

The pith

Aligning quantized tokens with both shallow and deep VFM features produces discrete visual tokens that keep fidelity and semantics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ideal, a framework for discrete representation autoencoders that jointly aligns quantized image tokens with shallow and deep features from pretrained vision foundation models. Prior approaches use only deep features, which lose fine-grained local details during discretization and yield suboptimal reconstruction. The method exploits the observation that shallow features supply complementary appearance and structure to recover that lost information. This yields stronger discrete tokens for downstream tasks such as autoregressive image generation.

Core claim

By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28; when used for autoregressive image generation it produces a gFID of 1.89.

What carries the argument

The in-depth alignment framework that jointly aligns quantized tokens with shallow and deep VFM features.

Load-bearing premise

Shallow VFM features retain considerably richer local appearance and structural detail that can recover low-level information lost after discretizing deep features alone.

What would settle it

An experiment in which joint shallow-plus-deep alignment produces no rFID improvement over deep-only alignment on ImageNet would falsify the claim.

read the original abstract

Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces IDEAL, an in-depth alignment framework for discrete representation autoencoders built on pretrained vision foundation models. It claims that jointly aligning quantized tokens to both shallow and deep VFM features allows the discrete tokens to retain low-level visual detail (lost in deep-only discretization) alongside high-level semantics, yielding 0.61 rFID on ImageNet (0.28 better than prior best) and 1.89 gFID for autoregressive image generation.

Significance. If the empirical gains are robustly attributable to the proposed alignment rather than unstated factors, the work would meaningfully advance discrete latent representations for image generation by exploiting complementary information across VFM depths. The reported metrics represent substantial improvements over prior RAEs.

major comments (2)
  1. [Abstract] Abstract: performance numbers (0.61 rFID, 1.89 gFID) are stated without any experimental protocol, baseline details, ablation studies, or error analysis, so it is impossible to determine whether gains are due to the joint shallow+deep alignment or other factors.
  2. [Abstract] Abstract: the central mechanism—that a single finite codebook can simultaneously encode both high-level semantics from deep features and fine-grained appearance from shallow features via joint alignment—is asserted without any equation for the alignment loss, quantization operator, or ablation isolating the shallow term.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The concerns focus on the abstract's conciseness. The full manuscript (Sections 3 and 4, plus supplementary material) provides the experimental protocols, baselines, ablations, equations for the alignment loss and quantization, and analysis isolating the shallow term. We will make targeted revisions to the abstract for better context while preserving its brevity.

read point-by-point responses
  1. Referee: [Abstract] Abstract: performance numbers (0.61 rFID, 1.89 gFID) are stated without any experimental protocol, baseline details, ablation studies, or error analysis, so it is impossible to determine whether gains are due to the joint shallow+deep alignment or other factors.

    Authors: The abstract summarizes key outcomes; full details appear in the manuscript. Section 4.1 describes the ImageNet training protocol, rFID/gFID computation, and comparison to prior RAEs (e.g., VQGAN, RQ-VAE). Section 4.3 and Table 3 contain ablations that isolate the shallow alignment contribution (0.15 rFID gain when added to deep-only baseline). Multiple-run error bars are in Table 2. We will revise the abstract to add one sentence noting 'evaluated on ImageNet with standard rFID/gFID metrics and ablations confirming the joint alignment benefit.' revision: partial

  2. Referee: [Abstract] Abstract: the central mechanism—that a single finite codebook can simultaneously encode both high-level semantics from deep features and fine-grained appearance from shallow features via joint alignment—is asserted without any equation for the alignment loss, quantization operator, or ablation isolating the shallow term.

    Authors: Abstracts conventionally omit equations. The joint alignment objective (combining L_shallow and L_deep with the quantization operator Q) is defined in Equations (2)–(4) of Section 3.1. The ablation isolating the shallow term is in Table 3 (Section 4.3), demonstrating its necessity for low-level detail preservation. The abstract's description is therefore supported by the technical sections; we do not plan to insert equations into the abstract itself. revision: no

Circularity Check

0 steps flagged

No circularity; empirical method with external benchmarks

full rationale

The paper describes an empirical training procedure that jointly aligns quantized tokens to shallow and deep VFM features, reporting reconstruction and generation metrics (rFID, gFID) on ImageNet. No equations, derivations, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains are invoked to justify uniqueness or load-bearing premises. The contribution rests on observable benchmark improvements rather than any self-referential mathematical step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is empirical; the abstract introduces no new mathematical axioms, free parameters, or invented entities beyond standard components of representation autoencoders and vision foundation models.

pith-pipeline@v0.9.1-grok · 5760 in / 1127 out tokens · 29143 ms · 2026-06-27T13:08:35.054364+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

75 extracted references · 3 canonical work pages

  1. [1]

    Estimating or propagating gradients through stochastic neurons for conditional computation, 2013

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013. URLhttps://arxiv.org/abs/1308.3432

  2. [2]

    Perception encoder: The best visual embeddings are not at the output of the network

    DanielBolya,Po-YaoHuang,PeizeSun,JangHyunCho,AndreaMadotto,ChenWei,TengyuMa,JialeZhi,Jathushan Rajasegaran, Hanoona Bangalath, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

  3. [3]

    Perception encoder: The best visual embeddings are not at the output of the network

    DanielBolya,Po-YaoHuang,PeizeSun,JangHyunCho,AndreaMadotto,ChenWei,TengyuMa,JialeZhi,Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. InNeurIPS, 2025

  4. [4]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conferenceon ComputerVision(ICCV), pages 9650–9660, October 2021

  5. [5]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. InCVPR, 2022

  6. [6]

    Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

    Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. Comp: Continual multimodal pre-training for vision foundation models.arXivpreprintarXiv:2503.18931, 2025

  7. [7]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InICLR, 2024

  8. [8]

    ImageNet: A large- scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848

  9. [9]

    Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025

    Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, and Chun Yuan. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025. URLhttps://arxiv.org/abs/2511.23386. 11

  10. [10]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021

  11. [11]

    Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

    Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models.arXiv preprint arXiv:2306.13394, 2023

  12. [12]

    One layer is enough: Adapting pretrained visual encoders for image generation, 2025

    Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation, 2025. URLhttps://arxiv.org/abs/2512.07829

  13. [13]

    Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022

    Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022. URLhttps://arxiv.org/abs/ 2204.03638

  14. [14]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. InNeurIPS, 2017

  15. [15]

    beta-vae: Learning basic visual concepts with a constrained variational framework

    IrinaHiggins,LoïcMatthey,ArkaPal,ChristopherP.Burgess,XavierGlorot,MatthewM.Botvinick,ShakirMohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016. URL https://api.semanticscholar.org/ CorpusID:46798026

  16. [16]

    Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering, 2019

    DrewA.HudsonandChristopherD.Manning. Gqa: Anewdatasetforreal-worldvisualreasoningandcompositional question answering, 2019. URLhttps://arxiv.org/abs/1902.09506

  17. [17]

    Image-to-imagetranslationwithconditionaladversarial networks

    PhillipIsola,Jun-YanZhu,TinghuiZhou,andAlexeiAEfros. Image-to-imagetranslationwithconditionaladversarial networks. InCVPR, 2017

  18. [18]

    Dino-tok: Adapting dino for visual tokenizers, 2025

    Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, and Wei Yin. Dino-tok: Adapting dino for visual tokenizers, 2025. URLhttps://arxiv.org/ abs/2511.20565

  19. [19]

    Product Quantization for Nearest Neighbor Search,

    Herve Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search.IEEE Transactionson PatternAnalysisandMachineIntelligence, 33(1):117–128, 2011. doi: 10.1109/TPAMI.2010.57

  20. [20]

    Auto-encoding variational bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. InICLR, 2014

  21. [21]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov,MatteoMalloci,AlexanderKolesnikov,TomDuerig,andVittorioFerrari. Theopenimagesdatasetv4: Unified image classification, object detection, and visual relationship detection at scale.International Journal of Computer Vision, 128, 03 2020. doi...

  22. [22]

    Improved precision and recall metric for assessing generative models, 2019

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models, 2019. URLhttps://arxiv.org/abs/1904.06991

  23. [23]

    Autoregressive image generation using residual quantization, 2022

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization, 2022. URLhttps://arxiv.org/abs/2203.01941

  24. [24]

    Seed-bench: Benchmarkingmultimodal llms with generative comprehension.arXivpreprint arXiv:2307.16125, 2023

    BohaoLi,RuiWang,GuangzhiWang,YuyingGe,YixiaoGe,andYingShan. Seed-bench: Benchmarkingmultimodal llms with generative comprehension.arXivpreprint arXiv:2307.16125, 2023

  25. [25]

    Imagefolder: Autoregressive image generation with folded tokens, 2024

    Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens, 2024. URLhttps://arxiv.org/abs/2410.01756

  26. [26]

    Evaluating object hallucination in large vision-language models, 2023

    Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023. URLhttps://arxiv.org/abs/2305.10355

  27. [27]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URLhttps://arxiv.org/abs/ 1711.05101

  28. [28]

    Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprintarXiv:2409.04410, 2024

    Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation.arXiv preprintarXiv:2409.04410, 2024

  29. [29]

    Unitok: A unified tokenizer for visual generation and understanding, 2025

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding, 2025. URLhttps://arxiv.org/abs/2502.20321. 12

  30. [30]

    Albergo, Nicholas M

    Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024. URLhttps: //arxiv.org/abs/2401.08740

  31. [31]

    Chartqa: A benchmark for question answering about charts with visual and logical reasoning

    Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. InACLFindings, 2022

  32. [32]

    Docvqa: Adatasetforvqaondocumentimages

    MineshMathew,DimosthenisKaratzas,andCVJawahar. Docvqa: Adatasetforvqaondocumentimages. In WACV, 2021

  33. [33]

    Finitescalarquantization: Vq-vaemade simple, 2023

    FabianMentzer, DavidMinnen, EirikurAgustsson, andMichaelTschannen. Finitescalarquantization: Vq-vaemade simple, 2023. URLhttps://arxiv.org/abs/2309.15505

  34. [34]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024

  35. [35]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023

  36. [36]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URLhttps://arxiv.org/abs/2103.00020

  37. [37]

    Generating diverse high-fidelity images with vq-vae-2, 2019

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2, 2019

  38. [38]

    Beyond next-token: Next-x prediction for autoregressive visual generation, 2025

    Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation, 2025. URLhttps://arxiv.org/abs/2502.20388

  39. [39]

    High-resolution image synthesis with latent diffusion models, 2022

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URLhttps://arxiv.org/abs/2112.10752

  40. [40]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. InNeurIPS, 2016

  41. [41]

    A-okvqa: A benchmark for visual question answering using world knowledge

    Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. InECCV, 2022

  42. [42]

    Latent diffusion model without variational autoencoder

    Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. InICLR, 2026

  43. [43]

    Dinov3.arXivpreprint arXiv:2508.10104, 2025

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXivpreprint arXiv:2508.10104, 2025

  44. [44]

    Towards vqa models that can read, 2019

    Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019. URLhttps://arxiv.org/abs/1904.08920

  45. [45]

    Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies

    Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. InICLR, 2026

  46. [46]

    Roformer: Enhanced transformer with rotary position embedding, 2023

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URLhttps://arxiv.org/abs/2104.09864

  47. [47]

    Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprintarXiv:2406.06525, 2024

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation.arXiv preprintarXiv:2406.06525, 2024

  48. [48]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. InNeurIPS, 2024

  49. [49]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024

  50. [50]

    Siglip2: Multilingualvision-languageencoders with improved semantic understanding, localization, and dense features.arXivpreprintarXiv:2502.14786, 2025

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy,TalfanEvans,LucasBeyer,YeXia,BasilMustafa,etal. Siglip2: Multilingualvision-languageencoders with improved semantic understanding, localization, and dense features.arXivpreprintarXiv:2502.14786, 2025. 13

  51. [51]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. InNeurIPS, 2017

  52. [52]

    Neural discrete representation learning, 2018

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. URL https://arxiv.org/abs/1711.00937

  53. [53]

    Omnitokenizer: A joint image-video tokenizer for visual generation

    Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. InNeurIPS, 2024

  54. [54]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl.arXiv preprint arXiv:2504.11455, 2025

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl.arXiv preprint arXiv:2504.11455, 2025

  55. [55]

    Omnigen-ar: Autoregressive any-to-image generation

    Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. InNeurIPS, 2025

  56. [56]

    Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025. URLhttps://arxiv.org/abs/2507.01467

  57. [57]

    Grok-1.5 vision preview, 2024

    xAI Team. Grok-1.5 vision preview, 2024. URLhttps://x.ai/blog/grok-1.5v

  58. [58]

    Vision transformer with deformable attention,

    Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention,

  59. [59]

    URLhttps://arxiv.org/abs/2201.00520

  60. [60]

    Videogpt: Video generation using vq-vae and transformers, 2021

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers, 2021

  61. [61]

    Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024

    Jingfeng Yao, Wang Cheng, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024. URLhttps://arxiv.org/abs/2410.10356

  62. [62]

    Reconstruction vs

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. InProceedings ofthe IEEE/CVF ConferenceonComputerVisionand PatternRecognition, 2025

  63. [63]

    Vector-quantized image modeling with improved vqgan, 2022

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2022. URL https: //arxiv.org/abs/2110.04627

  64. [64]

    Scaling autoregressive models for content-rich text-to-image generation, 2022

    Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789

  65. [65]

    Language model beats diffusion–tokenizer is key to visual generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. In ICLR, 2024

  66. [66]

    An image is worth 32 tokens for reconstruction and generation

    Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. InNeurIPS, 2024

  67. [67]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. InICLR, 2024

  68. [68]

    Sigmoid loss for language image pre-training, 2023

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023

  69. [69]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InCVPR, 2018

  70. [70]

    Spherical leech quantization for visual tokenization and generation, 2025

    Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, and Philipp Krähenbühl. Spherical leech quantization for visual tokenization and generation, 2025. URLhttps://arxiv.org/abs/2512.14697

  71. [71]

    Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025

    Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025. URL https://arxiv.org/abs/2507.08441. 14

  72. [72]

    Diffusion transformers with representation autoen- coders, 2025

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoen- coders, 2025. URLhttps://arxiv.org/abs/2510.11690

  73. [73]

    Fast training of diffusion models with masked transformers, 2024

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers, 2024. URLhttps://arxiv.org/abs/2306.09305

  74. [74]

    Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024

    Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024. URLhttps://arxiv.org/abs/2406.11837

  75. [75]

    Addressingrepresentationcollapseinvectorquantized models with one linear layer, 2025

    YongxinZhu,BochengLi,YifeiXin,ZhihuaXia,andLinliXu. Addressingrepresentationcollapseinvectorquantized models with one linear layer, 2025. URLhttps://arxiv.org/abs/2411.02038. 15 AIdealImplementation Details A.1 Tokenizer Training Details Overall, our tokenizer training recipe closely follows prior work VFMTok [70]. Since VFMTok uses a VFM with a patch siz...