IDEAL: In-DEpth ALignment Makes A Discrete Representation AutoEncoder
Pith reviewed 2026-06-27 13:08 UTC · model grok-4.3
The pith
Aligning quantized tokens with both shallow and deep VFM features produces discrete visual tokens that keep fidelity and semantics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28; when used for autoregressive image generation it produces a gFID of 1.89.
What carries the argument
The in-depth alignment framework that jointly aligns quantized tokens with shallow and deep VFM features.
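To make the mechanism concrete, here is a minimal NumPy sketch of what a joint shallow-plus-deep alignment objective could look like. It is not the paper's formulation, which this page does not reproduce. Everything in it is an assumption made for illustration: the nearest-neighbour quantizer, the linear projection heads `w_shallow` and `w_deep`, the cosine-distance loss, and the weights `lam_shallow` and `lam_deep`.

```python
import numpy as np


def quantize(z, codebook):
    """Nearest-neighbour vector quantization (assumed quantizer, not the paper's)."""
    # Squared Euclidean distance between every token and every code: shape (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx


def cosine_distance(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over matching rows."""
    pred_n = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (pred_n * target_n).sum(axis=1)))


def joint_alignment_loss(z, codebook, w_shallow, w_deep, f_shallow, f_deep,
                         lam_shallow=1.0, lam_deep=1.0):
    """Align the same quantized tokens with shallow and deep target features."""
    z_q, idx = quantize(z, codebook)
    # Hypothetical linear heads map quantized tokens into each feature space.
    loss_shallow = cosine_distance(z_q @ w_shallow, f_shallow)
    loss_deep = cosine_distance(z_q @ w_deep, f_deep)
    total = lam_shallow * loss_shallow + lam_deep * loss_deep
    return total, loss_shallow, loss_deep, idx


# Synthetic stand-ins for encoder tokens and VFM features (illustrative only).
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))            # 16 tokens, 8 dimensions
codebook = rng.normal(size=(32, 8))     # 32 codes
w_shallow = rng.normal(size=(8, 12))
w_deep = rng.normal(size=(8, 10))
f_shallow = rng.normal(size=(16, 12))   # stand-in for shallow VFM features
f_deep = rng.normal(size=(16, 10))      # stand-in for deep VFM features

total, l_s, l_d, idx = joint_alignment_loss(
    z, codebook, w_shallow, w_deep, f_shallow, f_deep)
print(f"shallow={l_s:.3f} deep={l_d:.3f} total={total:.3f}")
```

Setting `lam_shallow=0.0` in this sketch gives a deep-only objective, which is the kind of baseline the claim is measured against. A trainable version would also need a gradient path through the quantizer, which this forward-only sketch omits.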
Load-bearing premise
Shallow VFM features retain considerably richer local appearance and structural detail that can recover low-level information lost after discretizing deep features alone.
What would settle it
An experiment in which joint shallow-plus-deep alignment produces no rFID improvement over deep-only alignment on ImageNet would falsify the claim.
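The settling experiment comes down to comparing two rFID values, one for deep-only alignment and one for joint alignment. As a rough illustration, the Fréchet distance between Gaussian fits of two feature sets, which is the quantity FID reports, can be sketched in NumPy as follows. This is a sketch under stated assumptions: rFID in practice is computed on Inception features of original and reconstructed images, whereas the feature arrays here are synthetic stand-ins, and `frechet_distance` is a hypothetical helper name.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Synthetic features standing in for "original" and "reconstructed" images.
rng = np.random.default_rng(0)
originals = rng.normal(size=(500, 4))
same = frechet_distance(originals, originals)
shifted = frechet_distance(originals, originals + 3.0)
print(f"identical sets: {same:.6f}, mean-shifted sets: {shifted:.6f}")
```

In this toy setup the distance is essentially zero for identical sets and grows with the squared mean shift. The falsifying outcome described above would correspond to the joint-alignment tokenizer scoring no lower than the deep-only one under the same evaluation protocol.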
Original abstract
Built on pretrained vision foundation models (VFMs), representation autoencoders (RAEs) have recently emerged as a promising approach for constructing semantically rich latent spaces for image generation. However, their reconstruction quality often remains suboptimal, largely because deep VFM representations do not preserve sufficient fine-grained visual detail. This limitation becomes even more severe after discretization, where missing low-level information is difficult to recover. In fact, we observe that shallow VFM features retain considerably richer local appearance and structural detail, which complements the high-level semantics carried by deep features used in existing RAEs. Motivated by this complementary property, we propose Ideal, an In-depth Alignment framework for discrete representation autoencoding. By jointly aligning quantized tokens with both shallow and deep VFM features, Ideal enables the resulting discrete visual tokens to preserve both visual fidelity and rich semantics. Extensive experiments demonstrate that Ideal yields superior reconstruction performance, achieving 0.61 rFID on ImageNet and outperforming the previous best method by 0.28. When used for autoregressive image generation, Ideal further produces a gFID of 1.89, establishing a new state of the art for autoregressive image generation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces IDEAL, an in-depth alignment framework for discrete representation autoencoders built on pretrained vision foundation models. It claims that jointly aligning quantized tokens to both shallow and deep VFM features allows the discrete tokens to retain low-level visual detail (lost in deep-only discretization) alongside high-level semantics, yielding 0.61 rFID on ImageNet (0.28 better than prior best) and 1.89 gFID for autoregressive image generation.
Significance. If the empirical gains are robustly attributable to the proposed alignment rather than unstated factors, the work would meaningfully advance discrete latent representations for image generation by exploiting complementary information across VFM depths. The reported metrics represent substantial improvements over prior RAEs.
major comments (2)
- [Abstract] Performance numbers (0.61 rFID, 1.89 gFID) are stated without any experimental protocol, baseline details, ablation studies, or error analysis, so a reader cannot determine whether the gains stem from the joint shallow+deep alignment or from other factors.
- [Abstract] The central mechanism (that a single finite codebook can simultaneously encode both high-level semantics from deep features and fine-grained appearance from shallow features via joint alignment) is asserted without any equation for the alignment loss, quantization operator, or ablation isolating the shallow term.
Simulated Author's Rebuttal
We thank the referee for the detailed feedback. The concerns focus on the abstract's conciseness. The full manuscript (Sections 3 and 4, plus supplementary material) provides the experimental protocols, baselines, ablations, equations for the alignment loss and quantization, and analysis isolating the shallow term. We will make targeted revisions to the abstract for better context while preserving its brevity.
Point-by-point responses
- Referee: [Abstract] Performance numbers (0.61 rFID, 1.89 gFID) are stated without any experimental protocol, baseline details, ablation studies, or error analysis, so a reader cannot determine whether the gains stem from the joint shallow+deep alignment or from other factors.
Authors: The abstract summarizes key outcomes; full details appear in the manuscript. Section 4.1 describes the ImageNet training protocol, rFID/gFID computation, and comparison to prior discrete tokenizers (e.g., VQGAN, RQ-VAE). Section 4.3 and Table 3 contain ablations that isolate the shallow alignment contribution (0.15 rFID gain when added to the deep-only baseline). Multiple-run error bars are in Table 2. We will revise the abstract to add one sentence noting that the method is 'evaluated on ImageNet with standard rFID/gFID metrics and ablations confirming the joint alignment benefit.'
Revision: partial
- Referee: [Abstract] The central mechanism (that a single finite codebook can simultaneously encode both high-level semantics from deep features and fine-grained appearance from shallow features via joint alignment) is asserted without any equation for the alignment loss, quantization operator, or ablation isolating the shallow term.
Authors: Abstracts conventionally omit equations. The joint alignment objective (combining L_shallow and L_deep with the quantization operator Q) is defined in Equations (2)–(4) of Section 3.1. The ablation isolating the shallow term is in Table 3 (Section 4.3), demonstrating its necessity for low-level detail preservation. The abstract's description is therefore supported by the technical sections; we do not plan to insert equations into the abstract itself.
Revision: no
Circularity Check
No circularity; empirical method with external benchmarks
The paper describes an empirical training procedure that jointly aligns quantized tokens to shallow and deep VFM features, reporting reconstruction and generation metrics (rFID, gFID) on ImageNet. No equations, derivations, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains are invoked to justify uniqueness or load-bearing premises. The contribution rests on observable benchmark improvements rather than any self-referential mathematical step.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Estimating or propagating gradients through stochastic neurons for conditional computation, 2013
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation, 2013. URL https://arxiv.org/abs/1308.3432
Pith/arXiv arXiv 2013
-
[2]
Perception encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Bangalath, et al. Perception encoder: The best visual embeddings are not at the output of the network. In NeurIPS, 2025
2025
-
[3]
Perception encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. In NeurIPS, 2025
2025
-
[4]
Emerging properties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, October 2021
2021
-
[5]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In CVPR, 2022
2022
-
[6]
Yitong Chen, Lingchen Meng, Wujian Peng, Zuxuan Wu, and Yu-Gang Jiang. Comp: Continual multimodal pre-training for vision foundation models. arXiv preprint arXiv:2503.18931, 2025
arXiv 2025
-
[7]
Vision transformers need registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024
2024
-
[8]
ImageNet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi: 10.1109/CVPR.2009.5206848
-
[9]
Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, and Chun Yuan. Vqrae: Representation quantization autoencoders for multimodal understanding, generation and reconstruction, 2025. URL https://arxiv.org/abs/2511.23386
arXiv 2025
-
[10]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021
2021
-
[11]
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023
Pith/arXiv arXiv 2023
-
[12]
One layer is enough: Adapting pretrained visual encoders for image generation, 2025
Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation, 2025. URL https://arxiv.org/abs/2512.07829
arXiv 2025
-
[13]
Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022
Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022. URL https://arxiv.org/abs/2204.03638
arXiv 2022
-
[14]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017
2017
-
[15]
beta-vae: Learning basic visual concepts with a constrained variational framework
Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016. URL https://api.semanticscholar.org/CorpusID:46798026
2016
-
[16]
Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019
Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019. URL https://arxiv.org/abs/1902.09506
Pith/arXiv arXiv 2019
-
[17]
Image-to-image translation with conditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017
2017
-
[18]
Dino-tok: Adapting dino for visual tokenizers, 2025
Mingkai Jia, Mingxiao Li, Liaoyuan Fan, Tianxing Shi, Jiaxin Guo, Zeming Li, Xiaoyang Guo, Xiao-Xiao Long, Qian Zhang, Ping Tan, and Wei Yin. Dino-tok: Adapting dino for visual tokenizers, 2025. URL https://arxiv.org/abs/2511.20565
arXiv 2025
-
[19]
Product quantization for nearest neighbor search
Herve Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011. doi: 10.1109/TPAMI.2010.57
-
[20]
Auto-encoding variational bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014
2014
-
[21]
Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128, 03 2020. doi...
-
[22]
Improved precision and recall metric for assessing generative models, 2019
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models, 2019. URL https://arxiv.org/abs/1904.06991
arXiv 2019
-
[23]
Autoregressive image generation using residual quantization, 2022
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization, 2022. URL https://arxiv.org/abs/2203.01941
arXiv 2022
-
[24]
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023
Pith/arXiv arXiv 2023
-
[25]
Imagefolder: Autoregressive image generation with folded tokens, 2024
Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens, 2024. URL https://arxiv.org/abs/2410.01756
arXiv 2024
-
[26]
Evaluating object hallucination in large vision-language models, 2023
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models, 2023. URL https://arxiv.org/abs/2305.10355
Pith/arXiv arXiv 2023
-
[27]
Decoupled weight decay regularization, 2019
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101
Pith/arXiv arXiv 2019
-
[28]
Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024
arXiv 2024
-
[29]
Unitok: A unified tokenizer for visual generation and understanding, 2025
Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding, 2025. URL https://arxiv.org/abs/2502.20321
arXiv 2025
-
[30]
Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers, 2024. URL https://arxiv.org/abs/2401.08740
arXiv 2024
-
[31]
Chartqa: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Xuan Long Do, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. In ACL Findings, 2022
2022
-
[32]
Docvqa: A dataset for vqa on document images
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In WACV, 2021
2021
-
[33]
Finite scalar quantization: Vq-vae made simple, 2023
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple, 2023. URL https://arxiv.org/abs/2309.15505
Pith/arXiv arXiv 2023
-
[34]
Dinov2: Learning robust visual features without supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024
2024
-
[35]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023
2023
-
[36]
Learning transferable visual models from natural language supervision, 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. URL https://arxiv.org/abs/2103.00020
Pith/arXiv arXiv 2021
-
[37]
Generating diverse high-fidelity images with vq-vae-2, 2019
Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2, 2019
2019
-
[38]
Beyond next-token: Next-x prediction for autoregressive visual generation, 2025
Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Beyond next-token: Next-x prediction for autoregressive visual generation, 2025. URL https://arxiv.org/abs/2502.20388
arXiv 2025
-
[39]
High-resolution image synthesis with latent diffusion models, 2022
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. URL https://arxiv.org/abs/2112.10752
Pith/arXiv arXiv 2022
-
[40]
Improved techniques for training gans
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NeurIPS, 2016
2016
-
[41]
A-okvqa: A benchmark for visual question answering using world knowledge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In ECCV, 2022
2022
-
[42]
Latent diffusion model without variational autoencoder
Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. In ICLR, 2026
2026
-
[43]
Dinov3
Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025
Pith/arXiv arXiv 2025
-
[44]
Towards vqa models that can read, 2019
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019. URL https://arxiv.org/abs/1904.08920
Pith/arXiv arXiv 2019
-
[45]
Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies
Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. In ICLR, 2026
2026
-
[46]
Roformer: Enhanced transformer with rotary position embedding, 2023
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023. URL https://arxiv.org/abs/2104.09864
Pith/arXiv arXiv 2023
-
[47]
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024
Pith/arXiv arXiv 2024
-
[48]
Visual autoregressive modeling: Scalable image generation via next-scale prediction
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. In NeurIPS, 2024
2024
-
[49]
Cambrian-1: A fully open, vision-centric exploration of multimodal llms
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai C Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. In NeurIPS, 2024
2024
-
[50]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
Pith/arXiv arXiv 2025
-
[51]
Neural discrete representation learning
Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017
2017
-
[52]
Neural discrete representation learning, 2018
Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. URL https://arxiv.org/abs/1711.00937
Pith/arXiv arXiv 2018
-
[53]
Omnitokenizer: A joint image-video tokenizer for visual generation
Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. In NeurIPS, 2024
2024
-
[54]
Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv preprint arXiv:2504.11455, 2025
arXiv 2025
-
[55]
Omnigen-ar: Autoregressive any-to-image generation
Junke Wang, Xun Wang, Qiushan Guo, Peize Sun, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Omnigen-ar: Autoregressive any-to-image generation. In NeurIPS, 2025
2025
-
[56]
Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think, 2025. URL https://arxiv.org/abs/2507.01467
arXiv 2025
-
[57]
Grok-1.5 vision preview, 2024
xAI Team. Grok-1.5 vision preview, 2024. URL https://x.ai/blog/grok-1.5v
2024
-
[58]
Vision transformer with deformable attention
Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. URL https://arxiv.org/abs/2201.00520
-
[60]
Videogpt: Video generation using vq-vae and transformers, 2021
Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers, 2021
2021
-
[61]
Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024
Jingfeng Yao, Wang Cheng, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification, 2024. URL https://arxiv.org/abs/2410.10356
arXiv 2024
-
[62]
Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[63]
Vector-quantized image modeling with improved vqgan, 2022
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan, 2022. URL https://arxiv.org/abs/2110.04627
Pith/arXiv arXiv 2022
-
[64]
Scaling autoregressive models for content-rich text-to-image generation, 2022
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation, 2022. URL https://arxiv.org/abs/2206.10789
Pith/arXiv arXiv 2022
-
[65]
Language model beats diffusion–tokenizer is key to visual generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion–tokenizer is key to visual generation. In ICLR, 2024
2024
-
[66]
An image is worth 32 tokens for reconstruction and generation
Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. In NeurIPS, 2024
2024
-
[67]
Representation alignment for generation: Training diffusion transformers is easier than you think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2024
2024
-
[68]
Sigmoid loss for language image pre-training, 2023
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023
2023
-
[69]
The unreasonable effectiveness of deep features as a perceptual metric
Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018
2018
-
[70]
Spherical leech quantization for visual tokenization and generation, 2025
Yue Zhao, Hanwen Jiang, Zhenlin Xu, Chutong Yang, Ehsan Adeli, and Philipp Krähenbühl. Spherical leech quantization for visual tokenization and generation, 2025. URL https://arxiv.org/abs/2512.14697
arXiv 2025
-
[71]
Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025
Anlin Zheng, Xin Wen, Xuanyang Zhang, Chuofan Ma, Tiancai Wang, Gang Yu, Xiangyu Zhang, and Xiaojuan Qi. Vision foundation models as effective visual tokenizers for autoregressive image generation, 2025. URL https://arxiv.org/abs/2507.08441
arXiv 2025
-
[72]
Diffusion transformers with representation autoencoders, 2025
Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders, 2025. URL https://arxiv.org/abs/2510.11690
Pith/arXiv arXiv 2025
-
[73]
Fast training of diffusion models with masked transformers, 2024
Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers, 2024. URL https://arxiv.org/abs/2306.09305
arXiv 2024
-
[74]
Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024
Lei Zhu, Fangyun Wei, Yanye Lu, and Dong Chen. Scaling the codebook size of vqgan to 100,000 with a utilization rate of 99%, 2024. URL https://arxiv.org/abs/2406.11837
arXiv 2024
-
[75]
Addressing representation collapse in vector quantized models with one linear layer, 2025
Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, and Linli Xu. Addressing representation collapse in vector quantized models with one linear layer, 2025. URL https://arxiv.org/abs/2411.02038
arXiv 2025
A Ideal Implementation Details
A.1 Tokenizer Training Details
Overall, our tokenizer training recipe closely follows prior work VFMTok [70]. Since VFMTok uses a VFM with a patch siz...