arxiv: 2605.18390 · v1 · pith:TG3ZPEBZnew · submitted 2026-05-18 · 💻 cs.CV

Vision Foundation Models as Generalist Tokenizers for Image Generation

Pith reviewed 2026-05-20 10:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision foundation models · image tokenization · autoregressive generation · image synthesis · latent quantization · semantic reconstruction · ImageNet class-conditional

The pith

A frozen vision foundation model can be used directly as the encoder for a generalist image tokenizer that operates in both discrete and continuous spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that a vision foundation model pre-trained with global contrastive learning and latent masked image modeling provides representations that work well as the basis for image tokenization without any encoder fine-tuning. The authors add a region-adaptive quantization step to cut spatial redundancy in the 2D feature grid and a semantic reconstruction objective that keeps decoded outputs aligned with the original VFM features. These changes produce VFMTok, a tokenizer usable for autoregressive models in discrete token spaces and for denoising models in continuous spaces. A sympathetic reader would care because the approach yields faster model convergence, fewer tokens, and strong generation metrics on ImageNet while removing the need for classifier-free guidance during inference.

Core claim

VFMTok is built by taking a frozen VFM as encoder and adding region-adaptive quantization to remove spatial redundancy from 2D grid features together with a semantic reconstruction objective that aligns decoded outputs with VFM representations. This produces a generalist tokenizer that works seamlessly in discrete latent spaces for autoregressive generation and in continuous spaces for denoising-based generation. On ImageNet class-conditional synthesis the discrete version reaches a gFID of 1.36 with three times faster convergence while the continuous version reaches 1.25 gFID; both achieve high-fidelity results without classifier-free guidance.
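
To make the discrete versus continuous distinction concrete, the following is a minimal sketch of nearest-codebook lookup, the generic vector-quantization step that maps a continuous feature to a token id. The function name `quantize`, the toy codebook, and the squared-Euclidean metric are illustrative assumptions; this summary does not specify VFMTok's codebook or distance measure.

```python
from typing import List, Sequence, Tuple


def quantize(
    feature: Sequence[float], codebook: Sequence[Sequence[float]]
) -> Tuple[int, List[float]]:
    """Return (token_id, codebook_vector) for the nearest entry (squared L2)."""
    best_id, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((f - c) ** 2 for f, c in zip(feature, code))
        if dist < best_dist:
            best_id, best_dist = idx, dist
    return best_id, list(codebook[best_id])


# Hypothetical 2-D codebook with three entries (illustration only).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical region features, standing in for tokenizer outputs.
region_features = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

# Discrete path: an autoregressive model would be trained on these ids.
token_ids = [quantize(f, codebook)[0] for f in region_features]

# Continuous path: a denoising model would consume the features unquantized.
continuous_latents = region_features
```

In this sketch the same features feed both paths; only the discrete path passes through the codebook, which is one way to read the "generalist" claim.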

What carries the argument

Region-adaptive quantization framework paired with a semantic reconstruction objective applied to features from a frozen vision foundation model encoder, which removes spatial redundancy while preserving semantic fidelity for downstream generation.
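
One plausible form of a semantic reconstruction objective is a cosine-distance loss between decoded features and the frozen VFM's features at each position. The sketch below is written under that assumption; the exact loss used by the authors is not stated in this summary, and the names `cosine_similarity` and `semantic_reconstruction_loss` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_reconstruction_loss(
    decoded: Sequence[Sequence[float]], target: Sequence[Sequence[float]]
) -> float:
    """Mean (1 - cosine similarity) over positions.

    `decoded` stands in for features predicted by the tokenizer's decoder;
    `target` stands in for features from the frozen VFM (assumed, not confirmed).
    """
    if len(decoded) != len(target) or not decoded:
        raise ValueError("decoded and target must be non-empty and equal length")
    total = sum(1.0 - cosine_similarity(d, t) for d, t in zip(decoded, target))
    return total / len(decoded)
```

Because the target features come from a frozen encoder, a loss of this kind would send gradients only to the quantizer and decoder, which matches the frozen-encoder setting described above.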

If this is right

  • Discrete autoregressive generators converge three times faster.
  • Class-conditional synthesis reaches a gFID of 1.36 on ImageNet.
  • Continuous-space generation with a denoising model reaches a gFID of 1.25.
  • High-fidelity synthesis succeeds without classifier-free guidance in both paradigms.
  • Tokenizer quality depends on the exact combination of self-supervised objectives used in VFM pre-training.
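
The inference-speed consequence of dropping classifier-free guidance can be seen from the standard CFG combination rule, which needs both a conditional and an unconditional prediction per step. The sketch below shows that generic rule; the function names are hypothetical and nothing here is taken from the paper's code.

```python
from typing import List, Sequence


def cfg_combine(
    cond: Sequence[float], uncond: Sequence[float], scale: float
) -> List[float]:
    """Standard CFG rule: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def predictions_per_step(use_cfg: bool) -> int:
    """CFG needs a conditional and an unconditional prediction; without it, one."""
    return 2 if use_cfg else 1
```

With `scale = 1.0` the rule returns the conditional prediction unchanged, so a model that already samples well at that setting can skip the unconditional branch and roughly halve the per-step network evaluations.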

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-VFM approach could be tested on video or 3D data to see whether region-adaptive quantization still reduces redundancy effectively.
  • Smaller generative models paired with VFMTok might preserve quality while using even fewer parameters overall.
  • Out-of-distribution images could be used to measure how much the semantic reconstruction objective protects against domain shift.
  • Future tokenizers might be designed by first selecting VFM pre-training objectives that maximize downstream generation metrics rather than designing new quantization schemes from scratch.

Load-bearing premise

Representations from a VFM pre-trained with global contrastive learning plus latent masked image modeling stay optimal for tokenization and generation without any encoder fine-tuning or adaptation.

What would settle it

Train an otherwise identical tokenizer using a VFM pre-trained with only one of the two objectives (contrastive learning or latent masked image modeling) and check whether gFID rises above 1.36 or convergence slows below the reported three-fold speedup on the same ImageNet class-conditional task.
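
The decision rule in this test can be written down explicitly. The sketch below encodes it under two assumptions: speedup is measured as the ratio of training steps a reference tokenizer needs to the steps the ablated tokenizer needs to reach the same target gFID, and the thresholds are the 1.36 gFID and three-fold figures quoted above. The class and function names are hypothetical.

```python
from dataclasses import dataclass

REPORTED_GFID = 1.36     # discrete AR result quoted in this summary
REPORTED_SPEEDUP = 3.0   # convergence speedup quoted in this summary


@dataclass
class AblationResult:
    gfid: float            # gFID of the single-objective variant
    steps_to_target: int   # training steps to reach a fixed target gFID


def speedup(reference_steps: int, result: AblationResult) -> float:
    """Ratio of reference steps to ablated steps (assumed definition)."""
    return reference_steps / result.steps_to_target


def supports_joint_objective_claim(
    result: AblationResult, reference_steps: int
) -> bool:
    """True if the single-objective VFM is worse on either criterion."""
    worse_quality = result.gfid > REPORTED_GFID
    slower = speedup(reference_steps, result) < REPORTED_SPEEDUP
    return worse_quality or slower
```

A result that returns `False` for both single-objective variants would undercut the claim that the joint objective is what makes the VFM a good tokenizer.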

Figures

Figures reproduced from arXiv: 2605.18390 by Anlin Zheng, Chuofan Ma, Gang Yu, Lanxi Gong, Qi Han, Xiangyu Zhang, Xiaojuan Qi, Xin Wen.

Figure 1: VFMTok introduces novel features, including: a).
Figure 2: The framework of VFMTok/VFMAE. VFMTok/VFMAE utilizes a frozen VFM to extract multi-level image features. A deformable Transformer
Figure 7: For optimal clarity, please zoom in.
Figure 3: Autoregressive class-conditional image generation with classifier-free guidance (CFG).
Figure 4: Autoregressive class-conditional image generation without classifier-free guidance (CFG).

Original abstract

In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VFMTok, a generalist visual tokenizer built atop a frozen vision foundation model (VFM) encoder. It introduces region-adaptive quantization to reduce spatial redundancy in 2D grid features and a semantic reconstruction objective to align decoded outputs with VFM representations. VFMTok supports both discrete and continuous latent spaces for image generation, reporting SOTA gFID of 1.36 on ImageNet class-conditional discrete AR synthesis (with 3x faster convergence) and 1.25 for continuous denoising models, plus CFG-free generation due to rich semantics. The work also investigates SSL objectives, finding that global contrastive learning combined with latent masked image modeling yields optimal VFM representations for tokenization.

Significance. If the empirical claims hold under full verification, the results indicate that frozen VFMs can serve as effective generalist tokenizers with targeted quantization and reconstruction losses, yielding substantial gains in synthesis quality, token efficiency, and inference speed. The finding on SSL objective combinations provides concrete guidance for selecting pre-trained encoders in future tokenizer designs and could reduce the need for end-to-end training of visual encoders in generative pipelines.

major comments (2)
  1. [Abstract / VFM pre-training objectives paragraph] Abstract and § on VFM pre-training objectives: the central claim that a frozen VFM (pre-trained with global contrastive + latent MIM) remains optimal for tokenization without encoder fine-tuning is load-bearing for the reported gFID 1.36/1.25 and 3x convergence, yet no ablation compares this to joint adaptation of the encoder with the region-adaptive quantization and semantic reconstruction objectives. If joint fine-tuning better preserves spatial semantics, the efficiency and quality gains may not represent the strongest instantiation.
  2. [Experiments / Results tables] Experiments section (results on ImageNet AR and continuous generation): the gFID scores and convergence claims lack reported error bars, number of runs, or full baseline comparisons (including recent tokenizers and fine-tuned VFM variants), making it difficult to assess whether the 1.36 gFID and 3x speedup are robust or sensitive to implementation details.
minor comments (2)
  1. [Method] Notation for region-adaptive quantization could be clarified with an explicit equation or diagram showing how patch selection varies per region.
  2. [Discussion] The manuscript would benefit from a dedicated limitations paragraph discussing potential failure modes when the VFM's pre-training data distribution differs from the target generation dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. Below we respond to each major comment, outlining our planned revisions where appropriate.

  1. Referee: [Abstract / VFM pre-training objectives paragraph] Abstract and § on VFM pre-training objectives: the central claim that a frozen VFM (pre-trained with global contrastive + latent MIM) remains optimal for tokenization without encoder fine-tuning is load-bearing for the reported gFID 1.36/1.25 and 3x convergence, yet no ablation compares this to joint adaptation of the encoder with the region-adaptive quantization and semantic reconstruction objectives. If joint fine-tuning better preserves spatial semantics, the efficiency and quality gains may not represent the strongest instantiation.

    Authors: We thank the referee for this observation. The manuscript's focus is on demonstrating that frozen VFMs, without any encoder fine-tuning, can serve as effective generalist tokenizers when combined with our proposed region-adaptive quantization and semantic reconstruction. This design choice emphasizes efficiency and the reusability of pre-trained models. While we acknowledge that joint fine-tuning could potentially yield further improvements, it would deviate from the generalist and frozen paradigm we aim to explore. In the revision, we will add a paragraph discussing this limitation and why the frozen setting is of particular interest, including references to works that do perform fine-tuning. This constitutes a partial revision as we will enhance the discussion but not conduct new joint fine-tuning experiments at this stage. revision: partial

  2. Referee: [Experiments / Results tables] Experiments section (results on ImageNet AR and continuous generation): the gFID scores and convergence claims lack reported error bars, number of runs, or full baseline comparisons (including recent tokenizers and fine-tuned VFM variants), making it difficult to assess whether the 1.36 gFID and 3x speedup are robust or sensitive to implementation details.

    Authors: We agree that providing error bars and details on the number of runs would enhance the credibility of the empirical results. In the revised manuscript, we will report the mean and standard deviation of gFID scores over multiple runs (specifically, we will run the experiments three times and include the statistics). We will also clarify the convergence speed measurements. For baseline comparisons, we have compared against several established tokenizers; we will expand the experimental section to include more recent methods and add a note on fine-tuned VFM variants, explaining that our work prioritizes the frozen case. These changes will be incorporated in the next version. revision: yes
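
The statistic promised in the second response is a mean and sample standard deviation over repeated runs. A minimal sketch of that computation follows; the input values are obvious placeholders chosen for easy checking and are NOT results from the paper, and `summarize_runs` is a hypothetical helper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(gfids: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of gFID across runs."""
    if len(gfids) < 2:
        raise ValueError("need at least two runs for a standard deviation")
    return statistics.mean(gfids), statistics.stdev(gfids)


# Placeholder values for illustration only; not measurements.
runs = [1.0, 2.0, 3.0]
mean, std = summarize_runs(runs)
report = f"gFID = {mean:.2f} ± {std:.2f} (n={len(runs)})"
```

With only three runs the sample standard deviation is itself a noisy estimate, so reporting `n` alongside it, as the format string does, helps readers judge how much weight the error bar can bear.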

Circularity Check

0 steps flagged

No circularity: empirical tokenizer design and objective investigation are self-contained

The paper's central results rest on constructing VFMTok atop a frozen VFM encoder, applying region-adaptive quantization and a semantic reconstruction loss, then reporting downstream gFID, convergence speed, and CFG-free generation metrics on ImageNet. These are external, falsifiable benchmarks rather than quantities derived from the paper's own equations or fitted parameters. The investigation into which VFM pre-training objectives (global contrastive + latent MIM) yield better tokenizers is likewise an empirical comparison across frozen models, not a self-referential reduction or self-citation chain. No load-bearing step equates a claimed prediction to its input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central approach rests on the domain assumption that frozen VFM representations are directly usable for tokenization once augmented with the two proposed components; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)
  • domain assumption: Representations from a frozen vision foundation model pre-trained with global contrastive learning and latent masked image modeling are suitable and optimal for building a generalist image tokenizer.
    This premise is invoked when the encoder is kept frozen and when the authors conclude that specific SSL objectives dictate tokenizer effectiveness.

pith-pipeline@v0.9.0 · 5858 in / 1412 out tokens · 55035 ms · 2026-05-20T10:57:32.776159+00:00 · methodology

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 25 internal anchors

  1. [1]

    Building Normalizing Flows with Stochastic Interpolants

    Michael S Albergo and Eric Vanden-Eijnden. Building nor- malizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022

  2. [2]

    Flextok: Resampling images into 1d to- ken sequences of flexible length

    Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O˘ guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Za- mir, and Afshin Dehghan. Flextok: Resampling images into 1d to- ken sequences of flexible length. arXiv preprint arXiv:2502.13967, 2025

  3. [3]

    Autoencoders

    Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. Machine learning for data science handbook: data mining and knowledge discovery handbook, pages 353–374, 2023

  4. [4]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Esti- mating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

  5. [5]

    VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

    Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025

  6. [6]

    Understanding disentangling in $\beta$-VAE

    Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Under- standing disentangling in β-vae. arXiv preprint arXiv:1804.03599, 2018

  7. [7]

    Emerging proper- ties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging proper- ties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021

  8. [8]

    Maskgit: Masked generative image transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

  9. [9]

    arXiv preprint arXiv:2509.25162 (2025) 4

    Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025

  10. [10]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020

  11. [11]

    Improved Baselines with Momentum Contrastive Learning

    Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Im- proved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020

  12. [12]

    Detection in crowded scenes: One proposal, multiple predictions

    Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, and Jian Sun. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12214–12223, 2020

  13. [13]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017

  14. [14]

    Vision Transformers Need Registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bo- janowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

  15. [15]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei- Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

  16. [16]

    Bert: Pre-training of deep bidirectional transform- ers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transform- ers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019

  17. [17]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021

  18. [18]

    An introduction to variational autoencoders

    P Kingma Diederik and Welling Max. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019

  19. [19]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  20. [21]

    Scaling rectified flow transformers for high- resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high- resolution image synthesis. In Forty-first international conference on machine learning, 2024

  21. [22]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021

  22. [23]

    One layer is enough: Adapting pretrained visual encoders for image generation.arXiv preprint arXiv:2512.07829, 2025

    Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2024

  23. [24]

    Generative adversarial networks

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun Acm, 2020

  24. [25]

    Bootstrap your own latent-a new approach to self-supervised learning

    Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020

  25. [26]

    Learnings from scaling visual tokenizers for reconstruction and generation

    Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025

  26. [27]

    Masked autoencoders are scalable vision learn- ers

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learn- ers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

  27. [28]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Gir- shick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019

  28. [29]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

  29. [30]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017

  30. [31]

    Burgess, Xavier Glorot, Matthew M

    Irina Higgins, Loïc Matthey, Arka Pal, Christopher P . Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016

  31. [32]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

  32. [33]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017

  33. [34]

    Image-to-image translation with conditional adversarial networks

    Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017

  34. [35]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

  35. [36]

    Guiding a diffusion model with a bad version of itself

    Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehti- nen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024

  36. [37]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

  37. [38]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  38. [39]

    Eq-vae: Equivariance regularized latent space for improved generative image modeling.arXiv preprint arXiv:2502.09509,

    Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025. 14

  39. [40]

    Boosting generative image modeling via joint image-feature synthesis

    Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025

  40. [41]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

    Alina Kuznetsova, Mohamad Hassan Mohamad Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Com...

  41. [42]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehti- nen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019

  42. [43]

    Autoregressive image generation using residual quantization

    Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook- Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

  43. [44]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end- to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

  44. [45]

    Autoregres- sive image generation without vector quantization.arXiv preprint arXiv:2406.11838,

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024

  45. [46]

    Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

    Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024

  46. [47]

    Feature pyramid networks for object detection

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

  47. [48]

    Focal loss for dense object detection

    Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017

  48. [49]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  49. [50]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  50. [51]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  51. [52]

    Open-magvit2: An open-source project toward democratizing auto-regressive visual gener- ation

    Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024

  52. [53]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

  53. [54]

    Finite Scalar Quantization: VQ-VAE Made Simple

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

  54. [55]

    Conditional Generative Adversarial Nets

    Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014

  55. [56]

    One-d-piece: Image tokenizer meets quality- controllable compression

    Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-d-piece: Image tokenizer meets quality- controllable compression. arXiv preprint arXiv:2501.10064, 2025

  56. [57]

    Improved denois- ing diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denois- ing diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

  57. [58]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labat...

  58. [59]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

  59. [60]

    Tokenflow: Unified image tokenizer for multimodal understanding and generation

    Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

  60. [61]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  61. [62]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015

  62. [63]

    Improving language understanding by generative pre-training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018

  63. [64]

    Generating diverse high-fidelity images with vq-vae-2

    Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in Neural Information Processing Systems, 32, 2019

  64. [65]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  65. [66]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022

  66. [67]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

  67. [68]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016

  68. [69]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  69. [70]

    DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

    Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025

  70. [71]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  71. [72]

    Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

  72. [73]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016

  73. [74]

    Unilip: Adapting clip for unified multimodal understanding, generation and editing

    Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025

  74. [75]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024

  75. [76]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  76. [77]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

  77. [78]

    Conditional image generation with pixelcnn decoders

    Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems, 29, 2016

  78. [79]

    Neural discrete representation learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017

  79. [80]

    Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

    Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010

  80. [81]

    Ddt: Decoupled diffusion transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025

Showing first 80 references.