Vision Foundation Models as Generalist Tokenizers for Image Generation
Pith reviewed 2026-05-20 10:57 UTC · model grok-4.3
The pith
A frozen vision foundation model can be used directly as the encoder for a generalist image tokenizer that operates in both discrete and continuous spaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VFMTok is built by taking a frozen VFM as encoder and adding region-adaptive quantization to remove spatial redundancy from 2D grid features together with a semantic reconstruction objective that aligns decoded outputs with VFM representations. This produces a generalist tokenizer that works seamlessly in discrete latent spaces for autoregressive generation and in continuous spaces for denoising-based generation. On ImageNet class-conditional synthesis the discrete version reaches a gFID of 1.36 with three times faster convergence while the continuous version reaches 1.25 gFID; both achieve high-fidelity results without classifier-free guidance.
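To make the semantic reconstruction objective concrete, here is a minimal sketch. It assumes the loss is the mean cosine distance between features decoded from the tokens and the frozen VFM's features at matching positions. This summary does not state the exact form of the loss, so the cosine choice and the names `cosine_distance` and `semantic_reconstruction_loss` are assumptions made for illustration only.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cos(a, b); 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def semantic_reconstruction_loss(
    decoded: Sequence[Vector], target: Sequence[Vector]
) -> float:
    """Mean cosine distance between decoded features and frozen-VFM features.

    Assumption: `decoded` and `target` hold one feature vector per position,
    in the same order. The target features come from the frozen encoder and
    would receive no gradient in a real training loop.
    """
    if len(decoded) != len(target) or not decoded:
        raise ValueError("need two non-empty feature lists of equal length")
    total = sum(cosine_distance(d, t) for d, t in zip(decoded, target))
    return total / len(decoded)
```

A loss of 0 means the decoded features are perfectly aligned with the VFM's; orthogonal features give a loss of 1.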
What carries the argument
Region-adaptive quantization framework paired with a semantic reconstruction objective applied to features from a frozen vision foundation model encoder, which removes spatial redundancy while preserving semantic fidelity for downstream generation.
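The discrete path can be pictured with a generic nearest-codebook lookup. The sketch below is plain vector quantization, not the paper's region-adaptive scheme: how regions are formed and sampled is not described in this summary, so the per-region feature vectors are simply taken as given. The function names and the toy codebook are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature vector to its nearest codebook entry.

    Returns the token indices (what an autoregressive model would predict)
    and the corresponding codebook vectors (what a decoder would consume).
    """
    if not codebook:
        raise ValueError("codebook must contain at least one entry")
    indices = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    quantized = [list(codebook[i]) for i in indices]
    return indices, quantized


# Toy usage: three (assumed) region features, a three-entry codebook.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
toy_features = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.2]]
toy_indices, toy_quantized = quantize(toy_features, toy_codebook)
```

In the continuous setting described in the summary, the lookup step would be skipped and the feature vectors passed on directly.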
If this is right
- Discrete autoregressive generators converge three times faster.
- Class-conditional synthesis reaches a gFID of 1.36 on ImageNet.
- Continuous-space generation with a denoising model reaches a gFID of 1.25.
- High-fidelity synthesis succeeds without classifier-free guidance in both paradigms.
- Tokenizer quality depends on the exact combination of self-supervised objectives used in VFM pre-training.
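For context on the guidance-free claim: in its commonly used form, classifier-free guidance mixes a conditional and an unconditional model prediction, so each sampling step needs two forward passes. The sketch below shows that standard combination under the convention where a scale of 1 returns the conditional prediction unchanged; the function name and the toy numbers are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def cfg_combine(
    cond: Sequence[float], uncond: Sequence[float], scale: float
) -> List[float]:
    """Standard CFG mix: uncond + scale * (cond - uncond).

    With scale == 1.0 the result equals `cond`, so the unconditional
    forward pass is unnecessary and can be skipped entirely.
    """
    if len(cond) != len(uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy predictions for a two-dimensional output.
guided = cfg_combine([1.0, 2.0], [0.0, 1.0], scale=2.0)
unguided = cfg_combine([1.0, 2.0], [0.0, 1.0], scale=1.0)
```

Sampling without CFG corresponds to the `scale=1.0` case, which is where the inference-speed benefit mentioned in the abstract would come from.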
Where Pith is reading between the lines
- The same frozen-VFM approach could be tested on video or 3D data to see whether region-adaptive quantization still reduces redundancy effectively.
- Smaller generative models paired with VFMTok might preserve quality while using even fewer parameters overall.
- Out-of-distribution images could be used to measure how much the semantic reconstruction objective protects against domain shift.
- Future tokenizers might be designed by first selecting VFM pre-training objectives that maximize downstream generation metrics rather than designing new quantization schemes from scratch.
Load-bearing premise
Representations from a VFM pre-trained with global contrastive learning plus latent masked image modeling stay optimal for tokenization and generation without any encoder fine-tuning or adaptation.
What would settle it
Train an otherwise identical tokenizer using a VFM pre-trained with only one of the two objectives (contrastive learning or latent masked image modeling) and check whether gFID rises above 1.36 or convergence slows below the reported three-fold speedup on the same ImageNet class-conditional task.
Original abstract
In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by 3 times and achieves a state-of-the-art gFID of 1.36 on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of 1.25. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (w/o CFG) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes VFMTok, a generalist visual tokenizer built atop a frozen vision foundation model (VFM) encoder. It introduces region-adaptive quantization to reduce spatial redundancy in 2D grid features and a semantic reconstruction objective to align decoded outputs with VFM representations. VFMTok supports both discrete and continuous latent spaces for image generation, reporting SOTA gFID of 1.36 on ImageNet class-conditional discrete AR synthesis (with 3x faster convergence) and 1.25 for continuous denoising models, plus CFG-free generation due to rich semantics. The work also investigates SSL objectives, finding that global contrastive learning combined with latent masked image modeling yields optimal VFM representations for tokenization.
Significance. If the empirical claims hold under full verification, the results indicate that frozen VFMs can serve as effective generalist tokenizers with targeted quantization and reconstruction losses, yielding substantial gains in synthesis quality, token efficiency, and inference speed. The finding on SSL objective combinations provides concrete guidance for selecting pre-trained encoders in future tokenizer designs and could reduce the need for end-to-end training of visual encoders in generative pipelines.
major comments (2)
- [Abstract / VFM pre-training objectives paragraph] Abstract and § on VFM pre-training objectives: the central claim that a frozen VFM (pre-trained with global contrastive + latent MIM) remains optimal for tokenization without encoder fine-tuning is load-bearing for the reported gFID 1.36/1.25 and 3x convergence, yet no ablation compares this to joint adaptation of the encoder with the region-adaptive quantization and semantic reconstruction objectives. If joint fine-tuning better preserves spatial semantics, the efficiency and quality gains may not represent the strongest instantiation.
- [Experiments / Results tables] Experiments section (results on ImageNet AR and continuous generation): the gFID scores and convergence claims lack reported error bars, number of runs, or full baseline comparisons (including recent tokenizers and fine-tuned VFM variants), making it difficult to assess whether the 1.36 gFID and 3x speedup are robust or sensitive to implementation details.
minor comments (2)
- [Method] Notation for region-adaptive quantization could be clarified with an explicit equation or diagram showing how patch selection varies per region.
- [Discussion] The manuscript would benefit from a dedicated limitations paragraph discussing potential failure modes when the VFM's pre-training data distribution differs from the target generation dataset.
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and constructive suggestions. Below we respond to each major comment, outlining our planned revisions where appropriate.
-
Referee: [Abstract / VFM pre-training objectives paragraph] Abstract and § on VFM pre-training objectives: the central claim that a frozen VFM (pre-trained with global contrastive + latent MIM) remains optimal for tokenization without encoder fine-tuning is load-bearing for the reported gFID 1.36/1.25 and 3x convergence, yet no ablation compares this to joint adaptation of the encoder with the region-adaptive quantization and semantic reconstruction objectives. If joint fine-tuning better preserves spatial semantics, the efficiency and quality gains may not represent the strongest instantiation.
Authors: We thank the referee for this observation. The manuscript's focus is on demonstrating that frozen VFMs, without any encoder fine-tuning, can serve as effective generalist tokenizers when combined with our proposed region-adaptive quantization and semantic reconstruction. This design choice emphasizes efficiency and the reusability of pre-trained models. While we acknowledge that joint fine-tuning could potentially yield further improvements, it would deviate from the generalist and frozen paradigm we aim to explore. In the revision, we will add a paragraph discussing this limitation and why the frozen setting is of particular interest, including references to works that do perform fine-tuning. This constitutes a partial revision as we will enhance the discussion but not conduct new joint fine-tuning experiments at this stage.
Revision: partial
-
Referee: [Experiments / Results tables] Experiments section (results on ImageNet AR and continuous generation): the gFID scores and convergence claims lack reported error bars, number of runs, or full baseline comparisons (including recent tokenizers and fine-tuned VFM variants), making it difficult to assess whether the 1.36 gFID and 3x speedup are robust or sensitive to implementation details.
Authors: We agree that providing error bars and details on the number of runs would enhance the credibility of the empirical results. In the revised manuscript, we will report the mean and standard deviation of gFID scores over multiple runs (specifically, we will run the experiments three times and include the statistics). We will also clarify the convergence speed measurements. For baseline comparisons, we have compared against several established tokenizers; we will expand the experimental section to include more recent methods and add a note on fine-tuned VFM variants, explaining that our work prioritizes the frozen case. These changes will be incorporated in the next version.
Revision: yes
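The run statistics promised in this response could be computed along the following lines. This is a minimal sketch using Python's standard `statistics` module; the helper name `summarize_runs` is hypothetical, and the per-run scores have to be supplied by whoever runs the experiments, since no per-run values are reported here.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run metric values.

    The sample standard deviation (n - 1 in the denominator) is used
    because a handful of runs is a sample, not the full population.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def format_summary(scores: Sequence[float]) -> str:
    """Render the statistics in the usual 'mean ± std (n runs)' form."""
    mean, std = summarize_runs(scores)
    return f"{mean:.2f} ± {std:.2f} (n={len(scores)})"
```

With only three runs the deviation estimate is itself noisy, so it is worth reporting the number of runs alongside the figures, as `format_summary` does.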
Circularity Check
No circularity: empirical tokenizer design and objective investigation are self-contained
The paper's central results rest on constructing VFMTok atop a frozen VFM encoder, applying region-adaptive quantization and a semantic reconstruction loss, then reporting downstream gFID, convergence speed, and CFG-free generation metrics on ImageNet. These are external, falsifiable benchmarks rather than quantities derived from the paper's own equations or fitted parameters. The investigation into which VFM pre-training objectives (global contrastive + latent MIM) yield better tokenizers is likewise an empirical comparison across frozen models, not a self-referential reduction or self-citation chain. No load-bearing step equates a claimed prediction to its input by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Representations from a frozen vision foundation model pre-trained with global contrastive learning and latent masked image modeling are suitable and optimal for building a generalist image tokenizer.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Building Normalizing Flows with Stochastic Interpolants
Michael S Albergo and Eric Vanden-Eijnden. Building nor- malizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[2]
Flextok: Resampling images into 1d to- ken sequences of flexible length
Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O˘ guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Za- mir, and Afshin Dehghan. Flextok: Resampling images into 1d to- ken sequences of flexible length. arXiv preprint arXiv:2502.13967, 2025
-
[3]
Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. Machine learning for data science handbook: data mining and knowledge discovery handbook, pages 353–374, 2023
work page 2023
-
[4]
Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation
Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Esti- mating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[5]
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[6]
Understanding disentangling in $\beta$-VAE
Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Under- standing disentangling in β-vae. arXiv preprint arXiv:1804.03599, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[7]
Emerging proper- ties in self-supervised vision transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging proper- ties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021
work page 2021
-
[8]
Maskgit: Masked generative image transformer
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022
work page 2022
-
[9]
arXiv preprint arXiv:2509.25162 (2025) 4
Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025
-
[10]
Generative pretraining from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020
work page 2020
-
[11]
Improved Baselines with Momentum Contrastive Learning
Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Im- proved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2003
-
[12]
Detection in crowded scenes: One proposal, multiple predictions
Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, and Jian Sun. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12214–12223, 2020
work page 2020
-
[13]
Deformable convolutional networks
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017
work page 2017
-
[14]
Vision Transformers Need Registers
Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bo- janowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[15]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei- Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009
work page 2009
-
[16]
Bert: Pre-training of deep bidirectional transform- ers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transform- ers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019
work page 2019
-
[17]
Diffusion models beat gans on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021
work page 2021
-
[18]
An introduction to variational autoencoders
P Kingma Diederik and Welling Max. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019
work page 2019
-
[19]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[21]
Scaling rectified flow transformers for high- resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high- resolution image synthesis. In Forty-first international conference on machine learning, 2024
work page 2024
-
[22]
Taming transformers for high-resolution image synthesis
Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021
work page 2021
-
[23]
Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2024
-
[24]
Generative adversarial networks
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun Acm, 2020
work page 2020
-
[25]
Bootstrap your own latent-a new approach to self-supervised learning
Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020
work page 2020
-
[26]
Learnings from scaling visual tokenizers for reconstruction and generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025
-
[27]
Masked autoencoders are scalable vision learn- ers
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learn- ers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022
work page 2022
-
[28]
Momentum contrast for unsupervised visual representation learning
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Gir- shick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019
-
[29]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016
work page 2016
-
[30]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017
work page 2017
-
[31]
Burgess, Xavier Glorot, Matthew M
Irina Higgins, Loïc Matthey, Arka Pal, Christopher P . Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016
work page 2016
-
[32]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
work page 2020
-
[33]
Image-to-image translation with conditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017
work page 2017
-
[34]
Image-to-image translation with conditional adversarial networks
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017
work page 2017
-
[35]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021
work page 2021
-
[36]
Guiding a diffusion model with a bad version of itself
Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehti- nen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024
work page 2024
-
[37]
A style-based generator architecture for generative adversarial networks
Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019
work page 2019
-
[38]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013
work page internal anchor Pith review Pith/arXiv arXiv 2013
-
[39]
Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025. 14
-
[40]
Boosting generative image modeling via joint image-feature synthesis
Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025
-
[41]
Alina Kuznetsova, Mohamad Hassan Mohamad Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Com...
work page 1956
-
[42]
Improved precision and recall metric for assessing generative models
Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehti- nen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019
work page 2019
-
[43]
Autoregressive image generation using residual quantization
Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook- Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022
work page 2022
-
[44]
Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end- to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025
-
[45]
Autoregres- sive image generation without vector quantization.arXiv preprint arXiv:2406.11838,
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024
-
[46]
Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024
-
[47]
Feature pyramid networks for object detection
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017
work page 2017
-
[48]
Focal loss for dense object detection
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017
work page 2017
-
[49]
Flow Matching for Generative Modeling
Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[50]
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[52]
Open-magvit2: An open-source project toward democratizing auto-regressive visual gener- ation
Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024
-
[53]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024
work page 2024
-
[54]
Finite Scalar Quantization: VQ-VAE Made Simple
Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[55]
Conditional Generative Adversarial Nets
Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[56]
One-d-piece: Image tokenizer meets quality- controllable compression
Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-d-piece: Image tokenizer meets quality- controllable compression. arXiv preprint arXiv:2501.10064, 2025
-
[57]
Improved denois- ing diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denois- ing diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021
work page 2021
-
[58]
Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labat...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[59]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
work page 2023
-
[60]
Tokenflow: Unified image tokenizer for multimodal understanding and generation
Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. To- kenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024
-
[61]
Learning transferable visual models from natural language super- vision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language super- vision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[62]
Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adver- sarial networks. arXiv preprint arXiv:1511.06434, 2015
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[63]
Improving language understanding by generative pre-training, 2018
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018
work page 2018
-
[64]
Generating diverse high-fidelity images with vq-vae-2
Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[65]
High-resolution image synthesis with latent diffusion models, 2021
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021
work page 2021
-
[66]
Photorealistic text-to-image diffusion models with deep language understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gon- tijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479– 36494, 2022
work page 2022
-
[67]
Improved techniques for training gans
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016
-
[68]
Improved techniques for training gans
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016
-
[69]
Denoising Diffusion Implicit Models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020
-
[70]
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025
-
[71]
Roformer: Enhanced transformer with rotary position embedding
Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
-
[72]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024
-
[73]
Rethinking the inception architecture for computer vision
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016
-
[74]
Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025
-
[75]
Visual autoregressive modeling: Scalable image generation via next-scale prediction
Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024
-
[76]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
-
[77]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025
-
[78]
Conditional image generation with pixelcnn decoders
Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems, 29, 2016
-
[79]
Neural discrete representation learning
Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017
-
[80]
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010
-
[81]
Ddt: Decoupled diffusion transformer
Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025