Vision Foundation Models as Generalist Tokenizers for Image Generation

Anlin Zheng; Chuofan Ma; Gang Yu; Lanxi Gong; Qi Han; Xiangyu Zhang; Xiaojuan Qi; Xin Wen

arxiv: 2605.18390 · v1 · pith:TG3ZPEBZnew · submitted 2026-05-18 · 💻 cs.CV

Anlin Zheng, Qi Han, Xin Wen, Chuofan Ma, Lanxi Gong, Gang Yu, Xiangyu Zhang, Xiaojuan Qi

Pith reviewed 2026-05-20 10:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision foundation models · image tokenization · autoregressive generation · image synthesis · latent quantization · semantic reconstruction · ImageNet class-conditional

The pith

A frozen vision foundation model can be used directly as the encoder for a generalist image tokenizer that operates in both discrete and continuous spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that a vision foundation model pre-trained with global contrastive learning and latent masked image modeling provides representations that work well as the basis for image tokenization without any encoder fine-tuning. The authors add a region-adaptive quantization step to cut spatial redundancy in the 2D feature grid and a semantic reconstruction objective that keeps decoded outputs aligned with the original VFM features. These changes produce VFMTok, a tokenizer usable for autoregressive models in discrete token spaces and for denoising models in continuous spaces. A sympathetic reader would care because the approach yields faster model convergence, fewer tokens, and strong generation metrics on ImageNet while removing the need for classifier-free guidance during inference.

Core claim

VFMTok is built by taking a frozen VFM as encoder and adding region-adaptive quantization to remove spatial redundancy from 2D grid features together with a semantic reconstruction objective that aligns decoded outputs with VFM representations. This produces a generalist tokenizer that works seamlessly in discrete latent spaces for autoregressive generation and in continuous spaces for denoising-based generation. On ImageNet class-conditional synthesis the discrete version reaches a gFID of 1.36 with three times faster convergence while the continuous version reaches 1.25 gFID; both achieve high-fidelity results without classifier-free guidance.
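The semantic reconstruction objective described above can be sketched as follows. This is a minimal illustration, assuming alignment is measured as the mean cosine distance between decoded token features and the frozen VFM's features. This page does not state the paper's exact loss, so the cosine form and the names `cosine_similarity` and `semantic_reconstruction_loss` are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_a * norm_b + 1e-12)


def semantic_reconstruction_loss(decoded_feats, vfm_feats):
    """Mean (1 - cosine similarity) over token positions.

    decoded_feats: features predicted by the tokenizer's decoder (hypothetical).
    vfm_feats: target features from the frozen VFM, treated as constants.
    """
    assert len(decoded_feats) == len(vfm_feats)
    total = sum(
        1.0 - cosine_similarity(d, t) for d, t in zip(decoded_feats, vfm_feats)
    )
    return total / len(vfm_feats)
```

In this sketch the loss approaches 0 when decoded features point in the same direction as the VFM targets and equals 1 when they are orthogonal. Because the encoder is frozen, only the decoder side would receive gradients in a real training loop.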

What carries the argument

Region-adaptive quantization framework paired with a semantic reconstruction objective applied to features from a frozen vision foundation model encoder, which removes spatial redundancy while preserving semantic fidelity for downstream generation.
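To illustrate how quantizing over regions rather than over every grid cell can reduce token count, here is a toy sketch. It swaps in plain softmax dot-product pooling for the deformable Transformer mentioned in the Figure 2 caption, so it is not the paper's mechanism. The region queries, the codebook, and all function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def pool_regions(grid_feats, region_queries):
    """Pool N grid features into K region tokens via dot-product attention."""
    dim = len(grid_feats[0])
    tokens = []
    for q in region_queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) for f in grid_feats]
        w = softmax(scores)
        tokens.append(
            [sum(w[n] * grid_feats[n][d] for n in range(len(grid_feats)))
             for d in range(dim)]
        )
    return tokens


def quantize(tokens, codebook):
    """Nearest-codebook lookup: index of the closest entry for each token."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(t, codebook[k]))
        for t in tokens
    ]


# Four grid features forming two nearly redundant groups.
grid = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
queries = [[5.0, 0.0], [0.0, 5.0]]   # hypothetical learned region queries
codebook = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical 2-entry codebook
region_tokens = pool_regions(grid, queries)
ids = quantize(region_tokens, codebook)  # 4 grid features -> 2 discrete tokens
```

The point of the toy is only the shape change: four redundant grid features collapse into two region tokens before the codebook lookup, which is the kind of spatial-redundancy reduction the passage describes.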

If this is right

Discrete autoregressive generators converge three times faster.
Class-conditional synthesis reaches a gFID of 1.36 on ImageNet.
Continuous-space generation with a denoising model reaches a gFID of 1.25.
High-fidelity synthesis succeeds without classifier-free guidance in both paradigms.
Tokenizer quality depends on the exact combination of self-supervised objectives used in VFM pre-training.
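The inference-speed consequence of dropping classifier-free guidance can be made concrete with a small sketch. It uses the commonly used guidance mix, `uncond + scale * (cond - uncond)`, and a toy `CountingModel` that is a hypothetical stand-in for a generator, not anything from the paper. The sketch shows only that a guided step needs two forward passes where an unguided step needs one.

```python
def cfg_logits(cond_logits, uncond_logits, scale):
    """Classifier-free guidance mix of conditional and unconditional predictions."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


class CountingModel:
    """Toy stand-in for a generator; it only counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, tokens, class_id):
        self.calls += 1
        # Dummy deterministic logits, not a real model output.
        base = float(len(tokens))
        return [base, base + (1.0 if class_id is not None else 0.0)]


def step_with_cfg(model, tokens, class_id, scale):
    """One sampling step with guidance: two forward passes."""
    cond = model(tokens, class_id)
    uncond = model(tokens, None)  # second pass with the class label dropped
    return cfg_logits(cond, uncond, scale)


def step_without_cfg(model, tokens, class_id):
    """One sampling step without guidance: a single forward pass."""
    return model(tokens, class_id)
```

With `scale = 1.0` the mix reduces to the conditional prediction, which is why a model that samples well without guidance can skip the unconditional pass entirely.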

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same frozen-VFM approach could be tested on video or 3D data to see whether region-adaptive quantization still reduces redundancy effectively.
Smaller generative models paired with VFMTok might preserve quality while using even fewer parameters overall.
Out-of-distribution images could be used to measure how much the semantic reconstruction objective protects against domain shift.
Future tokenizers might be designed by first selecting VFM pre-training objectives that maximize downstream generation metrics rather than designing new quantization schemes from scratch.

Load-bearing premise

Representations from a VFM pre-trained with global contrastive learning plus latent masked image modeling stay optimal for tokenization and generation without any encoder fine-tuning or adaptation.

What would settle it

Train an otherwise identical tokenizer using a VFM pre-trained with only one of the two objectives (contrastive learning or latent masked image modeling) and check whether gFID rises above 1.36 or convergence slows below the reported three-fold speedup on the same ImageNet class-conditional task.

Figures

Figures reproduced from arXiv: 2605.18390 by Anlin Zheng, Chuofan Ma, Gang Yu, Lanxi Gong, Qi Han, Xiangyu Zhang, Xiaojuan Qi, Xin Wen.

**Figure 2.** The framework of VFMTok/VFMAE. VFMTok/VFMAE utilizes a frozen VFM to extract multi-level image features. A deformable Transformer …

**Figure 7.** … For optimal clarity, please zoom in.

**Figure 3.** Autoregressive class-conditional image generation with classifier-free guidance (CFG).

**Figure 4.** Autoregressive class-conditional image generation without classifier-free guidance (CFG).

Original abstract

In this work, we explore the largely unexplored direction of building a generalist image tokenizer directly on top of a frozen vision foundation model (VFM). To build this tokenizer, we utilize a frozen VFM as the encoder and introduce two key innovations: (1) a region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features, and (2) a semantic reconstruction objective that aligns the decoded outputs with the VFM's representations to preserve semantic fidelity. Grounded in these designs, we propose VFMTok, a generalist visual tokenizer capable of operating seamlessly in both discrete and continuous latent spaces. VFMTok achieves substantial improvements in synthesis quality while drastically enhancing token efficiency. For discrete autoregressive (AR) generation, it accelerates model convergence by **3 times** and achieves a state-of-the-art gFID of **1.36** on ImageNet class-conditional synthesis. Similarly, for continuous-space generation, integrating VFMTok with a denoising model yields an exceptional gFID of **1.25**. Furthermore, because the latent space inherently captures rich spatial semantics, VFMTok enables high-fidelity class-conditional synthesis without classifier-free guidance (**w/o CFG**) across both generative paradigms, significantly accelerating inference speed. Beyond these remarkable empirical results, we systematically investigate the underlying mechanisms of our approach. We discover that the specific self-supervised learning objectives utilized during VFM pre-training dictate its effectiveness as a tokenizer. Specifically, a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization. These insights establish a strong foundation and offer valuable guidance for the design of future image tokenizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VFMTok uses a frozen VFM with region-adaptive quantization and semantic reconstruction to cut tokens and hit strong gFID numbers on ImageNet, but the no-fine-tuning assumption needs direct testing.

The main point is that this paper takes a frozen vision foundation model, adds region-adaptive quantization to drop spatial redundancy, and pairs it with a semantic reconstruction loss that keeps the decoded tokens aligned with the original VFM features. They call the result VFMTok and show it works as a generalist tokenizer in both discrete and continuous spaces for downstream image generation. The reported gains are concrete: 3x faster convergence in autoregressive models, gFID of 1.36 on class-conditional ImageNet, gFID of 1.25 in the continuous case, and the ability to drop classifier-free guidance entirely because the latents already carry usable spatial semantics. That last part could matter for inference speed in practice.

What is actually new is the direct use of a frozen VFM as the encoder rather than training a new one from scratch, plus the explicit link they draw between VFM pre-training objectives and tokenizer quality. They find that models trained with global contrastive learning plus latent masked image modeling give the best representations for this task, and they test this across several frozen VFMs. That investigation is useful and goes beyond a simple engineering tweak. The results are presented with clear numbers and a systematic check on the pre-training objectives, which is a strength.

The soft spot is the choice to keep the VFM completely frozen. The paper shows that pre-training objectives matter when everything stays frozen, but it does not test whether jointly fine-tuning the encoder with the new quantization and reconstruction objectives would improve semantic alignment or push the gFID numbers lower. If adaptation helps, the current version may not be the strongest possible. The full paper will need to show the ablations on the region-adaptive component and the reconstruction loss, plus fair baselines and error bars, to make the claims fully convincing.

This is aimed at researchers working on token-efficient autoregressive or diffusion pipelines for image generation. Anyone trying to reduce token counts or remove CFG would find practical ideas here. The work is empirical and the claims are testable, so it deserves a serious referee to check the experiments and see whether the frozen-encoder design is optimal or just convenient.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes VFMTok, a generalist visual tokenizer built atop a frozen vision foundation model (VFM) encoder. It introduces region-adaptive quantization to reduce spatial redundancy in 2D grid features and a semantic reconstruction objective to align decoded outputs with VFM representations. VFMTok supports both discrete and continuous latent spaces for image generation, reporting SOTA gFID of 1.36 on ImageNet class-conditional discrete AR synthesis (with 3x faster convergence) and 1.25 for continuous denoising models, plus CFG-free generation due to rich semantics. The work also investigates SSL objectives, finding that global contrastive learning combined with latent masked image modeling yields optimal VFM representations for tokenization.

Significance. If the empirical claims hold under full verification, the results indicate that frozen VFMs can serve as effective generalist tokenizers with targeted quantization and reconstruction losses, yielding substantial gains in synthesis quality, token efficiency, and inference speed. The finding on SSL objective combinations provides concrete guidance for selecting pre-trained encoders in future tokenizer designs and could reduce the need for end-to-end training of visual encoders in generative pipelines.

major comments (2)

[Abstract / VFM pre-training objectives paragraph] Abstract and § on VFM pre-training objectives: the central claim that a frozen VFM (pre-trained with global contrastive + latent MIM) remains optimal for tokenization without encoder fine-tuning is load-bearing for the reported gFID 1.36/1.25 and 3x convergence, yet no ablation compares this to joint adaptation of the encoder with the region-adaptive quantization and semantic reconstruction objectives. If joint fine-tuning better preserves spatial semantics, the efficiency and quality gains may not represent the strongest instantiation.
[Experiments / Results tables] Experiments section (results on ImageNet AR and continuous generation): the gFID scores and convergence claims lack reported error bars, number of runs, or full baseline comparisons (including recent tokenizers and fine-tuned VFM variants), making it difficult to assess whether the 1.36 gFID and 3x speedup are robust or sensitive to implementation details.

minor comments (2)

[Method] Notation for region-adaptive quantization could be clarified with an explicit equation or diagram showing how patch selection varies per region.
[Discussion] The manuscript would benefit from a dedicated limitations paragraph discussing potential failure modes when the VFM's pre-training data distribution differs from the target generation dataset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and constructive suggestions. Below we respond to each major comment, outlining our planned revisions where appropriate.

Referee: [Abstract / VFM pre-training objectives paragraph] Abstract and § on VFM pre-training objectives: the central claim that a frozen VFM (pre-trained with global contrastive + latent MIM) remains optimal for tokenization without encoder fine-tuning is load-bearing for the reported gFID 1.36/1.25 and 3x convergence, yet no ablation compares this to joint adaptation of the encoder with the region-adaptive quantization and semantic reconstruction objectives. If joint fine-tuning better preserves spatial semantics, the efficiency and quality gains may not represent the strongest instantiation.

Authors: We thank the referee for this observation. The manuscript's focus is on demonstrating that frozen VFMs, without any encoder fine-tuning, can serve as effective generalist tokenizers when combined with our proposed region-adaptive quantization and semantic reconstruction. This design choice emphasizes efficiency and the reusability of pre-trained models. While we acknowledge that joint fine-tuning could potentially yield further improvements, it would deviate from the generalist and frozen paradigm we aim to explore. In the revision, we will add a paragraph discussing this limitation and why the frozen setting is of particular interest, including references to works that do perform fine-tuning. This constitutes a partial revision as we will enhance the discussion but not conduct new joint fine-tuning experiments at this stage.
revision: partial

Referee: [Experiments / Results tables] Experiments section (results on ImageNet AR and continuous generation): the gFID scores and convergence claims lack reported error bars, number of runs, or full baseline comparisons (including recent tokenizers and fine-tuned VFM variants), making it difficult to assess whether the 1.36 gFID and 3x speedup are robust or sensitive to implementation details.

Authors: We agree that providing error bars and details on the number of runs would enhance the credibility of the empirical results. In the revised manuscript, we will report the mean and standard deviation of gFID scores over multiple runs (specifically, we will run the experiments three times and include the statistics). We will also clarify the convergence speed measurements. For baseline comparisons, we have compared against several established tokenizers; we will expand the experimental section to include more recent methods and add a note on fine-tuned VFM variants, explaining that our work prioritizes the frozen case. These changes will be incorporated in the next version.
revision: yes
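
The mean and standard deviation reporting promised in this response can be computed with Python's standard `statistics` module. The sketch below is an illustration only: the helper name `summarize_runs` is hypothetical and the input scores are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(gfid_scores):
    """Return (mean, sample standard deviation) of gFID over repeated runs."""
    if len(gfid_scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(gfid_scores), statistics.stdev(gfid_scores)


# Placeholder values for illustration only; these are NOT results from the paper.
placeholder_scores = [2.0, 2.5, 3.0]
mean, std = summarize_runs(placeholder_scores)
print(f"gFID: {mean:.2f} ± {std:.2f} over {len(placeholder_scores)} runs")
```

`statistics.stdev` is the sample (n − 1) estimator, which is the usual choice when only a handful of runs are available; with three runs the estimate is still coarse, so the spread should be read as indicative rather than precise.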

Circularity Check

0 steps flagged

No circularity: empirical tokenizer design and objective investigation are self-contained

The paper's central results rest on constructing VFMTok atop a frozen VFM encoder, applying region-adaptive quantization and a semantic reconstruction loss, then reporting downstream gFID, convergence speed, and CFG-free generation metrics on ImageNet. These are external, falsifiable benchmarks rather than quantities derived from the paper's own equations or fitted parameters. The investigation into which VFM pre-training objectives (global contrastive + latent MIM) yield better tokenizers is likewise an empirical comparison across frozen models, not a self-referential reduction or self-citation chain. No load-bearing step equates a claimed prediction to its input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central approach rests on the domain assumption that frozen VFM representations are directly usable for tokenization once augmented with the two proposed components; no explicit free parameters or new invented entities are described in the abstract.

axioms (1)

domain assumption: Representations from a frozen vision foundation model pre-trained with global contrastive learning and latent masked image modeling are suitable and optimal for building a generalist image tokenizer.
This premise is invoked when the encoder is kept frozen and when the authors conclude that specific SSL objectives dictate tokenizer effectiveness.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry lists the Lean source file, the theorem, and a tag for the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "a VFM jointly optimized with global contrastive learning and latent masked image modeling provides the optimal representations for image tokenization"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "region-adaptive quantization framework to eliminate spatial redundancy in standard 2D grid features"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 25 internal anchors

[1]

Building Normalizing Flows with Stochastic Interpolants

Michael S Albergo and Eric Vanden-Eijnden. Building nor- malizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

Flextok: Resampling images into 1d to- ken sequences of flexible length

Roman Bachmann, Jesse Allardice, David Mizrahi, Enrico Fini, O˘ guzhan Fatih Kar, Elmira Amirloo, Alaaeldin El-Nouby, Amir Za- mir, and Afshin Dehghan. Flextok: Resampling images into 1d to- ken sequences of flexible length. arXiv preprint arXiv:2502.13967, 2025

work page arXiv 2025
[3]

Autoencoders

Dor Bank, Noam Koenigstein, and Raja Giryes. Autoencoders. Machine learning for data science handbook: data mining and knowledge discovery handbook, pages 353–374, 2023

work page 2023
[4]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Esti- mating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[5]

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Understanding disentangling in $\beta$-VAE

Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Under- standing disentangling in β-vae. arXiv preprint arXiv:1804.03599, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[7]

Emerging proper- ties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging proper- ties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), 2021

work page 2021
[8]

Maskgit: Masked generative image transformer

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022

work page 2022
[9]

arXiv preprint arXiv:2509.25162 (2025) 4

Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. arXiv preprint arXiv:2509.25162, 2025

work page arXiv 2025
[10]

Generative pretraining from pixels

Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pages 1691–1703. PMLR, 2020

work page 2020
[11]

Improved Baselines with Momentum Contrastive Learning

Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Im- proved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2003
[12]

Detection in crowded scenes: One proposal, multiple predictions

Xuangeng Chu, Anlin Zheng, Xiangyu Zhang, and Jian Sun. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12214–12223, 2020

work page 2020
[13]

Deformable convolutional networks

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017

work page 2017
[14]

Vision Transformers Need Registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bo- janowski. Vision transformers need registers. arXiv preprint arXiv:2309.16588, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei- Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009

work page 2009
[16]

Bert: Pre-training of deep bidirectional transform- ers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transform- ers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019

work page 2019
[17]

Diffusion models beat gans on image synthesis

Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021

work page 2021
[18]

An introduction to variational autoencoders

P Kingma Diederik and Welling Max. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019

work page 2019
[19]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[21]

Scaling rectified flow transformers for high- resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high- resolution image synthesis. In Forty-first international conference on machine learning, 2024

work page 2024
[22]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021

work page 2021
[23]

One layer is enough: Adapting pretrained visual encoders for image generation.arXiv preprint arXiv:2512.07829, 2025

Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2024

work page arXiv 2024
[24]

Generative adversarial networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Commun Acm, 2020

work page 2020
[25]

Bootstrap your own latent-a new approach to self-supervised learning

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020

work page 2020
[26]

Learnings from scaling visual tokenizers for reconstruction and generation

Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, and Xinlei Chen. Learnings from scaling visual tokenizers for reconstruction and generation. arXiv preprint arXiv:2501.09755, 2025

work page arXiv 2025
[27]

Masked autoencoders are scalable vision learn- ers

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learn- ers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022

work page 2022
[28]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Gir- shick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019

work page arXiv 1911
[29]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

work page 2016
[30]

Gans trained by a two time-scale update rule converge to a local nash equilibrium

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017

work page 2017
[31]

Burgess, Xavier Glorot, Matthew M

Irina Higgins, Loïc Matthey, Arka Pal, Christopher P . Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016

work page 2016
[32]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020

work page 2020
[33]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017

work page 2017
[34]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017

work page 2017
[35]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021

work page 2021
[36]

Guiding a diffusion model with a bad version of itself

Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehti- nen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. Advances in Neural Information Processing Systems, 37:52996–53021, 2024

work page 2024
[37]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019

work page 2019
[38]

Auto-Encoding Variational Bayes

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[39]

Eq-vae: Equivariance regularized latent space for improved generative image modeling.arXiv preprint arXiv:2502.09509,

Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025. 14

work page arXiv 2025
[40]

Boosting generative image modeling via joint image-feature synthesis

Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025

work page arXiv 2025
[41]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Mohamad Hassan Mohamad Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Com...

work page 1956
[42]

Improved precision and recall metric for assessing generative models

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehti- nen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019

work page 2019
[43]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook- Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

work page 2022
[44]

Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end- to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

[45]

Autoregressive image generation without vector quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024

[46]

Imagefolder: Autoregressive image generation with folded tokens

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024

[47]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

[48]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017

[49]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

[50]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

[51]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[52]

Open-magvit2: An open-source project toward democratizing auto-regressive visual generation

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024

[53]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

[54]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

[55]

Conditional Generative Adversarial Nets

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014

[56]

One-d-piece: Image tokenizer meets quality-controllable compression

Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-d-piece: Image tokenizer meets quality-controllable compression. arXiv preprint arXiv:2501.10064, 2025

[57]

Improved denoising diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021

[58]

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labat...

[59]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

[60]

Tokenflow: Unified image tokenizer for multimodal understanding and generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

[61]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[62]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015

[63]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018

[64]

Generating diverse high-fidelity images with vq-vae-2

Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in Neural Information Processing Systems, 32, 2019

[65]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

[66]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022

[67]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

[68]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016

[69]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

[70]

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025

[71]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

[72]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

[73]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016

[74]

Unilip: Adapting clip for unified multimodal understanding, generation and editing

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025

[75]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024

[76]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

[77]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

[78]

Conditional image generation with pixelcnn decoders

Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems, 29, 2016

[79]

Neural discrete representation learning

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017

[80]

Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010

[81]

Ddt: Decoupled diffusion transformer

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025




[38] [39]

Eq-vae: Equivariance regularized latent space for improved generative image modeling.arXiv preprint arXiv:2502.09509,

Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509, 2025. 14

work page arXiv 2025

[39] [40]

Boosting generative image modeling via joint image-feature synthesis

Theodoros Kouzelis, Efstathios Karypidis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Boosting generative image modeling via joint image-feature synthesis. arXiv preprint arXiv:2504.16064, 2025

work page arXiv 2025

[40] [41]

The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale

Alina Kuznetsova, Mohamad Hassan Mohamad Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Com...

work page 1956

[41] [42]

Improved precision and recall metric for assessing generative models

Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehti- nen, and Timo Aila. Improved precision and recall metric for assessing generative models. Advances in neural information processing systems, 32, 2019

work page 2019

[42] [43]

Autoregressive image generation using residual quantization

Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook- Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11523–11532, 2022

work page 2022

[43] [44]

Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end- to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483, 2025

work page arXiv 2025

[44] [45]

Autoregres- sive image generation without vector quantization.arXiv preprint arXiv:2406.11838,

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024

work page arXiv 2024

[45] [46]

Imagefolder: Autoregres- sive image generation with folded tokens.arXiv preprint arXiv:2410.01756, 2024

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, and Zhe Lin. Imagefolder: Autoregressive image generation with folded tokens. arXiv preprint arXiv:2410.01756, 2024

work page arXiv 2024

[46] [47]

Feature pyramid networks for object detection

Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017

work page 2017

[47] [48]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988, 2017

work page 2017

[48] [49]

Flow Matching for Generative Modeling

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[49] [50]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[50] [51]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[51] [52]

Open-magvit2: An open-source project toward democratizing auto-regressive visual gener- ation

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024

work page arXiv 2024

[52] [53]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024

work page 2024

[53] [54]

Finite Scalar Quantization: VQ-VAE Made Simple

Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. Finite scalar quantization: Vq-vae made simple. arXiv preprint arXiv:2309.15505, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[54] [55]

Conditional Generative Adversarial Nets

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[55] [56]

One-d-piece: Image tokenizer meets quality- controllable compression

Keita Miwa, Kento Sasaki, Hidehisa Arai, Tsubasa Takahashi, and Yu Yamaguchi. One-d-piece: Image tokenizer meets quality- controllable compression. arXiv preprint arXiv:2501.10064, 2025

work page arXiv 2025

[56] [57]

Improved denois- ing diffusion probabilistic models

Alexander Quinn Nichol and Prafulla Dhariwal. Improved denois- ing diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021

work page 2021

[57] [58]

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labat...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[58] [59]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

work page 2023

[59] [60]

Tokenflow: Unified image tokenizer for multimodal understanding and generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. To- kenflow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024

work page arXiv 2024

[60] [61]

Learning transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language super- vision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

work page 2021

[61] [62]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adver- sarial networks. arXiv preprint arXiv:1511.06434, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[62] [63]

Improving language understanding by generative pre-training, 2018

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training, 2018

work page 2018

[63] [64]

Generating diverse high-fidelity images with vq-vae-2

Ali Razavi, Aäron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[64] [65]

High-resolution image synthesis with latent diffusion models, 2021

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

work page 2021

[65] [66]

Photorealistic text-to-image diffusion models with deep language understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gon- tijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479– 36494, 2022

work page 2022

[66] [67]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016

work page 2016

[67] [68]

Improved techniques for training gans

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in Neural Information Processing Systems, 29, 2016

work page 2016

[68] [69]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2020

[69] [70]

DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies

Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[70] [71]

Roformer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

work page 2024

[71] [72]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[72] [73]

Rethinking the inception architecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016

work page 2016

[73] [74]

Unilip: Adapting clip for unified multimodal understanding, generation and editing

Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025

work page arXiv 2025

[74] [75]

Visual autoregressive modeling: Scalable image generation via next-scale prediction

Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024

work page arXiv 2024

[75] [76]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[76] [77]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[77] [78]

Conditional image generation with pixelcnn decoders

Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional image generation with pixelcnn decoders. Advances in Neural Information Processing Systems, 29, 2016

work page 2016

[78] [79]

Neural discrete representation learning

Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017

work page 2017

[79] [80]

Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion

Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010

work page 2010

[80] [81]

Ddt: Decoupled diffusion transformer

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer. arXiv preprint arXiv:2504.05741, 2025

work page arXiv 2025