VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models
Pith reviewed 2026-05-18 05:18 UTC · model grok-4.3
The pith
Frozen vision foundation models serve as strong tokenizers for latent diffusion models when paired with a new decoder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By freezing a Vision Foundation Model (VFM) and training only a new decoder, VFM-VAE lets the VFM's semantic features serve as the latent space for diffusion models. This bypasses the distillation losses that degrade the original VFM's robustness and supports a systematic study of how tokenizer representations affect learning across diffusion training steps. The paper reports concrete gains from the resulting dual-side alignment: gFID without CFG drops to 2.22 in 80 epochs, a claimed 10× speedup over earlier tokenizers, and reaches 1.62 after 640 epochs.
What carries the argument
VFM-VAE tokenizer consisting of a frozen Vision Foundation Model encoder that supplies semantic features and a newly designed decoder that reconstructs images from those features.
If this is right
- Latent diffusion training reaches usable image quality in far fewer epochs than with conventional tokenizers.
- Robust semantic features from frozen VFMs improve representation learning throughout the diffusion process.
- Dual-side alignment between tokenizer and diffusion model produces measurable gains in final generation metrics.
- Continued training beyond the initial 80 epochs further reduces gFID without requiring changes to the tokenizer.
Where Pith is reading between the lines
- Similar frozen-foundation-model tokenizers could shorten training for other generative models that rely on learned latents.
- Computational budgets for high-quality image synthesis could drop if the approach scales to larger models and datasets.
- Different pre-trained vision foundation models might be swapped in to tailor the tokenizer to particular image domains.
Load-bearing premise
A newly trained decoder can produce realistic images from the semantic features of an untouched, frozen vision foundation model while keeping those features robust.
What would settle it
If the same diffusion model, trained with the VFM-VAE tokenizer, fails to reach a lower gFID than it does with prior tokenizers after an equal number of epochs, the efficiency and performance claims would not hold.
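A minimal sketch of that equal-epoch check, assuming each training run is logged as a mapping from epoch to gFID (lower is better). The function names and the toy curves are hypothetical and purely illustrative; they are not numbers from the paper or from any baseline.

```python
def epochs_to_reach(curve, target):
    """Smallest logged epoch at which gFID <= target, or None if never reached."""
    hits = [epoch for epoch, gfid in curve.items() if gfid <= target]
    return min(hits) if hits else None


def wins_at_equal_epochs(candidate, baseline):
    """For every epoch logged in both runs, does the candidate have lower gFID?"""
    shared = sorted(set(candidate) & set(baseline))
    return {epoch: candidate[epoch] < baseline[epoch] for epoch in shared}


def speedup(candidate, baseline, target):
    """Ratio of baseline epochs to candidate epochs needed to reach the target."""
    c = epochs_to_reach(candidate, target)
    b = epochs_to_reach(baseline, target)
    if c is None or b is None:
        return None  # at least one run never reached the target
    return b / c


# Toy curves (made-up values, NOT results from the paper).
toy_candidate = {10: 9.0, 20: 5.0, 40: 4.0}
toy_baseline = {10: 15.0, 20: 12.0, 40: 8.0, 100: 5.0}

print(wins_at_equal_epochs(toy_candidate, toy_baseline))
print(speedup(toy_candidate, toy_baseline, target=5.0))
```

If any shared epoch maps to False, or the speedup ratio falls well short of the claimed factor, the corresponding claim would be in doubt under this reading.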
Original abstract
The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representations learned from the original VFM. In this paper, we bypass distillation by proposing a more direct approach that leverages a frozen VFM as the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the potential of the frozen VFM as the LDM tokenizer, we design a new decoder to reconstruct realistic images from the semantic-rich representation of the VFM. With the proposed VFM-VAE, we conduct a systematic study of how the representations from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our efforts in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.22 in merely 80 epochs (a 10× speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers that accelerate LDM training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VFM-VAE, a tokenizer for Latent Diffusion Models that uses a frozen Vision Foundation Model (e.g., DINOv2) as the encoder paired with a newly designed decoder to reconstruct images from semantic-rich VFM features, bypassing distillation. It includes a systematic study of how different tokenizers affect diffusion training dynamics and reports strong efficiency gains: gFID (w/o CFG) of 2.22 after 80 epochs (claimed 10× speedup over prior tokenizers) and 1.62 after 640 epochs.
Significance. If the central claims hold, the work would demonstrate that pre-trained VFMs can serve directly as robust tokenizers for LDMs, yielding faster convergence and improved generation quality through dual-side alignment without weakening VFM representations. The systematic tokenizer study is a positive contribution that could inform future tokenizer design.
major comments (2)
- [§4] §4 (Experiments) and associated tables: the gFID results (2.22 at 80 epochs, 1.62 at 640 epochs) and 10× speedup claim are presented without explicit baseline details (exact prior tokenizer architectures, training epochs, dataset splits, or run-to-run variance), making it impossible to verify the performance and efficiency assertions that are central to the paper.
- [§3.2] §3.2 (Decoder design): the claim that the new decoder reconstructs realistic images while fully preserving the original VFM's robustness and invariance (without any VFM updates or distillation) is load-bearing for the 'bypass distillation' argument, yet no direct supporting measurements (e.g., before/after linear probing accuracy, feature invariance tests, or downstream task performance on the frozen VFM) are reported.
minor comments (1)
- [Abstract / §4] The notation 'gFID (w/o CFG)' is used in the abstract and results but its precise computation and relation to standard FID should be defined in the main text or a dedicated section.
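For orientation, the standard Fréchet Inception Distance that such a definition would presumably build on is given below. Reading "gFID" as generation FID (generated samples scored against reference-set statistics) and "w/o CFG" as sampling without classifier-free guidance follows common usage and is an assumption here, since the reviewed text does not define either term.

```latex
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features computed on reference and generated images, respectively.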
Simulated Author's Rebuttal
Thank you for the constructive feedback on our submission. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [§4] §4 (Experiments) and associated tables: the gFID results (2.22 at 80 epochs, 1.62 at 640 epochs) and 10× speedup claim are presented without explicit baseline details (exact prior tokenizer architectures, training epochs, dataset splits, or run-to-run variance), making it impossible to verify the performance and efficiency assertions that are central to the paper.
Authors: We agree that additional details are required for independent verification. The baselines are taken from the cited prior works, but explicit side-by-side specifications were omitted. In the revised manuscript we will insert a new comparison table in Section 4 that lists, for each baseline tokenizer, the exact architecture, training epochs, dataset splits, and any reported run-to-run variance. This table will directly substantiate the reported gFID scores and the 10× speedup claim.
revision: yes
-
Referee: [§3.2] §3.2 (Decoder design): the claim that the new decoder reconstructs realistic images while fully preserving the original VFM's robustness and invariance (without any VFM updates or distillation) is load-bearing for the 'bypass distillation' argument, yet no direct supporting measurements (e.g., before/after linear probing accuracy, feature invariance tests, or downstream task performance on the frozen VFM) are reported.
Authors: We acknowledge that direct empirical measurements would strengthen the preservation claim. Because the VFM encoder remains completely frozen, its parameters and therefore its robustness properties are unchanged by construction. Nevertheless, to provide explicit evidence we will add, in the revision, linear-probing accuracy and feature-invariance results comparing the original VFM with the VFM-VAE pipeline. These new measurements will be placed in Section 3.2 and the appendix.
revision: yes
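The "unchanged by construction" argument can be illustrated with a toy sketch. This is not the paper's architecture: a fixed 2×2 linear map stands in for the frozen VFM, a second linear map stands in for the decoder, and all names and values are hypothetical. The only point shown is that when the training loop updates decoder weights alone, the encoder is bit-for-bit identical afterwards while reconstruction error still falls.

```python
import copy


def encode(x, w_enc):
    """Frozen encoder: a fixed linear map standing in for the VFM."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]


def decode(z, w_dec):
    """Trainable decoder: a linear map from latent back to input space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in w_dec]


def recon_loss(w_enc, w_dec, data):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w_enc), w_dec)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(data)


def train_decoder(w_enc, w_dec, data, lr=0.05, epochs=200):
    """Per-sample gradient descent; only w_dec is ever written to."""
    for _ in range(epochs):
        for x in data:
            z = encode(x, w_enc)
            err = [xh - xi for xh, xi in zip(decode(z, w_dec), x)]
            # Gradient of 0.5 * ||W_dec z - x||^2 w.r.t. W_dec is outer(err, z).
            for r in range(len(w_dec)):
                for c in range(len(z)):
                    w_dec[r][c] -= lr * err[r] * z[c]
    return w_dec


w_enc = [[1.0, 0.5], [-0.5, 1.0]]  # toy "VFM" weights, never updated
w_dec = [[0.0, 0.0], [0.0, 0.0]]   # decoder weights, trained from zero
data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

enc_before = copy.deepcopy(w_enc)
loss_before = recon_loss(w_enc, w_dec, data)
train_decoder(w_enc, w_dec, data)
loss_after = recon_loss(w_enc, w_dec, data)
print(w_enc == enc_before, loss_after < loss_before)
```

In a PyTorch setting the same effect is usually obtained by disabling gradients on the encoder's parameters (for example with `requires_grad_(False)`) and passing only decoder parameters to the optimizer. Note that identical parameters show the encoder itself is untouched. Whether the latents actually fed to the diffusion model keep the same robustness is what the promised linear-probing and invariance measurements would test.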
Circularity Check
No circularity: empirical claims rest on standard external metrics
The paper proposes VFM-VAE by freezing a VFM encoder and training only a new decoder to produce latents for LDM training. All reported results are measured by gFID on standard image-generation benchmarks, an externally defined metric independent of any internal parameter fit or self-referential definition. No equations, self-citations, or uniqueness theorems appear in the provided text that would reduce the performance claims to a tautology or to a fitted input renamed as a prediction. The central argument is therefore an empirical design choice whose validity can be checked by replication on public datasets and metrics.
Forward citations
Cited by 7 Pith papers
-
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.
-
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...
-
Vision Foundation Models as Generalist Tokenizers for Image Generation
VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.
-
Improved Baselines with Representation Autoencoders
RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
-
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
-
End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer
An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.
-
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.
Reference graph
Works this paper leans on
-
[1]
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797,
-
[2]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,
-
[3]
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network. arXiv preprint arXiv:2504.13181,
-
[4]
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568,
-
[5]
Masked Diffusion Transformer is a Strong Image Synthesizer
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 23107–23116, 2023a. doi: 10.1109/ICCV51070.2023.02117. Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong...
-
[6]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,
-
[7]
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment. arXiv preprint arXiv:2403.05135,
-
[8]
The Platonic Representation Hypothesis
Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. arXiv preprint arXiv:2405.07987,
-
[9]
Auto-Encoding Variational Bayes
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114,
-
[10]
Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. Eq-vae: Equivariance regularized latent space for improved generative image modeling. arXiv preprint arXiv:2502.09509,
-
[11]
Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. arXiv preprint arXiv:2504.10483,
-
[12]
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation
Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. arXiv preprint arXiv:2402.17245, 2024a. Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in...
-
[13]
Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. Unitok: A unified tokenizer for visual generation and understanding. arXiv preprint arXiv:2502.20321,
-
[14]
Generating images with sparse representations
Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841,
-
[15]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,
-
[16]
Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104,
-
[17]
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525,
-
[18]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389,
-
[19]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786,
-
[20]
Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467,
-
[21]
Latent denoising makes good visual tokenizers
Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent denoising makes good visual tokenizers. arXiv preprint arXiv:2507.15856,
-
[22]
Vector-quantized Image Modeling with Improved VQGAN
Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627,
-
[23]
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. Language model beats diffusion -- tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737,
-
[24]
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940,
- [25]
-
[26]
measures how well two representations preserve the same local neighborhood structures. Given kernel similarity matrices $K, L \in \mathbb{R}^{n \times n}$ from two representations of the same $n$ samples, we construct a binary mask $A_{ij} = \mathbb{1}\{i \neq j,\ j \in \mathrm{knn}^{K}_{k}(i) \wedge j \in \mathrm{knn}^{L}_{k}(i)\}$ to retain only the pairs that are common $k$-nearest neighbors in both spaces. The masked kernels are defined as K ...
-
[27]
upsampling unit with a pure PyTorch implementation for better readability and extensibility. The module normalizes input features via GroupNorm, performs 3×3 depthwise and 1×1 pointwise convolutions for local extraction and channel mixing, upsamples with PixelShuffle, and finally applies a fixed Gaussian blur to suppress checkerboard artifacts. It serve...
-
[28]
to the VAE encoder latent; during inference, this latent is decoded by the VAE decoder to generate the final image. The visual–language backbone is Qwen2.5-VL-3B-Instruct (Bai et al., 2025), and the diffusion backbone is Lumina-Next (DiT) (Zhuo et al., 2024), where the patch size is reduced to 1 and the input/output channels are aligned to a latent of 16×32×...
-
[29]
Our multi-stage training strategy follows the general structure of VA-VAE (Yao et al., 2025). In the strong alignment stage, large representation regularization losses are applied to quickly establish VFM–VAE alignment. In the weak alignment stage, the weight of this loss is reduced to maintain alignment while shifting focus toward reconstruction quali...
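The mutual k-nearest-neighbor mask described in the excerpt under entry [26] can be sketched as follows. This is a plain-Python illustration under stated assumptions: kernels are given as square lists of lists, larger kernel values mean more similar, and ties are broken by lower index. Only the mask $A$ is built, because the excerpt is cut off before the masked kernels are defined. The helper names are hypothetical.

```python
def knn_indices(kernel, i, k):
    """Set of the k samples most similar to sample i, excluding i itself."""
    others = [j for j in range(len(kernel)) if j != i]
    # Descending similarity; ties broken by lower index (an assumption).
    others.sort(key=lambda j: (-kernel[i][j], j))
    return set(others[:k])


def mutual_knn_mask(K, L, k):
    """A[i][j] = 1 iff i != j and j is a k-nearest neighbor of i under both K and L."""
    n = len(K)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in knn_indices(K, i, k) & knn_indices(L, i, k):
            mask[i][j] = 1
    return mask


def line_kernel(points):
    """Toy similarity: negative absolute distance between 1-D points."""
    return [[-abs(a - b) for b in points] for a in points]


K = line_kernel([0.0, 1.0, 3.0, 10.0])
L = line_kernel([0.0, 10.0, 3.0, 1.0])  # same samples, different geometry

print(mutual_knn_mask(K, K, k=1))  # identical spaces: every row keeps one neighbor
print(mutual_knn_mask(K, L, k=1))  # neighborhoods disagree: mask is empty
```

The diagonal is always zero because a sample is excluded from its own neighbor set, which matches the $i \neq j$ condition in the mask definition.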