arxiv: 2510.18457 · v3 · submitted 2025-10-21 · 💻 cs.CV · cs.LG

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Pith reviewed 2026-05-18 05:18 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords latent diffusion models · vision foundation models · tokenizers · variational autoencoder · image generation · representation learning · diffusion training

The pith

Frozen vision foundation models serve as strong tokenizers for latent diffusion models when paired with a new decoder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that Vision Foundation Models can be used directly and frozen as encoders in the tokenizer for Latent Diffusion Models, avoiding the weakening effect that distillation has on their representations. A custom decoder is trained to turn the semantic-rich VFM features back into realistic images, creating the VFM-VAE. This design lets the authors examine how different tokenizer representations shape the diffusion training process and produce better alignment between the two components. The outcome is markedly faster progress toward high-quality generation, reaching a gFID of 2.22 without classifier-free guidance after only 80 epochs and 1.62 after 640 epochs.

Core claim

By freezing a Vision Foundation Model and training only a new decoder, the VFM-VAE lets semantic features from the VFM serve as the latent space for diffusion models. This bypasses the distillation step that degrades the original VFM's robustness and supports a systematic study of representation effects across diffusion training steps. The resulting dual alignment yields concrete gains: gFID without CFG drops to 2.22 in 80 epochs, a claimed 10× speedup over earlier tokenizers, and reaches 1.62 after 640 epochs.

What carries the argument

VFM-VAE tokenizer consisting of a frozen Vision Foundation Model encoder that supplies semantic features and a newly designed decoder that reconstructs images from those features.
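The division of labor described above (encoder frozen, decoder trained) can be illustrated with a deliberately tiny toy. This is a sketch under loud assumptions: the linear "encoder" `ENC_W`, the linear decoder, the 4-pixel "images", and the plain squared-error loss are all hypothetical stand-ins and are not the paper's architecture or objective. The only point shown is that gradient updates touch the decoder while the encoder weights stay byte-for-byte unchanged.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def recon_loss(enc_W, dec_W, images):
    """Mean squared reconstruction error over a set of images."""
    total = 0.0
    for x in images:
        x_hat = matvec(dec_W, matvec(enc_W, x))
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total / len(images)

def train_decoder(enc_W, dec_W, images, epochs=50, lr=0.1):
    """Per-sample SGD on the decoder only; enc_W is read but never written."""
    for _ in range(epochs):
        for x in images:
            z = matvec(enc_W, x)          # features from the frozen encoder
            x_hat = matvec(dec_W, z)      # decoder reconstruction
            residual = [a - b for a, b in zip(x_hat, x)]
            for i, r in enumerate(residual):
                for j, zj in enumerate(z):
                    # gradient of the squared error w.r.t. dec_W[i][j]
                    dec_W[i][j] -= lr * 2.0 * r * zj
    return dec_W

rng = random.Random(0)
images = [[rng.random() for _ in range(4)] for _ in range(64)]

# Hypothetical frozen encoder: averages pixel pairs into a 2-dim latent.
ENC_W = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
enc_before = [row[:] for row in ENC_W]
dec_W = [[0.0, 0.0] for _ in range(4)]    # trainable decoder, zero-initialized

loss_before = recon_loss(ENC_W, dec_W, images)
train_decoder(ENC_W, dec_W, images)
loss_after = recon_loss(ENC_W, dec_W, images)
```

After training, `ENC_W == enc_before` still holds and `loss_after` is lower than `loss_before`; in a real system the frozen part would be a pre-trained VFM and the decoder a deep network.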

If this is right

  • Latent diffusion training reaches usable image quality in far fewer epochs than with conventional tokenizers.
  • Robust semantic features from frozen VFMs improve representation learning throughout the diffusion process.
  • Dual-side alignment between tokenizer and diffusion model produces measurable gains in final generation metrics.
  • Continued training beyond the initial 80 epochs further reduces gFID without requiring changes to the tokenizer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar frozen-foundation-model tokenizers could shorten training for other generative models that rely on learned latents.
  • Computational budgets for high-quality image synthesis could drop if the approach scales to larger models and datasets.
  • Different pre-trained vision foundation models might be swapped in to tailor the tokenizer to particular image domains.

Load-bearing premise

A newly trained decoder can produce realistic images from the semantic features of an untouched, frozen vision foundation model while keeping those features robust.

What would settle it

If the same diffusion model, trained with the VFM-VAE tokenizer, fails to reach a lower gFID than it does with prior tokenizers after an equal number of epochs, the efficiency and performance claims would not hold.

Figures

Figures reproduced from arXiv: 2510.18457 by Nanning Zheng, Tianci Bi, Xiaoyi Zhang, Yan Lu.

Figure 1: Comparison of VFM-VAE and previous visual tokenizers for LDMs. (a) Distillation-based approach: VAE variants (Yao et al., 2025; Leng et al., 2025) distill advanced representations from a VFM. (b) Our VFM-VAE: directly leverages a frozen VFM as part of the VAE. (c) Combining our visual tokenizer with LDM variants leads to faster convergence and better performance.
Figure 2: Brittleness of aligned representations under semantic-preserving transformations. Specifically, CKNNA (Huh et al., 2024) values for VA-VAE (Yao et al., 2025) and SD-VAE (Rombach et al., 2022) are computed with DINOv2-Large (Oquab et al., 2023), while those for VFM-VAE are computed with SigLIP2-Large (Tschannen et al., 2025). Under semantic-preserving transformations, VFM-VAE demonstrates notably stronger…
Figure 3: Overview of VFM-VAE architecture design. The model couples a frozen VFM encoder with a multi-scale decoder to preserve semantic alignment and enable high-fidelity reconstruction.
Figure 4: CKNNA comparison across layers of the generative models. (a) Without explicit VFM alignment, the diffusion model combined with VFM-VAE achieves higher average layer-wise CKNNA than other tokenizer baselines. (b) With alignment enabled, the VFM-VAE system further improves, consistently surpassing other diffusion-aligned baselines in CKNNA.
Figure 5: Stage-wise visualization of generative model training results. Shown under a fixed random seed and identical initial noise, our approach demonstrates impressive performance and greatly accelerates image generation learning.
Figure 6: Qualitative comparison of reconstructions from different VAEs.
Figure 7: Visualization of VFM-VAE + REG (640 epochs). Generation uses CFG with w = 4.0; class label is "Border collie" (232).
Figure 8: Visualization of VFM-VAE + REG (640 epochs). Generation uses CFG with w = 4.0; class label is "Macaw" (88).
Figure 9: Visualization of VFM-VAE + REG (640 epochs). Generation uses CFG with w = 4.0; class label is "Bald Eagle" (22).
Figure 10: Visualization of VFM-VAE + REG (640 epochs). Generation uses CFG with w = 4.0; class label is "Giant Panda" (388).
Figure 11: Visualization of VFM-VAE + REG (640 epochs). Generation uses CFG with w = 4.0; class label is "Lakeside" (975).
Figure 12: Visualization of VFM-VAE + REG (640 epochs). Generation uses CFG with w = 4.0; class label is "Volcano" (980).
Original abstract

The performance of Latent Diffusion Models (LDMs) is critically dependent on the quality of their visual tokenizers. While recent works have explored incorporating Vision Foundation Models (VFMs) into tokenizer training via distillation, we empirically find that this approach inevitably weakens the robustness of the representations learnt from the original VFM. In this paper, we bypass distillation by proposing a more direct approach that leverages a frozen VFM for the LDM tokenizer, named VFM Variational Autoencoder (VFM-VAE). To fully exploit the potential of the frozen VFM as the LDM tokenizer, we design a new decoder to reconstruct realistic images from the semantic-rich representation of the VFM. With the proposed VFM-VAE, we conduct a systematic study of how the representations from different tokenizers impact the representation learning process throughout diffusion training, enabling synergistic benefits of dual-side alignment on both tokenizers and diffusion models. Our efforts in tokenizer design and training strategy lead to superior performance and efficiency: our system reaches a gFID (w/o CFG) of 2.22 in merely 80 epochs (a 10$\times$ speedup over prior tokenizers). With continued training to 640 epochs, it further attains a gFID (w/o CFG) of 1.62. These results offer solid evidence for the substantial potential of VFMs to serve as visual tokenizers that accelerate LDM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes VFM-VAE, a tokenizer for Latent Diffusion Models that uses a frozen Vision Foundation Model (e.g., SigLIP2, per the Figure 2 caption) as the encoder, paired with a newly designed decoder that reconstructs images from semantic-rich VFM features, bypassing distillation. It includes a systematic study of how different tokenizers affect diffusion training dynamics and reports strong efficiency gains: gFID (w/o CFG) of 2.22 after 80 epochs (claimed 10× speedup over prior tokenizers) and 1.62 after 640 epochs.

Significance. If the central claims hold, the work would demonstrate that pre-trained VFMs can serve directly as robust tokenizers for LDMs, yielding faster convergence and improved generation quality through dual-side alignment without weakening VFM representations. The systematic tokenizer study is a positive contribution that could inform future tokenizer design.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the gFID results (2.22 at 80 epochs, 1.62 at 640 epochs) and 10× speedup claim are presented without explicit baseline details (exact prior tokenizer architectures, training epochs, dataset splits, or run-to-run variance), making it impossible to verify the performance and efficiency assertions that are central to the paper.
  2. [§3.2] §3.2 (Decoder design): the claim that the new decoder reconstructs realistic images while fully preserving the original VFM's robustness and invariance (without any VFM updates or distillation) is load-bearing for the 'bypass distillation' argument, yet no direct supporting measurements (e.g., before/after linear probing accuracy, feature invariance tests, or downstream task performance on the frozen VFM) are reported.
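A feature-invariance test of the kind requested in major comment 2 could look roughly like the following. This is a minimal sketch, not the paper's evaluation protocol: `toy_encoder` (per-row mean pooling) is a hypothetical stand-in for a frozen VFM, horizontal flipping is just one example of a semantic-preserving transformation, and mean cosine similarity is an assumed choice of score.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def horizontal_flip(image):
    """Semantic-preserving transform: mirror each row of a 2D image."""
    return [row[::-1] for row in image]

def toy_encoder(image):
    """Hypothetical stand-in for a frozen encoder: one mean feature per row."""
    return [sum(row) / len(row) for row in image]

def invariance_score(encoder, transform, images):
    """Mean cosine similarity between features of original and transformed images."""
    sims = [
        cosine_similarity(encoder(img), encoder(transform(img)))
        for img in images
    ]
    return sum(sims) / len(sims)

images = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.0, 1.5], [2.0, 2.0, 1.0]],
]
score = invariance_score(toy_encoder, horizontal_flip, images)
```

Because row means do not change under mirroring, the toy encoder scores essentially 1.0 here; a less invariant encoder would score lower, which is the kind of gap such a test is meant to expose.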
minor comments (1)
  1. [Abstract / §4] The notation 'gFID (w/o CFG)' is used in the abstract and results but its precise computation and relation to standard FID should be defined in the main text or a dedicated section.
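As background for this comment, and with the caveat that this page does not define the notation: gFID is read here as FID computed on generated samples, which is an assumption. Standard FID compares Gaussian fits $(\mu, \Sigma)$ to Inception features of real ($r$) and generated ($g$) images, and one common form of classifier-free guidance combines conditional and unconditional noise predictions with a scale $w$. Whether the paper uses this exact convention for $w$ is not stated here.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)

\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing)
  + w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr)
```

Under this convention, "w/o CFG" corresponds to sampling with the conditional prediction alone ($w = 1$).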

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our submission. We address each of the major comments in detail below and outline the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the gFID results (2.22 at 80 epochs, 1.62 at 640 epochs) and 10× speedup claim are presented without explicit baseline details (exact prior tokenizer architectures, training epochs, dataset splits, or run-to-run variance), making it impossible to verify the performance and efficiency assertions that are central to the paper.

    Authors: We agree that additional details are required for independent verification. The baselines are taken from the cited prior works, but explicit side-by-side specifications were omitted. In the revised manuscript we will insert a new comparison table in Section 4 that lists, for each baseline tokenizer, the exact architecture, training epochs, dataset splits, and any reported run-to-run variance. This table will directly substantiate the reported gFID scores and the 10× speedup claim.
    revision: yes

  2. Referee: [§3.2] §3.2 (Decoder design): the claim that the new decoder reconstructs realistic images while fully preserving the original VFM's robustness and invariance (without any VFM updates or distillation) is load-bearing for the 'bypass distillation' argument, yet no direct supporting measurements (e.g., before/after linear probing accuracy, feature invariance tests, or downstream task performance on the frozen VFM) are reported.

    Authors: We acknowledge that direct empirical measurements would strengthen the preservation claim. Because the VFM encoder remains completely frozen, its parameters and therefore its robustness properties are unchanged by construction. Nevertheless, to provide explicit evidence we will add, in the revision, linear-probing accuracy and feature-invariance results comparing the original VFM with the VFM-VAE pipeline. These new measurements will be placed in Section 3.2 and the appendix.
    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on standard external metrics

full rationale

The paper proposes VFM-VAE by freezing a VFM encoder and training only a new decoder to produce latents for LDM training. All reported results are measured by gFID on standard image-generation benchmarks, an externally defined metric independent of any internal parameter fit or self-referential definition. No equations, self-citations, or uniqueness theorems appear in the provided text that would reduce the performance claims to a tautology or to a fitted input renamed as a prediction. The central argument is therefore an empirical design choice whose validity can be checked by replication on public datasets and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the new decoder is introduced as an architectural choice whose details are not provided.

pith-pipeline@v0.9.0 · 5793 in / 1104 out tokens · 41575 ms · 2026-05-18T05:18:21.848277+00:00 · methodology

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

    cs.CV 2026-05 unverdicted novelty 7.0

    DecQ uses detail-condensing queries on shallow and deep VFM features to improve both reconstruction PSNR and generative convergence/FID in RAEs without fine-tuning the encoder.

  2. PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    PiD is a pixel diffusion decoder that performs latent-to-pixel conversion and 4-8x upsampling in one generative step, enabling early stopping of latent diffusion and achieving sub-second 2048x2048 decoding with claime...

  3. Vision Foundation Models as Generalist Tokenizers for Image Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

  4. Improved Baselines with Representation Autoencoders

    cs.CV 2026-05 conditional novelty 6.0

    RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.

  5. What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

    cs.CV 2026-05 unverdicted novelty 6.0

    Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

  6. End-to-End Autoregressive Image Generation with 1D Semantic Tokenizer

    cs.CV 2026-05 unverdicted novelty 6.0

    An end-to-end autoregressive model with a jointly trained 1D semantic tokenizer achieves state-of-the-art FID 1.48 on ImageNet 256x256 generation without guidance.

  7. WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens

    cs.CV 2026-05 unverdicted novelty 5.0

    WinTok is a hybrid visual tokenizer that supplements pixel tokens with learnable semantic tokens distilled asymmetrically from foundation models to improve reconstruction, understanding, and generation.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 7 Pith papers · 17 internal anchors

  1. Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

    Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. arXiv preprint arXiv:2303.08797.

  2. Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. arXiv preprint arXiv:2502.13923.

  3. Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. arXiv preprint arXiv:2504.13181.

  4. BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. arXiv preprint arXiv:2505.09568.

  5. Masked Diffusion Transformer is a Strong Image Synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 23107–23116, 2023a. doi: 10.1109/ICCV51070.2023.02117.

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. MDTv2: Masked diffusion transformer is a strong...

  6. Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. arXiv preprint arXiv:2207.12598.

  7. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. arXiv preprint arXiv:2403.05135.

  8. The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. arXiv preprint arXiv:2405.07987.

  9. Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. arXiv preprint arXiv:1312.6114.

  10. EQ-VAE: Equivariance regularized latent space for improved generative image modeling

    Theodoros Kouzelis, Ioannis Kakogeorgiou, Spyros Gidaris, and Nikos Komodakis. arXiv preprint arXiv:2502.09509.

  11. REPA-E: Unlocking VAE for end-to-end tuning with latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. arXiv preprint arXiv:2504.10483, 2025.

  12. Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

    Daiqing Li, Aleks Kamko, Ehsan Akhgari, Ali Sabet, Linmiao Xu, and Suhail Doshi. arXiv preprint arXiv:2402.17245, 2024a.

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. Advances in...

  13. UniTok: A unified tokenizer for visual generation and understanding

    Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, and Xiaojuan Qi. arXiv preprint arXiv:2502.20321, 2025a.

  14. Generating images with sparse representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. arXiv preprint arXiv:2103.03841.

  15. DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. arXiv preprint arXiv:2304.07193.

  16. DINOv3

    Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. arXiv preprint arXiv:2508.10104.

  17. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. arXiv preprint arXiv:2406.06525.

  18. EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. arXiv preprint arXiv:2303.15389.

  19. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. arXiv preprint arXiv:2502.14786.

  20. Representation entanglement for generation: Training diffusion transformers is much easier than you think

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. arXiv preprint arXiv:2507.01467, 2025.

  21. Latent denoising makes good visual tokenizers

    Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. arXiv preprint arXiv:2507.15856.

  22. Vector-quantized Image Modeling with Improved VQGAN

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. arXiv preprint arXiv:2110.04627.

  23. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, et al. arXiv preprint arXiv:2310.05737.

  24. Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. arXiv preprint arXiv:2410.06940.

  25. Fast training of diffusion models with masked transformers

    Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. arXiv preprint arXiv:2306.09305.

  26. measures how well two representations preserve the same local neighborhood structures. Given kernel similarity matrices $K, L \in \mathbb{R}^{n \times n}$ from two representations of the same $n$ samples, we construct a binary mask $A_{ij} = \mathbb{1}\{i \neq j,\ j \in \mathrm{knn}^{K}_{k}(i) \wedge j \in \mathrm{knn}^{L}_{k}(i)\}$ to retain only the pairs that are common $k$-nearest neighbors in both spaces. The masked kernels are defined as K ...

  27. upsampling unit with a pure PyTorch implementation for better readability and extensibility. The module normalizes input features via GroupNorm, performs 3×3 depthwise and 1×1 pointwise convolutions for local extraction and channel mixing, upsamples with PixelShuffle, and finally applies a fixed Gaussian blur to suppress checkerboard artifacts. It serve...

  28. to the VAE encoder latent; during inference, this latent is decoded by the VAE decoder to generate the final image. The visual–language backbone is Qwen2.5-VL-3B-Instruct (Bai et al., 2025), and the diffusion backbone is Lumina-Next (DiT) (Zhuo et al., 2024), where the patch size is reduced to 1 and the input/output channels are aligned to a latent of 16×32×...

  29. Our multi-stage training strategy follows the general structure of VA-VAE (Yao et al., 2025). In the strong alignment stage, large representation regularization losses are applied to quickly establish VFM–VAE alignment. In the weak alignment stage, the weight of this loss is reduced to maintain alignment while shifting focus toward reconstruction quali...
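The mutual-neighbor mask described in the CKNNA fragment above (entry 26) can be sketched in a few lines. This covers only the mask $A_{ij}$; the rest of the CKNNA computation is truncated in this excerpt and is not reproduced. The function names and the two small similarity matrices are hypothetical, chosen only for illustration.

```python
def knn_indices(kernel, i, k):
    """Indices of the k samples most similar to sample i (excluding i itself)."""
    others = [j for j in range(len(kernel)) if j != i]
    others.sort(key=lambda j: kernel[i][j], reverse=True)
    return set(others[:k])

def mutual_knn_mask(K, L, k):
    """A[i][j] = 1 iff i != j and j is a k-nearest neighbor of i under both K and L."""
    n = len(K)
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        shared = knn_indices(K, i, k) & knn_indices(L, i, k)
        for j in shared:
            mask[i][j] = 1
    return mask

# Two hypothetical kernel (similarity) matrices over the same 4 samples.
K = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
L = [
    [1.0, 0.7, 0.1, 0.3],
    [0.7, 1.0, 0.6, 0.2],
    [0.1, 0.6, 1.0, 0.4],
    [0.3, 0.2, 0.4, 1.0],
]
mask = mutual_knn_mask(K, L, k=1)
```

With `k=1`, samples 0, 1, and 3 have the same nearest neighbor in both spaces, while sample 2 does not (sample 3 under `K`, sample 1 under `L`), so row 2 of the mask is all zeros.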