
arxiv: 2605.22777 · v1 · pith:HUF46PRFnew · submitted 2026-05-21 · 💻 cs.CV

DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Pith reviewed 2026-05-22 05:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords representation autoencoders · detail-condensing queries · frozen vision foundation models · reconstruction-generation trade-off · latent diffusion models · image reconstruction · generative modeling

The pith

Detail-condensing queries mitigate the reconstruction-generation trade-off in representation autoencoders that use frozen vision foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Representation autoencoders rely on frozen vision foundation models to supply robust semantic representations that speed up convergence and improve quality in latent diffusion models. Freezing the model limits its ability to capture fine spatial details needed for accurate reconstruction, while any fine-tuning to add reconstruction signals tends to damage the semantic space and lower generative performance. The paper proposes adding a small number of lightweight detail-condensing queries that pull fine-grained information from intermediate layers of the frozen model through dedicated condenser modules. These queries feed into the decoder for better reconstruction and are produced jointly with the usual patch tokens during generation. The result is simultaneous gains in both reconstruction and generation with only minimal added computation.

Core claim

DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance.

What carries the argument

Lightweight detail-condensing queries that extract and aggregate fine-grained details from shallow and deep layers of the frozen vision foundation model to support reconstruction while preserving the pretrained semantic space.
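The mechanism described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes single-head attention with no learned projections, no layer normalization, and no FFN (the paper's condenser does include an FFN), and it uses toy sizes. The names `cross_attention`, `condense`, and `tapped_layers` are hypothetical. The point it shows is that the queries read from the frozen features while the features themselves are never modified.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, features):
    """Single-head attention: K query tokens read from N frozen feature tokens."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    keys_t = list(zip(*features))                 # C x N, transpose of features
    scores = matmul(queries, keys_t)              # K x N
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, features)              # K x C

def condense(queries, tapped_layers):
    """Fold each tapped layer into the queries with a residual update (ASSUMED form)."""
    for features in tapped_layers:
        update = cross_attention(queries, features)
        queries = [[q + u for q, u in zip(q_row, u_row)]
                   for q_row, u_row in zip(queries, update)]
    return queries

# Toy sizes for illustration only; the paper reports 8 queries and 256 patch tokens.
rng = random.Random(0)
K, N, C, LAYERS = 8, 16, 4, 3
q0 = [[rng.gauss(0, 1) for _ in range(C)] for _ in range(K)]
taps = [[[rng.gauss(0, 1) for _ in range(C)] for _ in range(N)] for _ in range(LAYERS)]
detail_queries = condense(q0, taps)
```

Because the update only writes to the query tokens, the frozen model's patch representations pass through unchanged, which is the property the paper relies on to keep the pretrained semantic space intact.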

If this is right

  • Reconstruction PSNR rises from 19.13 dB to 22.76 dB over the frozen DINOv2-based RAE with only 8 additional queries and 3.9% extra computation.
  • Generative modeling converges 3.3 times faster than RAE and reaches an FID of 1.41 without guidance and 1.05 with guidance.
  • Latent diffusion models gain improved support for fine-grained generation and image editing tasks.
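To put the reported reconstruction gain in perspective, the following minimal sketch converts a PSNR difference into the implied ratio of mean squared errors, using the standard definition PSNR = 10·log10(MAX²/MSE). It assumes both PSNR values share the same peak value and treats each dataset-level PSNR as if it came from a single MSE, which is only approximate when PSNR is averaged per image. The function name is illustrative and does not come from the paper.

```python
import math

def mse_ratio_from_psnr(psnr_before_db, psnr_after_db):
    """Return MSE_before / MSE_after implied by two PSNR values with the same peak."""
    gain_db = psnr_after_db - psnr_before_db
    return 10 ** (gain_db / 10)

# PSNR values quoted in the abstract for the frozen RAE baseline and DecQ.
gain_db = 22.76 - 19.13
ratio = mse_ratio_from_psnr(19.13, 22.76)
print(f"PSNR gain: {gain_db:.2f} dB")
print(f"Implied MSE reduction factor: {ratio:.2f}")
```

Under these assumptions the 3.63 dB gain corresponds to roughly a 2.3-fold reduction in mean squared error.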

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The query approach could be applied to other frozen foundation models to test whether similar trade-off mitigation occurs beyond the tested encoder.
  • Joint generation of the queries alongside patch tokens may allow extra conditioning signals for tasks such as targeted image editing.
  • The low overhead suggests the method could be combined with larger-scale diffusion training to reduce overall compute requirements.

Load-bearing premise

Lightweight detail-condensing queries can be added to the decoder and jointly generated with patch tokens while preserving the pretrained semantic space of the frozen VFM without causing the degradation seen when reconstruction signals are introduced through fine-tuning.
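The joint generation this premise depends on amounts to treating patch tokens and query tokens as one latent sequence. The sketch below illustrates only that token bookkeeping and is not the authors' code: the denoiser is omitted entirely, the patch-first ordering and the token width of 768 are assumptions, and the helper names are hypothetical. The counts of 256 patch tokens and 8 queries are taken from the paper's reported setup.

```python
import random

N_PATCH = 256   # patch tokens per image, as reported in the overhead analysis
K_QUERY = 8     # detail-condensing queries
WIDTH = 768     # ASSUMED token width, matching the ViT-B feature dimension

def sample_joint_noise(n_patch, k_query, width, rng):
    """Draw one Gaussian latent sequence covering patch and query tokens."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)]
            for _ in range(n_patch + k_query)]

def split_joint(sequence, n_patch):
    """Split a joint sequence back into (patch_tokens, query_tokens)."""
    return sequence[:n_patch], sequence[n_patch:]

rng = random.Random(0)
z = sample_joint_noise(N_PATCH, K_QUERY, WIDTH, rng)
# A real pipeline would run the diffusion denoiser over z here; omitted in this sketch.
patch_tokens, query_tokens = split_joint(z, N_PATCH)
token_overhead = K_QUERY / N_PATCH
print(f"sequence length: {len(z)}, extra tokens: {token_overhead:.2%}")
```

Adding 8 tokens to a 256-token sequence lengthens it by about 3%, which is consistent in scale with the small compute overhead the paper reports, although the paper's own figure is computed differently.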

What would settle it

An experiment in which adding the queries produces no increase in reconstruction PSNR or no reduction in generative convergence time relative to the frozen baseline would show the claimed mitigation of the trade-off does not occur.

Figures

Figures reproduced from arXiv: 2605.22777 by Jiaqi Wang, Min Li, Tianhang Wang, Wei Song, Yitong Chen, Zuxuan Wu.

Figure 1: (Left) An empirical study of different VFM-based image tokenizer paradigms based on DINOv2. VFM-freeze, corresponding to the RAE baseline, keeps the VFM encoder frozen and directly uses its representations for reconstruction. VFM-finetune denotes directly fine-tuning the VFM encoder, VFM-distill uses a frozen VFM copy to distill the encoder outputs, and VFM-feat-concat keeps the VFM frozen while concatenat…
Figure 2: Overview of the DecQ architecture. Given an input image, the frozen VFM first converts it into patch tokens and processes them through a stack of Transformer blocks. DecQ attaches learnable queries to multiple intermediate VFM layers and uses condenser modules to progressively aggregate multi-level features into detail-condensing queries. These queries are then fed into the ViT decoder together with the VF…
Figure 3: Architecture of the condenser.
Encoder with Condensers. We introduce $K$ learnable query tokens $Q^{(0)} \in \mathbb{R}^{K \times C}$ alongside the frozen VFM backbone, where $C$ is the feature dimension of the patch tokens. In practice, $K \ll N$, so the query tokens provide a compact representation for complementary fine-grained information. To aggregate multi-level features without modifying the pretrained VFM representations, we atta…
Figure 4: Image generation with detail-condensing queries. Compared with the RAE baseline that denoises and decodes only VFM patch tokens, DecQ jointly denoises detail-condensing queries with patch tokens during diffusion. Both patch and query tokens are initialized from Gaussian noise and generated as a unified latent sequence, and are then jointly decoded into the output image.
… an FFN. In the cross-attention block…
Figure 5: Convergence of the proposed DecQ compared with REPA [21] and RAE [4].
Figure 6: Images clustered using different token representations.
Generalization across different VFMs. To evaluate generality, we conduct analogous experiments with SigLIP2-B, as reported in Tab. 7. Consistent with DINOv2, reconstruction with a frozen SigLIP2 is limited, while introducing detail-condensing queries substantially improves pixel-level metrics. Although SigLIP2 shows lower reconstruction and generati…
Figure 7: More results on cluster visualization. Similar to …
Figure 8: More qualitative results of our image reconstruction compared with RAE based on DINOv2. The cases share the same setting as in …
Figure 9: Qualitative results of our image reconstruction compared with RAE based on SigLIP2. In some cases, SigLIP2 appears to retain a semantic impression of textual content but fails to accurately reproduce its colors. In contrast, DecQ better preserves these fine-grained details.
Figure 10: Qualitative results of our image generation.
C.1 Tokenizer and Reconstruction Overhead
Baseline. The frozen ViT-B/14 encoder processes $N=256$ tokens through 12 Transformer layers ($d=768$, $d_{\text{ff}}=3072$), resulting in 22.2 GFLOPs. The ViT-MAE XL decoder processes $N_{\text{dec}}=257$ tokens through 28 layers ($d_{\text{dec}}=1152$, $d_{\text{ff}}=4096$), costing 106.7 GFLOPs. This gives a total baseline cost of 128.9 GFLOPs per image, with 501.9M a…
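The baseline cost quoted in the overhead excerpt can be combined with the 3.9% figure from the abstract to estimate the absolute added compute. This is a back-of-the-envelope sketch that assumes the 3.9% is measured in FLOPs relative to the 128.9 GFLOPs tokenizer baseline (encoder plus decoder); the excerpt is truncated before it states this, so the result is an estimate and not a number from the paper.

```python
ENCODER_GFLOPS = 22.2    # frozen ViT-B/14 encoder, from the overhead excerpt
DECODER_GFLOPS = 106.7   # ViT-MAE XL decoder, from the overhead excerpt
EXTRA_FRACTION = 0.039   # "3.9% extra computation"; ASSUMED to be relative FLOPs

baseline_gflops = ENCODER_GFLOPS + DECODER_GFLOPS
extra_gflops = baseline_gflops * EXTRA_FRACTION
total_gflops = baseline_gflops + extra_gflops

print(f"baseline: {baseline_gflops:.1f} GFLOPs per image")
print(f"estimated extra: {extra_gflops:.2f} GFLOPs")
print(f"estimated total: {total_gflops:.1f} GFLOPs")
```

Under this reading the queries and condensers add on the order of 5 GFLOPs per image, small next to the decoder, which accounts for most of the baseline cost.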
Original abstract

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction-generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DecQ for Representation Autoencoders (RAEs) that use frozen vision foundation models (VFMs) as encoders. It introduces lightweight detail-condensing queries extracted via condenser modules from intermediate VFM layers; these queries augment the decoder for reconstruction and are jointly generated with patch tokens in latent diffusion models. The central claim is that aggregating shallow and deep layer information mitigates the reconstruction-generation trade-off without fine-tuning the VFM. Experiments report a PSNR rise from 19.13 dB to 22.76 dB using 8 queries (3.9% extra compute), 3.3× faster convergence, and improved FID scores (1.41 without guidance, 1.05 with guidance).

Significance. If the experimental claims hold under fuller controls, DecQ provides a low-overhead way to improve both reconstruction fidelity and generative quality in RAEs while preserving the frozen VFM's semantic space. This could meaningfully advance latent diffusion pipelines for high-resolution synthesis and editing by avoiding the semantic degradation typically induced by reconstruction fine-tuning. The reported minimal parameter overhead and concrete metric gains are strengths that, if reproducible, would make the method practically attractive.

major comments (2)
  1. Abstract: The reported PSNR gain (19.13 dB to 22.76 dB) and FID improvements are presented without error bars, standard deviations across runs, or dataset specifications; this directly affects the load-bearing claim that DecQ consistently mitigates the reconstruction-generation trade-off.
  2. Abstract and §4 (Experiments): The generative results cite 3.3× faster convergence and FID 1.41/1.05 but provide no ablation on query count, layer selection, or comparison against recent RAE variants; without these controls it is unclear whether the joint generation of queries with patch tokens is the operative factor or whether the gains could arise from other architectural choices.
minor comments (2)
  1. Notation: The term 'detail-condensing queries' is introduced without a precise mathematical definition or diagram showing how they are concatenated with patch tokens before the decoder; a small equation or figure would clarify the integration.
  2. Abstract: The phrase '3.9% extra computation' should specify the exact metric (FLOPs, wall-clock time, or parameter count) and the baseline model size for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the changes we will make to strengthen the presentation of our results.

  1. Referee: Abstract: The reported PSNR gain (19.13 dB to 22.76 dB) and FID improvements are presented without error bars, standard deviations across runs, or dataset specifications; this directly affects the load-bearing claim that DecQ consistently mitigates the reconstruction-generation trade-off.

    Authors: We agree that the abstract would benefit from explicit dataset references and measures of result variability to better support the central claim. The reported numbers come from our primary experimental setup on standard image datasets (detailed in Section 4), but these are not restated in the abstract and no error bars are shown. In the revised manuscript we will update the abstract to name the evaluation datasets and add error bars computed from multiple independent runs for both PSNR and FID metrics. revision: yes

  2. Referee: Abstract and §4 (Experiments): The generative results cite 3.3× faster convergence and FID 1.41/1.05 but provide no ablation on query count, layer selection, or comparison against recent RAE variants; without these controls it is unclear whether the joint generation of queries with patch tokens is the operative factor or whether the gains could arise from other architectural choices.

    Authors: The referee is correct that the current manuscript does not include systematic ablations on query count or layer selection, nor direct comparisons against the most recent RAE variants beyond the base frozen RAE. While the main results demonstrate the overall benefit of DecQ, these additional controls would help isolate the contribution of jointly generating the detail-condensing queries. We will expand Section 4 with the requested ablations (varying query count and layer choices) and include comparisons to additional recent RAE methods in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation


The paper introduces DecQ as an architectural addition of lightweight detail-condensing queries that aggregate shallow and deep VFM features for a frozen encoder. Central claims of mitigating the reconstruction-generation trade-off are supported directly by reported experimental metrics (PSNR increase from 19.13 dB to 22.76 dB with 8 queries, 3.3× faster convergence, FID scores of 1.41/1.05). No equations, fitted parameters renamed as predictions, or self-citation chains are used to derive the performance gains; the method is described as a simple extension whose benefits are measured against external benchmarks like DINOv2-based RAE. This constitutes a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the standard domain assumption that frozen VFMs already supply useful high-level representations, plus the introduction of a new query module whose effectiveness is demonstrated only empirically.

free parameters (1)
  • number of queries = 8
    The paper selects eight queries as the operating point that yields the reported gains with modest overhead.
axioms (1)
  • domain assumption: Frozen vision foundation models provide robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models.
    Explicitly stated in the opening of the abstract as the foundation for RAEs.
invented entities (1)
  • detail-condensing queries (no independent evidence)
    purpose: Extract fine-grained information from intermediate VFM features via condenser modules for use in both reconstruction and generation.
    New component introduced by the paper to resolve the stated trade-off.

pith-pipeline@v0.9.0 · 5794 in / 1410 out tokens · 48798 ms · 2026-05-22T05:58:09.388833+00:00 · methodology



Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. [1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

  2. [2] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  3. [3] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024.

  4. [4] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In ICLR, 2026.

  5. [5] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024.

  6. [6] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  7. [7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.

  8. [8] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.

  9. [9] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

  10. [10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

  11. [11] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.

  12. [12] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.

  13. [13] Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025.

  14. [14] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. In ICLR,

  15. [15] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. In ICLR, 2026.

  16. [16] Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. In ICLR, 2026.

  17. [17] Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, et al. Rpiae: A representation-pivoted autoencoder enhancing both image generation and editing. arXiv preprint arXiv:2603.19206, 2026.

  18. [18] Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, and Chongyi Li. Improving reconstruction of representation autoencoder. arXiv preprint arXiv:2602.08620, 2026.

  19. [19] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

  20. [20] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024.

  21. [21] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025.

  22. [22] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? In ICLR, 2026.

  23. [23] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. In ICCV, 2025.

  24. [24] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think. In NeurIPS, 2026.

  25. [25] Yitong Chen, Zuxuan Wu, Xipeng Qiu, and Yu-Gang Jiang. Catok: Taming mean flows for one-dimensional causal image tokenization. In CVPR, 2026.

  26. [26] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.

  27. [27] Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, and Han Hu. Distribution matching variational autoencoder. arXiv preprint arXiv:2512.07778, 2025.

  28. [28] Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, and Shuhang Gu. Taming sampling perturbations with variance expansion loss for latent diffusion models. arXiv preprint arXiv:2603.21085, 2026.

  29. [29] Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025.

  30. [30] Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, and Hanwang Zhang. Generative multimodal pretraining with discrete diffusion timestep tokens. In CVPR, 2025.

  31. [31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

  32. [32] Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2025.

  33. [33] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

  34. [34] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In ICML, 2023.

  35. [35] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR,

  36. [36] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

  37. [37] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

  38. [38] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS,

  39. [39] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.

  40. [40] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. TMLR, 2023.

  41. [41] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.

  42. [42] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. In NeurIPS, 2025.

A Implementation Details
A.1 DecQ Implementation
We follow the training scheme of RAE. For the encoder, we use DINOv2 with Registers [41] to process images resized to 224×224, producin...