DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders

Jiaqi Wang; Min Li; Tianhang Wang; Wei Song; Yitong Chen; Zuxuan Wu

arxiv: 2605.22777 · v1 · pith:HUF46PRFnew · submitted 2026-05-21 · 💻 cs.CV

Tianhang Wang, Yitong Chen, Wei Song, Zuxuan Wu, Min Li, Jiaqi Wang

Pith reviewed 2026-05-22 05:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords: representation autoencoders · detail-condensing queries · frozen vision foundation models · reconstruction-generation trade-off · latent diffusion models · image reconstruction · generative modeling

The pith

Detail-condensing queries resolve the reconstruction-generation trade-off in representation autoencoders that use frozen vision foundation models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Representation autoencoders rely on frozen vision foundation models to supply robust semantic representations that speed up convergence and improve quality in latent diffusion models. Freezing the model limits its ability to capture fine spatial details needed for accurate reconstruction, while any fine-tuning to add reconstruction signals tends to damage the semantic space and lower generative performance. The paper proposes adding a small number of lightweight detail-condensing queries that pull fine-grained information from intermediate layers of the frozen model through dedicated condenser modules. These queries feed into the decoder for better reconstruction and are produced jointly with the usual patch tokens during generation. The result is simultaneous gains in both reconstruction and generation with only minimal added computation.

Core claim

DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction–generation trade-off, improving both reconstruction quality and generative performance.

What carries the argument

Lightweight detail-condensing queries that extract and aggregate fine-grained details from shallow and deep layers of the frozen vision foundation model to support reconstruction while preserving the pretrained semantic space.

If this is right

Reconstruction PSNR rises from 19.13 dB to 22.76 dB with only 8 queries and 3.9 percent extra computation.
Generative modeling converges 3.3 times faster than the RAE baseline and reaches an FID of 1.41 without guidance and 1.05 with guidance.
Latent diffusion models gain improved support for fine-grained generation and image editing tasks.
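
To put the reported PSNR gain in perspective, the following minimal sketch converts a dB difference into the implied ratio of mean squared errors, using the standard definition PSNR = 10·log10(MAX² / MSE). The function names are illustrative and do not come from the paper; only the two PSNR values (19.13 dB and 22.76 dB) are taken from the source.

```python
import math

def psnr(mse: float, max_val: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_from_psnr_gain(psnr_before: float, psnr_after: float) -> float:
    """Factor by which MSE shrinks when PSNR rises from psnr_before to psnr_after."""
    return 10.0 ** ((psnr_after - psnr_before) / 10.0)

# PSNR values reported for the frozen RAE baseline and for DecQ.
ratio = mse_ratio_from_psnr_gain(19.13, 22.76)
print(f"MSE reduced by a factor of about {ratio:.2f}")
```

Under this definition, the 3.63 dB gain corresponds to a mean squared error roughly 2.3 times lower than the baseline's.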

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The query approach could be applied to other frozen foundation models to test whether similar trade-off mitigation occurs beyond the tested encoder.
Joint generation of the queries alongside patch tokens may allow extra conditioning signals for tasks such as targeted image editing.
The low overhead suggests the method could be combined with larger-scale diffusion training to reduce overall compute requirements.

Load-bearing premise

Lightweight detail-condensing queries can be added to the decoder and jointly generated with patch tokens while preserving the pretrained semantic space of the frozen VFM without causing the degradation seen when reconstruction signals are introduced through fine-tuning.

What would settle it

An experiment in which adding the queries produces no increase in reconstruction PSNR or no reduction in generative convergence time relative to the frozen baseline would show the claimed mitigation of the trade-off does not occur.

Figures

Figures reproduced from arXiv: 2605.22777 by Jiaqi Wang, Min Li, Tianhang Wang, Wei Song, Yitong Chen, Zuxuan Wu.

**Figure 1.** (Left) An empirical study of different VFM-based image tokenizer paradigms based on DINOv2. VFM-freeze, corresponding to the RAE baseline, keeps the VFM encoder frozen and directly uses its representations for reconstruction. VFM-finetune denotes directly fine-tuning the VFM encoder, VFM-distill uses a frozen VFM copy to distill the encoder outputs, and VFM-featconcat keeps the VFM frozen while concatenat…

**Figure 2.** Overview of the DecQ architecture. Given an input image, the frozen VFM first converts it into patch tokens and processes them through a stack of Transformer blocks. DecQ attaches learnable queries to multiple intermediate VFM layers and uses condenser modules to progressively aggregate multi-level features into detail-condensing queries. These queries are then fed into the ViT decoder together with the VF…

**Figure 3.** Architecture of the condenser.

Encoder with Condensers. We introduce $K$ learnable query tokens $Q^{(0)} \in \mathbb{R}^{K \times C}$ alongside the frozen VFM backbone, where $C$ is the feature dimension of the patch tokens. In practice, $K \ll N$, so the query tokens provide a compact representation for complementary fine-grained information. To aggregate multi-level features without modifying the pretrained VFM representations, we atta…
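
The idea of K query tokens reading from frozen patch features can be sketched in plain Python. This is a simplified illustration, not the paper's condenser: it assumes single-head cross-attention with no learned projections, no normalization, and no FFN, and the names `cross_attend` and `condense` are hypothetical. The four feature lists stand in for the intermediate VFM layers (the page elsewhere quotes layers 0, 3, 6, and 9); here they are random placeholders.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, feats):
    """Each query (length C) attends over feats (N vectors of length C).

    The features are only read, never modified, mirroring a frozen backbone.
    Returns the queries updated with a residual connection.
    """
    C = len(queries[0])
    scale = 1.0 / math.sqrt(C)
    updated = []
    for q in queries:
        scores = [scale * sum(qi * fi for qi, fi in zip(q, f)) for f in feats]
        weights = softmax(scores)
        pooled = [sum(weights[n] * feats[n][c] for n in range(len(feats)))
                  for c in range(C)]
        updated.append([qi + pi for qi, pi in zip(q, pooled)])
    return updated

def condense(queries, layer_feats):
    """Progressively aggregate features from several layers into the queries."""
    for feats in layer_feats:
        queries = cross_attend(queries, feats)
    return queries

random.seed(0)
K, N, C = 2, 6, 4  # toy sizes; the paper uses K << N
q0 = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]
layers = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
          for _ in range(4)]  # four placeholder "intermediate layers"
dq = condense(q0, layers)
print(len(dq), len(dq[0]))
```

The output keeps the shape K × C regardless of N, which is what makes the queries a compact side channel next to the N patch tokens.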

**Figure 4.** Image generation with detail-condensing queries. Compared with the RAE baseline that denoises and decodes only VFM patch tokens, DecQ jointly denoises detail-condensing queries with patch tokens during diffusion. Both patch and query tokens are initialized from Gaussian noise and generated as a unified latent sequence, and are then jointly decoded into the output image.

… an FFN. In the cross-attention block…
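
The unified latent sequence described in this caption can be illustrated with a toy sketch. It assumes the patch tokens come first and the query tokens are appended after them (the ordering is not stated in this excerpt), and `toy_denoise_step` is a hypothetical placeholder rather than a diffusion model. N = 256 and K = 8 follow the figures quoted on this page; the channel width is shrunk for brevity.

```python
import random

def init_joint_latent(n_patch, n_query, dim, rng):
    """Draw Gaussian noise for patch and query tokens as one sequence."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_patch + n_query)]

def toy_denoise_step(seq, step_size=0.1):
    """Placeholder update applied to every token alike (NOT a real denoiser)."""
    return [[(1.0 - step_size) * x for x in tok] for tok in seq]

def split_joint_latent(seq, n_patch):
    """Separate the sequence back into patch tokens and query tokens."""
    return seq[:n_patch], seq[n_patch:]

rng = random.Random(0)
N, K, C = 256, 8, 4  # C is kept tiny here; the real feature width is larger
z = init_joint_latent(N, K, C, rng)
for _ in range(3):
    z = toy_denoise_step(z)  # patch and query tokens are updated together
patch_tokens, query_tokens = split_joint_latent(z, N)
print(len(patch_tokens), len(query_tokens))
```

Treating both token types as one sequence means the generator needs no separate branch for the queries; the decoder then receives both parts.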

**Figure 5.** Convergence of the proposed DecQ compared with REPA [21] and RAE [4].

**Figure 6.** Images clustered using different token representations.

Generalization across different VFMs. To evaluate generality, we conduct analogous experiments with SigLIP2-B, as reported in Tab. 7. Consistent with DINOv2, reconstruction with a frozen SigLIP2 is limited, while introducing detail-condensing queries substantially improves pixel-level metrics. Although SigLIP2 shows lower reconstruction and generati…

**Figure 7.** More results on cluster visualization. Similar to …

**Figure 8.** More qualitative results of our image reconstruction compared with RAE based on DINOv2. The cases share the same setting as in …

**Figure 9.** Qualitative results of our image reconstruction compared with RAE based on SigLIP2. In some cases, SigLIP2 appears to retain a semantic impression of textual content but fails to accurately reproduce its colors. In contrast, DecQ better preserves these fine-grained details.

**Figure 10.** Qualitative results of our image generation.

C.1 Tokenizer and Reconstruction Overhead

Baseline. The frozen ViT-B/14 encoder processes $N=256$ tokens through 12 Transformer layers ($d=768$, $d_{\text{ff}}=3072$), resulting in 22.2 GFLOPs. The ViT-MAE XL decoder processes $N_{\text{dec}}=257$ tokens through 28 layers ($d_{\text{dec}}=1152$, $d_{\text{ff}}=4096$), costing 106.7 GFLOPs. This gives a total baseline cost of 128.9 GFLOPs per image, with 501.9M a…
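
The baseline arithmetic quoted here can be checked in a few lines. The sketch also estimates what 3.9% extra computation would amount to, under the assumption (not confirmed by this truncated excerpt) that the percentage is measured against the 128.9 GFLOPs tokenizer baseline. Variable names are illustrative.

```python
# Costs quoted in the excerpt, in GFLOPs per image.
ENCODER_GFLOPS = 22.2   # frozen ViT-B/14 encoder
DECODER_GFLOPS = 106.7  # ViT-MAE XL decoder

baseline = ENCODER_GFLOPS + DECODER_GFLOPS

# ASSUMPTION: the reported 3.9% overhead is relative to this baseline total.
extra_fraction = 0.039
extra = baseline * extra_fraction

print(f"baseline: {baseline:.1f} GFLOPs")
print(f"implied extra cost: {extra:.2f} GFLOPs")
```

If that assumption holds, the queries and condensers together would add about 5 GFLOPs per image.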

Original abstract

Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during generative modeling. By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction–generation trade-off, improving both reconstruction quality and generative performance. Our experiments demonstrate that: (1) with only 8 additional queries and 3.9% extra computation, DecQ improves reconstruction over the frozen DINOv2-based RAE, increasing PSNR from 19.13 dB to 22.76 dB; and (2) for generative modeling, DecQ achieves 3.3× faster convergence than RAE, attaining an FID of 1.41 without guidance and 1.05 with guidance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DecQ for Representation Autoencoders (RAEs) that use frozen vision foundation models (VFMs) as encoders. It introduces lightweight detail-condensing queries extracted via condenser modules from intermediate VFM layers; these queries augment the decoder for reconstruction and are jointly generated with patch tokens in latent diffusion models. The central claim is that aggregating shallow and deep layer information mitigates the reconstruction-generation trade-off without fine-tuning the VFM. Experiments report a PSNR rise from 19.13 dB to 22.76 dB using 8 queries (3.9% extra compute), 3.3× faster convergence, and improved FID scores (1.41 without guidance, 1.05 with guidance).

Significance. If the experimental claims hold under fuller controls, DecQ provides a low-overhead way to improve both reconstruction fidelity and generative quality in RAEs while preserving the frozen VFM's semantic space. This could meaningfully advance latent diffusion pipelines for high-resolution synthesis and editing by avoiding the semantic degradation typically induced by reconstruction fine-tuning. The reported minimal parameter overhead and concrete metric gains are strengths that, if reproducible, would make the method practically attractive.

major comments (2)

Abstract: The reported PSNR gain (19.13 dB to 22.76 dB) and FID improvements are presented without error bars, standard deviations across runs, or dataset specifications; this directly affects the load-bearing claim that DecQ consistently mitigates the reconstruction-generation trade-off.
Abstract and §4 (Experiments): The generative results cite 3.3× faster convergence and FID 1.41/1.05 but provide no ablation on query count, layer selection, or comparison against recent RAE variants; without these controls it is unclear whether the joint generation of queries with patch tokens is the operative factor or whether the gains could arise from other architectural choices.

minor comments (2)

Notation: The term 'detail-condensing queries' is introduced without a precise mathematical definition or diagram showing how they are concatenated with patch tokens before the decoder; a small equation or figure would clarify the integration.
Abstract: The phrase '3.9% extra computation' should specify the exact metric (FLOPs, wall-clock time, or parameter count) and the baseline model size for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the changes we will make to strengthen the presentation of our results.

Point-by-point responses

Referee: Abstract: The reported PSNR gain (19.13 dB to 22.76 dB) and FID improvements are presented without error bars, standard deviations across runs, or dataset specifications; this directly affects the load-bearing claim that DecQ consistently mitigates the reconstruction-generation trade-off.

Authors: We agree that the abstract would benefit from explicit dataset references and measures of result variability to better support the central claim. The reported numbers come from our primary experimental setup on standard image datasets (detailed in Section 4), but these are not restated in the abstract and no error bars are shown. In the revised manuscript we will update the abstract to name the evaluation datasets and add error bars computed from multiple independent runs for both PSNR and FID metrics.
revision: yes

Referee: Abstract and §4 (Experiments): The generative results cite 3.3× faster convergence and FID 1.41/1.05 but provide no ablation on query count, layer selection, or comparison against recent RAE variants; without these controls it is unclear whether the joint generation of queries with patch tokens is the operative factor or whether the gains could arise from other architectural choices.

Authors: The referee is correct that the current manuscript does not include systematic ablations on query count or layer selection, nor direct comparisons against the most recent RAE variants beyond the base frozen RAE. While the main results demonstrate the overall benefit of DecQ, these additional controls would help isolate the contribution of jointly generating the detail-condensing queries. We will expand Section 4 with the requested ablations (varying query count and layer choices) and include comparisons to additional recent RAE methods in the revised version.
revision: yes
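
The error bars discussed in the first response are typically reported as a mean and sample standard deviation over independent runs. A minimal sketch follows; the helper name `summarize_runs` is hypothetical, and the input values are synthetic placeholders, not results from the paper.

```python
import statistics

def summarize_runs(values):
    """Return (mean, sample standard deviation) across independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(values), statistics.stdev(values)

# Synthetic placeholder values, NOT measurements from the paper.
placeholder_scores = [1.0, 2.0, 3.0]
mean, std = summarize_runs(placeholder_scores)
print(f"{mean:.2f} ± {std:.2f}")
```

The same helper would apply to PSNR or FID once per-run values are available.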

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation

Full rationale

The paper introduces DecQ as an architectural addition of lightweight detail-condensing queries that aggregate shallow and deep VFM features for a frozen encoder. Central claims of mitigating the reconstruction-generation trade-off are supported directly by reported experimental metrics (PSNR increase from 19.13 dB to 22.76 dB with 8 queries, 3.3× faster convergence, FID scores of 1.41/1.05). No equations, fitted parameters renamed as predictions, or self-citation chains are used to derive the performance gains; the method is described as a simple extension whose benefits are measured against external benchmarks like DINOv2-based RAE. This constitutes a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the standard domain assumption that frozen VFMs already supply useful high-level representations, plus the introduction of a new query module whose effectiveness is demonstrated only empirically.

free parameters (1)

number of queries = 8
The paper selects eight queries as the operating point that yields the reported gains with modest overhead.

axioms (1)

domain assumption: Frozen vision foundation models provide robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models.
Explicitly stated in the opening of the abstract as the foundation for RAEs.

invented entities (1)

detail-condensing queries (no independent evidence)
purpose: Extract fine-grained information from intermediate VFM features via condenser modules for use in both reconstruction and generation.
New component introduced by the paper to resolve the stated trade-off.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules... By aggregating information from both shallow and deep layers, DecQ effectively mitigates the reconstruction–generation trade-off

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: condensers attached to VFM layers 0, 3, 6, and 9... shallow layers favor reconstruction, while deeper layers benefit generation

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

[1] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[2] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.
[3] Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers. arXiv preprint arXiv:2410.10629, 2024.
[4] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In ICLR, 2026.
[5] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. TMLR, 2024.
[6] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[7] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[8] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
[9] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.
[10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
[11] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
[12] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[13] Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025.
[14] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. In ICLR,
[15] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. In ICLR, 2026.
[16] Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. In ICLR, 2026.
[17] Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, et al. Rpiae: A representation-pivoted autoencoder enhancing both image generation and editing. arXiv preprint arXiv:2603.19206, 2026.
[18] Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, and Chongyi Li. Improving reconstruction of representation autoencoder. arXiv preprint arXiv:2602.08620, 2026.
[19] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.
[20] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024.
[21] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025.
[22] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? In ICLR, 2026.
[23] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. In ICCV, 2025.
[24] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think. In NeurIPS, 2026.
[25] Yitong Chen, Zuxuan Wu, Xipeng Qiu, and Yu-Gang Jiang. Catok: Taming mean flows for one-dimensional causal image tokenization. In CVPR, 2026.
[26] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.
[27] Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, and Han Hu. Distribution matching variational autoencoder. arXiv preprint arXiv:2512.07778, 2025.
[28] Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, and Shuhang Gu. Taming sampling perturbations with variance expansion loss for latent diffusion models. arXiv preprint arXiv:2603.21085, 2026.
[29] Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025.
[30] Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, and Hanwang Zhang. Generative multimodal pretraining with discrete diffusion timestep tokens. In CVPR, 2025.
[31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[32] Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2025.
[33] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[34] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In ICML, 2023.
[35] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR,
[36] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[37] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
[38] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS,
[39] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
[40] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. TMLR, 2023.
[41] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.
[42] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. In NeurIPS, 2025.

A Implementation Details

A.1 DecQ Implementation

We follow the training scheme of RAE. For the encoder, we use DINOv2 with Registers [41] to process images resized to 224×224, producin...

[1] [1]

High- resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High- resolution image synthesis with latent diffusion models. InCVPR, 2022. 1, 3, 6

work page 2022

[2] [2]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024

[3] [3]

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Enze Xie, Junsong Chen, Junyu Chen, Han Cai, Haotian Tang, Yujun Lin, Zhekai Zhang, Muyang Li, Ligeng Zhu, Yao Lu, et al. Sana: Efficient high-resolution image synthesis with linear diffusion transformers.arXiv preprint arXiv:2410.10629, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Diffusion transformers with representation autoencoders

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. InICLR, 2026. 1, 3, 4, 5, 6, 7, 8, 14

work page 2026

[5] [5]

Dinov2: Learning robust visual features without supervision.TMLR, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khali- dov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.TMLR, 2024. 1

work page 2024

[6] [6]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Al- abdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, local- ization, and dense features.arXiv preprint arXiv:2502.14786, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[7] [7]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, 2021. 1

work page 2021

[8] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.

[9] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.

[10] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

[11] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.

[12] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.

[13] Hao Tang, Chenwei Xie, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, and Liwei Wang. Unilip: Adapting clip for unified multimodal understanding, generation and editing. arXiv preprint arXiv:2507.23278, 2025.

[14] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. In ICLR,

[15] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Zenan Zhou, Long Chen, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. In ICLR, 2026.

[16] Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligning visual foundation encoders to tokenizers for diffusion models. In ICLR, 2026.

[17] Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, et al. Rpiae: A representation-pivoted autoencoder enhancing both image generation and editing. arXiv preprint arXiv:2603.19206, 2026.

[18] Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, and Chongyi Li. Improving reconstruction of representation autoencoder. arXiv preprint arXiv:2602.08620, 2026.

[19] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, 2023.

[20] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024.

[21] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025.

[22] Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? In ICLR, 2026.

[23] Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers. In ICCV, 2025.

[24] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think. In NeurIPS, 2026.

[25] Yitong Chen, Zuxuan Wu, Xipeng Qiu, and Yu-Gang Jiang. Catok: Taming mean flows for one-dimensional causal image tokenization. In CVPR, 2026.

[26] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025.

[27] Sen Ye, Jianning Pei, Mengde Xu, Shuyang Gu, Chunyu Wang, Liwei Wang, and Han Hu. Distribution matching variational autoencoder. arXiv preprint arXiv:2512.07778, 2025.

[28] Qifan Li, Xingyu Zhou, Jinhua Zhang, Weiyi You, and Shuhang Gu. Taming sampling perturbations with variance expansion loss for latent diffusion models. arXiv preprint arXiv:2603.21085, 2026.

[29] Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. Vision foundation models can be good tokenizers for latent diffusion models. arXiv preprint arXiv:2510.18457, 2025.

[30] Kaihang Pan, Wang Lin, Zhongqi Yue, Tenglong Ao, Liyu Jia, Wei Zhao, Juncheng Li, Siliang Tang, and Hanwang Zhang. Generative multimodal pretraining with discrete diffusion timestep tokens. In CVPR, 2025.

[31] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

[32] Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation. arXiv preprint arXiv:2512.07829, 2025.

[33] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[34] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. In ICML, 2023.

[35] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR,

[36] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.

[37] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

[38] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS,

[39] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.

[40] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. TMLR, 2023.

[41] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.

[42] Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself. In NeurIPS, 2025.

A Implementation Details

A.1 DecQ Implementation

We follow the training scheme of RAE. For the encoder, we use DINOv2 with Registers [41] to process images resized to 224×224, producin...