Improved Baselines with Representation Autoencoders
Pith reviewed 2026-05-20 11:35 UTC · model grok-4.3
The pith
Representation autoencoders reach state-of-the-art image generation by summing the last k encoder layers, combining RAE with REPA, and re-parameterizing the model output to obtain guidance for free.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAEv2 rests on three changes: (1) a generalized formulation in which the representation is the sum of the last k layers of a pretrained encoder; (2) the recognition that RAE and REPA are complementary, so the same representation can be used for both encoding and alignment; and (3) a re-parameterization of the diffusion model output that yields guidance for free. Together these give more than 10x faster convergence, a state-of-the-art gFID of 1.06 in 80 epochs on ImageNet-256, and a state-of-the-art FDr^k of 2.17 at 80 epochs without post-training.
What carries the argument
A generalized multi-layer sum representation in RAE, together with its complementarity to REPA, which in turn enables free guidance via output re-parameterization.
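The multi-layer sum can be illustrated with a minimal sketch. It assumes the encoder exposes per-layer token features as nested lists and that layers are added element-wise without any normalization or weighting; the function name `sum_last_k_layers` and the toy values are hypothetical, and the paper's actual implementation may differ.

```python
from typing import List, Sequence

Vector = List[float]


def sum_last_k_layers(layer_outputs: Sequence[Sequence[Vector]], k: int) -> List[Vector]:
    """Element-wise sum of the token features from the last k layers.

    layer_outputs[l][t] is the feature vector of token t at layer l.
    Assumption: plain summation, no per-layer normalization or weights.
    """
    if not 1 <= k <= len(layer_outputs):
        raise ValueError("k must be between 1 and the number of layers")
    selected = layer_outputs[-k:]
    num_tokens = len(selected[0])
    dim = len(selected[0][0])
    summed = [[0.0] * dim for _ in range(num_tokens)]
    for layer in selected:
        for t, token in enumerate(layer):
            for d, value in enumerate(token):
                summed[t][d] += value
    return summed


# Toy hidden states: 3 layers, 2 tokens, 2 dimensions (hypothetical values).
hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [1.0, 1.0]],
    [[0.5, 0.5], [3.0, 0.0]],
]
final_only = sum_last_k_layers(hidden_states, k=1)  # original RAE: final layer only
last_two = sum_last_k_layers(hidden_states, k=2)    # generalized form with k = 2
```

Setting k = 1 recovers the final-layer representation, so the generalized form contains the original RAE choice as a special case.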
If this is right
- RAEv2 attains EP_FID@2 of 35 epochs versus 177 epochs for the original RAE.
- State-of-the-art FDr^k of 2.17 is reached at 80 epochs compared with the prior best of 3.26 at 800 epochs.
- No second diffusion model is needed for AutoGuidance.
- Consistent gains appear in text-to-image generation and navigation world models.
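The EP_FID@k figures above follow the abstract's definition: the number of epochs needed to reach an unguided gFID of at most k. A minimal sketch of that definition is below. It assumes gFID is only measured at discrete checkpoints and reports the first checkpoint that meets the threshold; the helper name `ep_fid_at_k` and the curve values are hypothetical and are not results from the paper.

```python
from typing import Optional, Sequence, Tuple


def ep_fid_at_k(curve: Sequence[Tuple[int, float]], k: float) -> Optional[int]:
    """Return the first evaluated epoch whose unguided gFID is <= k.

    curve holds (epoch, unguided gFID) pairs. Returns None if the
    threshold is never reached within the evaluated checkpoints.
    """
    for epoch, gfid in sorted(curve):
        if gfid <= k:
            return epoch
    return None


# Hypothetical checkpoints, for illustration only.
toy_curve = [(10, 6.1), (20, 3.4), (40, 1.9), (80, 1.5)]
epochs_to_fid_2 = ep_fid_at_k(toy_curve, 2.0)
epochs_to_fid_1 = ep_fid_at_k(toy_curve, 1.0)
```

Because the result is tied to checkpoint spacing, two runs are only comparable under this sketch if they are evaluated at the same epochs.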
Where Pith is reading between the lines
- The same layer-summation and complementarity pattern could be tested in video or 3D diffusion models to check for similar efficiency gains.
- If the free-guidance trick generalizes, many existing diffusion training pipelines could drop the cost of separate guidance models.
- The approach suggests that assumptions about whether alignment replaces or augments autoencoding should be re-examined in other generative settings.
Load-bearing premise
Pretrained vision encoders supply representations general enough that the improvements transfer to new domains and architectures without major hyperparameter retuning.
What would settle it
Training RAEv2 on a new dataset or architecture and finding that convergence speed and final quality match the original RAE only after extensive retuning would show the claimed generality does not hold.
Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.
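The second insight, using one representation as both the diffusion latent and the alignment target, can be sketched as follows. This assumes the per-token negative cosine similarity objective commonly associated with REPA; the function names, the toy vectors, and the absence of a learned projection head are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(projected_hidden: Sequence[Vector], target_tokens: Sequence[Vector]) -> float:
    """Negative mean per-token cosine similarity (lower means better aligned)."""
    sims = [cosine(h, y) for h, y in zip(projected_hidden, target_tokens)]
    return -sum(sims) / len(sims)


# The same encoder tokens play two roles in this sketch:
# they are the latent the diffusion model denoises, and the alignment target.
encoder_tokens = [[1.0, 0.0], [0.0, 2.0]]        # hypothetical encoder output
aligned_hidden = [[3.0, 0.0], [0.0, 0.5]]        # same direction per token
orthogonal_hidden = [[0.0, 1.0], [1.0, 0.0]]     # orthogonal per token

loss_aligned = alignment_loss(aligned_hidden, encoder_tokens)
loss_orthogonal = alignment_loss(orthogonal_hidden, encoder_tokens)
```

The loss depends only on direction, not scale, which is why the aligned example reaches the minimum even though its magnitudes differ from the target.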
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RAEv2, an improved version of Representation Autoencoders (RAE), which use pretrained vision encoders in place of VAEs. Key contributions include: (1) defining the representation as the sum of the last k encoder layers rather than only the final layer, improving reconstruction without finetuning; (2) empirical demonstration that RAE and REPA are complementary, allowing the same pretrained representation to serve as both encoder and intermediate-layer target in diffusion models; (3) re-parameterizing the DiT output to obtain guidance for free, without training the second, weaker diffusion model that AutoGuidance requires. These changes yield >10x faster convergence, with state-of-the-art gFID of 1.06 and FDr^k of 2.17 on ImageNet-256 after only 80 epochs (the previous best FDr^k of 3.26 took 800 epochs), plus consistent gains on text-to-image and navigation tasks. A new efficiency metric, EP_FID@k, is proposed, and code is released.
Significance. If the empirical results hold under broader conditions, this provides a strong, simplified baseline for diffusion-based generative models that leverages pretrained representations for both encoding and alignment. The reported 10x speedup in convergence to high-quality samples, free guidance mechanism, and new efficiency metric could influence training practices in image synthesis and related domains. The large-scale ablations on ImageNet-256 with DiT and cross-task validation add credibility, while the code release supports reproducibility.
major comments (2)
- [Complementarity analysis and ImageNet-256 experiments] The central claim of complementarity between RAE and REPA (allowing the same representation for encoder and REPA target) is load-bearing for the RAEv2 recipe and the reported speedups. However, the experiments primarily use CLIP-style encoders on ImageNet-256 with DiT; if this interaction depends on specific encoder statistics or the x-prediction re-parameterization, the gains may not generalize without retuning. Additional ablations with alternative encoder families (e.g., DINOv2 or non-CLIP variants) would directly test this assumption.
- [Guidance re-parameterization section] The free guidance via re-parameterization of the DiT output is presented as a key simplification over AutoGuidance. The manuscript should clarify whether this re-parameterization preserves the exact equivalence to REPA's intermediate-layer distillation or introduces any approximation that could affect guidance strength at different scales.
minor comments (3)
- [Method and experimental details] The exact values of k (number of summed layers) and the specific pretrained encoder checkpoints used in the main results should be stated explicitly in the experimental setup, as these are free parameters in the method.
- [Training details] Training schedules, learning rates, and batch sizes for the 80-epoch RAEv2 runs versus the 800-epoch baselines should be tabulated for direct comparison to ensure the efficiency claims are not confounded by optimization differences.
- [Figures] Figure captions and axis labels for the convergence plots could more clearly indicate the number of epochs at which each method reaches the reported gFID thresholds.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and constructive comments. We respond to each major comment below and indicate the corresponding revisions to the manuscript.
- Referee: The central claim of complementarity between RAE and REPA (allowing the same representation for encoder and REPA target) is load-bearing for the RAEv2 recipe and the reported speedups. However, the experiments primarily use CLIP-style encoders on ImageNet-256 with DiT; if this interaction depends on specific encoder statistics or the x-prediction re-parameterization, the gains may not generalize without retuning. Additional ablations with alternative encoder families (e.g., DINOv2 or non-CLIP variants) would directly test this assumption.
Authors: We thank the referee for this observation. Our primary large-scale ablations and ImageNet-256 results do center on CLIP-style encoders with DiT, as these constitute the standard experimental setting for such models. The complementarity finding is supported by extensive controlled ablations within this regime. We also report consistent gains when transferring the full RAEv2 recipe to text-to-image generation and navigation world models, which employ different data distributions and encoder families. In the revision we will expand the discussion section to explicitly link these cross-task results to the question of generalization and to note that the core mechanisms (multi-layer summation, joint encoder-target usage, and output re-parameterization) are architecture-agnostic. Full-scale DINOv2 ablations on ImageNet-256 would require substantial additional compute; we will therefore flag this as valuable future work rather than claim to have performed it. revision: partial
- Referee: The free guidance via re-parameterization of the DiT output is presented as a key simplification over AutoGuidance. The manuscript should clarify whether this re-parameterization preserves the exact equivalence to REPA's intermediate-layer distillation or introduces any approximation that could affect guidance strength at different scales.
Authors: We appreciate the request for clarification. The re-parameterization is an exact algebraic rewriting that treats the REPA target as an x-prediction objective inside the RAE latent space; no distributional approximation is introduced. Because the transformation is linear, the equivalence holds for any classifier-free guidance scale. In the revised manuscript we will insert a short derivation (either in the main text or as an appendix) that makes this equivalence explicit and confirms scale independence. revision: yes
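The linearity argument in this reply can be checked numerically with a small sketch. It assumes guidance takes the usual extrapolation form weak + w * (strong - weak) used by classifier-free guidance and AutoGuidance, and it uses a generic affine map as a stand-in for the re-parameterization; the coefficients below are hypothetical and do not come from the paper. Because the guidance weights sum to one, the extrapolation commutes with any affine map, at every scale w.

```python
from typing import List


def guide(strong: List[float], weak: List[float], w: float) -> List[float]:
    """Guidance extrapolation: weak + w * (strong - weak)."""
    return [b + w * (a - b) for a, b in zip(strong, weak)]


def affine(x: List[float], scale: float, shift: List[float]) -> List[float]:
    """Stand-in affine re-parameterization: scale * x + shift (hypothetical)."""
    return [scale * xi + si for xi, si in zip(x, shift)]


# Hypothetical predictions from a stronger and a weaker output head.
strong = [1.0, -2.0, 0.5]
weak = [0.0, 1.0, 2.0]
shift = [0.25, 0.5, -1.0]
scale = -2.0


def max_gap(w: float) -> float:
    """Largest difference between 'guide then map' and 'map then guide'."""
    guide_then_map = affine(guide(strong, weak, w), scale, shift)
    map_then_guide = guide(affine(strong, scale, shift), affine(weak, scale, shift), w)
    return max(abs(p - q) for p, q in zip(guide_then_map, map_then_guide))


gaps = [max_gap(w) for w in (0.0, 1.0, 1.5, 4.0)]
```

This only shows that an affine change of output parameterization does not alter guided predictions; whether the paper's re-parameterization is exactly affine in this sense is what the promised derivation would need to establish.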
Circularity Check
No circularity; purely empirical ablations and benchmarks
The paper's claims rest on large-scale empirical ablations (sum of last k layers, RAE+REPA complementarity, re-parameterization for free guidance) measured against external baselines on ImageNet-256, text-to-image, and navigation tasks. No derivations, equations, or first-principles results are presented that reduce to fitted inputs or self-citations by construction. All reported gains (e.g., gFID 1.06 at 80 epochs, EP_FID@2 of 35) are direct experimental outcomes compared to prior external work, with no self-definitional loops, renamed predictions, or load-bearing uniqueness theorems from the authors' prior papers.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of encoder layers k to sum
- diffusion training hyperparameters (learning rate, batch size, epochs)
axioms (1)
- domain assumption: Pretrained vision encoders provide sufficiently general representations without domain-specific finetuning