Improved Baselines with Representation Autoencoders

Boyang Zheng; Eli Shechtman; Jaskirat Singh; Richard Zhang; Saining Xie; Zongze Wu

Representation autoencoders reach state-of-the-art image generation by summing the last k encoder layers, combining RAE with REPA, and reparameterizing the output for free guidance.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-20 11:35 UTC pith:U57Y3URS

load-bearing objection: Summing the last k layers from a pretrained encoder, pairing RAE with REPA on the same representation, and reparameterizing for free guidance delivers over 10x faster convergence to strong FID on ImageNet-256.

arxiv 2605.18324 v1 · pith:U57Y3URS · submitted 2026-05-18 · cs.CV cs.AI cs.GR cs.LG stat.ML

Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie

keywords: representation autoencoders, diffusion models, image generation, representation alignment, classifier-free guidance, ImageNet, training efficiency

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that representation autoencoders, which replace VAEs with pretrained vision encoders in diffusion models, can be improved through three targeted changes. Defining the representation as the sum of the last k encoder layers rather than only the final layer raises reconstruction quality without encoder fine-tuning or specialized data. Large-scale tests reveal that RAE and representation alignment (REPA) complement each other, so the same pretrained features can serve as both the encoder and the target for intermediate diffusion layers. Reparameterizing the DiT output lets REPA supply guidance directly, removing the second, weaker diffusion model that the original RAE trains for AutoGuidance. Together these steps produce RAEv2, which converges more than ten times faster while reaching higher final quality on ImageNet generation.
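The layer-summation step can be pictured with a small sketch. It assumes the encoder exposes per-layer token features as nested lists; the `hidden_states` layout and the function name `sum_last_k_layers` are hypothetical and not taken from the paper's code. Setting `k=1` recovers the final-layer representation of the original RAE.

```python
from typing import List, Sequence

Vector = List[float]


def sum_last_k_layers(hidden_states: Sequence[Sequence[Vector]], k: int) -> List[Vector]:
    """Sum the token features of the last k layers, token by token.

    hidden_states[l][n] is the feature vector of token n at layer l
    (a hypothetical layout; real encoders expose this differently).
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    selected = hidden_states[-k:]
    num_tokens = len(selected[0])
    dim = len(selected[0][0])
    summed = [[0.0] * dim for _ in range(num_tokens)]
    for layer in selected:
        for n, token in enumerate(layer):
            for d, value in enumerate(token):
                summed[n][d] += value
    return summed


# Toy encoder output: 3 layers, 2 tokens, 2 channels (illustrative values only).
toy_hidden_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 1.0], [1.0, 3.0]],
]
final_only = sum_last_k_layers(toy_hidden_states, k=1)  # final layer only
last_two = sum_last_k_layers(toy_hidden_states, k=2)    # sum of the last two layers
```

Whether the paper normalizes or rescales the summed features is not stated in this review, so the sketch leaves the sum unscaled.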

Core claim

By adopting a generalized formulation where the representation sums the last k layers of a pretrained encoder, recognizing that RAE and REPA are complementary so the same representation can be used for both encoding and alignment, and re-parameterizing the diffusion model output to obtain guidance for free, RAEv2 achieves more than 10x faster convergence, a state-of-the-art gFID of 1.06 in 80 epochs on ImageNet-256, and a state-of-the-art FDr^k of 2.17 at 80 epochs without post-training.

What carries the argument

A generalized representation that sums the last k encoder layers in RAE, together with RAE's complementarity to REPA, which enables free guidance via output re-parameterization.
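To make the dual use of one representation concrete, here is a minimal sketch in which the same `rep` tokens act as the diffusion latent target and as the alignment target for projected intermediate features. The negative mean cosine similarity used here is an assumption borrowed from common REPA-style setups; the paper's exact alignment loss is not given in this review, and all names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def alignment_loss(projected: Sequence[Sequence[float]],
                   target: Sequence[Sequence[float]]) -> float:
    """Negative mean token-wise cosine similarity (assumed REPA-style term)."""
    sims = [cosine_similarity(p, r) for p, r in zip(projected, target)]
    return -sum(sims) / len(sims)


# One pretrained representation, used in two roles (toy values).
rep: List[List[float]] = [[1.0, 0.0], [0.0, 1.0]]
latent_target = rep      # role 1: the latent the diffusion model generates
alignment_target = rep   # role 2: the target for intermediate-layer features

aligned_features = [[2.0, 0.0], [0.0, 3.0]]     # same directions as rep
misaligned_features = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to rep

loss_aligned = alignment_loss(aligned_features, alignment_target)
loss_misaligned = alignment_loss(misaligned_features, alignment_target)
```

In this toy case the aligned features reach the minimum loss of -1, while the orthogonal ones give 0.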

Load-bearing premise

Pretrained vision encoders supply representations general enough that the improvements transfer to new domains and architectures without major hyperparameter retuning.

What would settle it

Training RAEv2 on a new dataset or architecture and finding that convergence speed and final quality match the original RAE only after extensive retuning would show the claimed generality does not hold.

If this is right

RAEv2 attains EP_FID@2 of 35 epochs versus 177 epochs for the original RAE.
State-of-the-art FDr^k of 2.17 is reached at 80 epochs compared with the prior best of 3.26 at 800 epochs.
No second diffusion model is needed for AutoGuidance.
Consistent gains appear in text-to-image generation and navigation world models.
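The paper defines EP_FID@k as the number of epochs needed to reach an unguided gFID of at most k. A minimal sketch of that computation follows, assuming gFID is evaluated at discrete checkpoints; the function name and the toy curve are hypothetical, and only the two EP_FID@2 values in the final line come from the list above.

```python
from typing import Optional, Sequence, Tuple


def ep_fid_at_k(curve: Sequence[Tuple[int, float]], k: float) -> Optional[int]:
    """Return the first evaluated epoch whose unguided gFID is <= k.

    curve is a sequence of (epoch, gfid) checkpoints. Returns None when
    the threshold is never reached.
    """
    for epoch, gfid in sorted(curve):
        if gfid <= k:
            return epoch
    return None


# Made-up evaluation curve, for illustration only (not from the paper).
toy_curve = [(10, 6.4), (20, 3.1), (30, 2.2), (40, 1.9), (50, 1.7)]
toy_result = ep_fid_at_k(toy_curve, k=2.0)

# Ratio of the two EP_FID@2 values quoted above: 177 epochs vs. 35 epochs.
epoch_ratio = 177 / 35  # roughly 5.06
```

The result can only be as fine-grained as the checkpoint spacing, so sparse evaluation overestimates the true crossing epoch.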

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same layer-summation and complementarity pattern could be tested in video or 3D diffusion models to check for similar efficiency gains.
If the free-guidance trick generalizes, many existing diffusion training pipelines could drop the cost of separate guidance models.
The approach suggests that assumptions about whether alignment replaces or augments autoencoding should be re-examined in other generative settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

This paper's main result is that three straightforward changes to representation autoencoders produce more than 10x faster convergence while reaching a gFID of 1.06 at 80 epochs on ImageNet-256: summing the last k layers instead of using only the final one, running RAE and REPA together on the same pretrained representation, and reparameterizing the DiT output to obtain guidance without a second model. The authors also introduce EP_FID@k as a practical measure of training efficiency and show that the recipe carries over to text-to-image generation and navigation world models.

Referee Report

2 major / 3 minor

Summary. The paper introduces RAEv2, an improved version of Representation Autoencoders (RAE), which replace VAEs with pretrained vision encoders. Key contributions include: (1) defining the representation as the sum of the last k encoder layers rather than only the final layer, improving reconstruction without finetuning; (2) empirical demonstration that RAE and REPA are complementary, allowing the same pretrained representation to serve as both encoder and intermediate-layer target in diffusion models; (3) re-parameterizing the DiT output to obtain guidance for free, without training the second, weaker model that AutoGuidance requires. These changes yield >10x faster convergence, with a state-of-the-art gFID of 1.06 and FDr^k of 2.17 on ImageNet-256 after only 80 epochs (vs. the previous best FDr^k of 3.26 at 800 epochs), plus consistent gains on text-to-image and navigation tasks. A new efficiency metric, EP_FID@k, is proposed, and code is released.

Significance. If the empirical results hold under broader conditions, this provides a strong, simplified baseline for diffusion-based generative models that leverages pretrained representations for both encoding and alignment. The reported 10x speedup in convergence to high-quality samples, free guidance mechanism, and new efficiency metric could influence training practices in image synthesis and related domains. The large-scale ablations on ImageNet-256 with DiT and cross-task validation add credibility, while the code release supports reproducibility.

major comments (2)

[Complementarity analysis and ImageNet-256 experiments] The central claim of complementarity between RAE and REPA (allowing the same representation for encoder and REPA target) is load-bearing for the RAEv2 recipe and the reported speedups. However, the experiments primarily use CLIP-style encoders on ImageNet-256 with DiT; if this interaction depends on specific encoder statistics or the x-prediction re-parameterization, the gains may not generalize without retuning. Additional ablations with alternative encoder families (e.g., DINOv2 or non-CLIP variants) would directly test this assumption.
[Guidance re-parameterization section] The free guidance via re-parameterization of the DiT output is presented as a key simplification over AutoGuidance. The manuscript should clarify whether this re-parameterization preserves the exact equivalence to REPA's intermediate-layer distillation or introduces any approximation that could affect guidance strength at different scales.

minor comments (3)

[Method and experimental details] The exact values of k (number of summed layers) and the specific pretrained encoder checkpoints used in the main results should be stated explicitly in the experimental setup, as these are free parameters in the method.
[Training details] Training schedules, learning rates, and batch sizes for the 80-epoch RAEv2 runs versus the 800-epoch baselines should be tabulated for direct comparison to ensure the efficiency claims are not confounded by optimization differences.
[Figures] Figure captions and axis labels for the convergence plots could more clearly indicate the number of epochs at which each method reaches the reported gFID thresholds.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We respond to each major comment below and indicate the corresponding revisions to the manuscript.

Referee: The central claim of complementarity between RAE and REPA (allowing the same representation for encoder and REPA target) is load-bearing for the RAEv2 recipe and the reported speedups. However, the experiments primarily use CLIP-style encoders on ImageNet-256 with DiT; if this interaction depends on specific encoder statistics or the x-prediction re-parameterization, the gains may not generalize without retuning. Additional ablations with alternative encoder families (e.g., DINOv2 or non-CLIP variants) would directly test this assumption.

Authors: We thank the referee for this observation. Our primary large-scale ablations and ImageNet-256 results do center on CLIP-style encoders with DiT, as these constitute the standard experimental setting for such models. The complementarity finding is supported by extensive controlled ablations within this regime. We also report consistent gains when transferring the full RAEv2 recipe to text-to-image generation and navigation world models, which employ different data distributions and encoder families. In the revision we will expand the discussion section to explicitly link these cross-task results to the question of generalization and to note that the core mechanisms (multi-layer summation, joint encoder-target usage, and output re-parameterization) are architecture-agnostic. Full-scale DINOv2 ablations on ImageNet-256 would require substantial additional compute; we will therefore flag this as valuable future work rather than claim to have performed it.
Revision: partial
Referee: The free guidance via re-parameterization of the DiT output is presented as a key simplification over AutoGuidance. The manuscript should clarify whether this re-parameterization preserves the exact equivalence to REPA's intermediate-layer distillation or introduces any approximation that could affect guidance strength at different scales.

Authors: We appreciate the request for clarification. The re-parameterization is an exact algebraic rewriting that treats the REPA target as an x-prediction objective inside the RAE latent space; no distributional approximation is introduced. Because the transformation is linear, the equivalence holds for any classifier-free guidance scale. In the revised manuscript we will insert a short derivation (either in the main text or as an appendix) that makes this equivalence explicit and confirms scale independence.
Revision: yes
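The scale-independence argument in the second response rests on linearity, which a small numerical check can illustrate. The sketch assumes a flow-matching convention x_t = (1 - t) * x + t * eps with velocity v = eps - x, so that an x-prediction maps to a velocity through an affine function, and it uses the usual extrapolation weak + w * (main - weak) as the guidance rule. Both are assumptions for illustration, not the paper's stated parameterization, and all values are arbitrary. The check shows only that guidance commutes with an affine change of prediction target; it does not establish the equivalence to REPA that the referee asks about.

```python
from typing import List


def x_to_velocity(x_hat: List[float], x_t: List[float], t: float) -> List[float]:
    """Convert an x-prediction to a velocity.

    Assumes x_t = (1 - t) * x + t * eps and v = eps - x, which gives
    v = (x_t - x) / t for t > 0.
    """
    return [(xt - xh) / t for xh, xt in zip(x_hat, x_t)]


def guide(main: List[float], weak: List[float], w: float) -> List[float]:
    """Extrapolate away from the weak prediction: weak + w * (main - weak)."""
    return [wk + w * (m - wk) for m, wk in zip(main, weak)]


# Arbitrary toy values, chosen only for illustration.
x_t = [0.5, -1.0, 2.0]
t = 0.4
x_main = [0.2, -0.7, 1.5]   # x-prediction from the main output
x_weak = [0.1, -0.2, 1.0]   # x-prediction from a weaker readout

max_gap = 0.0
for w in (1.0, 1.5, 4.0):
    # Guide in x-space, then convert to a velocity.
    guided_in_x = x_to_velocity(guide(x_main, x_weak, w), x_t, t)
    # Convert both predictions to velocities, then guide.
    guided_in_v = guide(x_to_velocity(x_main, x_t, t), x_to_velocity(x_weak, x_t, t), w)
    gap = max(abs(a - b) for a, b in zip(guided_in_x, guided_in_v))
    max_gap = max(max_gap, gap)
```

Here `max_gap` stays at floating-point noise level for every scale tried, as expected when the two orders of operation agree algebraically.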

Circularity Check

0 steps flagged

No circularity; purely empirical ablations and benchmarks

The paper's claims rest on large-scale empirical ablations (sum of last k layers, RAE+REPA complementarity, re-parameterization for free guidance) measured against external baselines on ImageNet-256, text-to-image, and navigation tasks. No derivations, equations, or first-principles results are presented that reduce to fitted inputs or self-citations by construction. All reported gains (e.g., gFID 1.06 at 80 epochs, EP_FID@2 of 35) are direct experimental outcomes compared to prior external work, with no self-definitional loops, renamed predictions, or load-bearing uniqueness theorems from the authors' prior papers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work is empirical and therefore rests on standard ML training assumptions plus several tunable choices; no new physical or mathematical entities are postulated.

free parameters (2)

number of encoder layers k to sum
Chosen to improve reconstruction; value is not derived from first principles and affects the reported gains.
diffusion training hyperparameters (learning rate, batch size, epochs)
Standard but tuned for the 80-epoch regime that produces the headline metrics.

axioms (1)

domain assumption: Pretrained vision encoders provide sufficiently general representations without domain-specific finetuning
Invoked when claiming improvements hold without specialized data such as text or faces.

pith-pipeline@v0.9.0 · 5923 in / 1395 out tokens · 52495 ms · 2026-05-20T11:35:44.861778+00:00 · methodology

Original abstract

Representation Autoencoders (RAE) replace traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr^k, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EP_FID@k (epochs to reach unguided gFID <= k) as a measure of training efficiency. RAEv2 attains an EP_FID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. Code is available at https://raev2.github.io.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multiplayer Interactive World Models with Representation Autoencoders
cs.CV 2026-07 accept novelty 7.0

A 5B-parameter latent diffusion model generates real-time four-player Rocket League matches conditioned on all players' actions, staying stable far beyond its training horizon.

EgoWAM: World Action Models Beyond Pixels with In-the-Wild Egocentric Human Data
cs.RO 2026-07 accept novelty 6.5

World Action Model co-training with DINO or 3D-flow targets scales human-to-robot transfer on bimanual tasks far better than behavior cloning, while pixel prediction transfers weakly.

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
cs.CV 2026-06 unverdicted novelty 6.0

MIMFlow is an end-to-end model that routes semantic latents through a normalizing flow while a decoder handles high-frequency pixels, reporting FID 2.50 and 71.3% linear probing accuracy on ImageNet 256x256 with 128 tokens.

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
cs.CV 2026-06 unverdicted novelty 6.0

MIMFlow uses a VAE on masked images to feed semantic latents to a normalizing flow while a decoder handles high-frequency details, reporting FID 2.50 and 71.3% linear probing on ImageNet 256x256 with 128 tokens.

MIMFlow: Integrating Masked Image Modeling with Normalizing Flows for End-to-End Image Generation
cs.CV 2026-06 accept novelty 6.0

End-to-end masked-image VAE plus normalizing flow yields FID 2.50 on ImageNet 256 with 128 tokens and higher linear-probe accuracy than unmasked counterparts.

DiffusionBench: On Holistic Evaluation of Diffusion Transformers
cs.CV 2026-06 conditional novelty 6.0

NanoGen unifies DiT training on ImageNet and T2I, reveals negative Pearson correlations (-0.377 to -0.580) in method rankings across metrics from 21 models, and motivates DiffusionBench for holistic evaluation.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 4 Pith papers · 16 internal anchors

[1]

Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024

work page 2024
[2]

Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023

work page 2023
[3]

Navigation World Models

Amir Bar, Gaoyue Zhou, Danny Tran, Trevor Darrell, and Yann LeCun. Navigation world models.arXiv preprint arXiv:2412.03572, 2024. URLhttps://arxiv.org/abs/2412.03572

work page Pith review arXiv 2024
[4]

MIND: Monge Inception Distance for Generative Models Evaluation

Quentin Berthet, Yu-Han Wu, Clement Crepy, Romuald Elie, Klaus Greff, and Michael Eli Sander. Mind: Monge inception distance for generative models evaluation.arXiv preprint arXiv:2605.06797, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models

Tianci Bi, Xiaoyi Zhang, Yan Lu, and Nanning Zheng. VFM-VAE: Vision foundation models can be good tokenizers for latent diffusion models.arXiv preprint arXiv:2510.18457, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network.arXiv:2504.13181, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

arXiv preprint arXiv:2602.04873 , year=

Ramon Calvo-González and François Fleuret. Laminating representation autoencoders for efficient diffusion. arXiv preprint arXiv:2602.04873, 2026

work page arXiv 2026
[8]

Hyperspherical Autoencoder for High-Fidelity Image Reconstruction and Generation

Hun Chang, Byunghee Cha, and Jong Chul Ye. Dino-sae: Dino spherical autoencoder for high-fidelity image reconstruction and generation.arXiv preprint arXiv:2601.22904, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

Aligning visual foundation encoders to tokenizers for diffusion models

Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, and Kai Zhang. Aligntok: Aligning visual foundation encoders to tokenizers for diffusion models.arXiv preprint arXiv:2509.25162, 2025

work page arXiv 2025
[11]

Masked autoencoders are effective tokenizers for diffusion models

Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, and Bhiksha Raj. Masked autoencoders are effective tokenizers for diffusion models. InInternational Conference on Machine Learning, 2025

work page 2025
[12]

arXiv preprint arXiv:2501.15420 , year=

Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, and Jun Zhu. Visual generation without guidance.arXiv preprint arXiv:2501.15420, 2025

work page arXiv 2025
[13]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, Le Xue, Caiming Xiong, and Ran Xu. Blip3-o: A family of fully open unified multimodal models–architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

An empirical study of training self-supervised vision transformers

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 9640–9649, 2021

work page 2021
[15]

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Bohua Chen, Yi Su, Dongping Chen, Siyuan Wu, Xing Zhou, and 1 others

David Fan, Shengbang Tong, Jiachen Zhu, Koustuv Sinha, Zhuang Liu, Xinlei Chen, Michael Rabbat, Nicolas Ballas, Yann LeCun, Amir Bar, et al. Scaling language-free visual representation learning.arXiv preprint arXiv:2504.01017, 2025

work page arXiv 2025
[16]

arXiv preprint arXiv:2512.19693 , year=

Weichen Fan, Haiwen Diao, Quan Wang, Dahua Lin, and Ziwei Liu. The prism hypothesis: Harmonizing semantic and pixel representations via unified autoencoding.arXiv preprint arXiv:2512.19693, 2025

work page arXiv 2025
[17]

DreamSim: Learning New Dimensions of Human Visual Similarity using Synthetic Data

Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. DreamSim: Learning new dimensions of human visual similarity using synthetic data. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URLhttps://arxiv.org/abs/2306.09344

work page internal anchor Pith review arXiv 2023
[18]

One layer is enough: Adapting pretrained visual encoders for image generation

Yuan Gao, Chen Chen, Tianrong Chen, and Jiatao Gu. One layer is enough: Adapting pretrained visual encoders for image generation.arXiv preprint arXiv:2512.07829, 2025. 15

work page arXiv 2025
[19]

Geneval: An object-focused framework for evaluating text-to-image alignment

Dhruba Ghosh, Hanna Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InAdvances in Neural Information Processing Systems, 2023

work page 2023
[20]

Rpiae: A representation-pivoted autoencoder enhancing both image generation and editing.arXiv preprint arXiv:2603.19206, 2026

Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, and Lijun Zhang. RPiAE: A representation-pivoted autoencoder enhancing both image generation and editing.arXiv preprint arXiv:2603.19206, 2026

work page arXiv 2026
[21]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022

work page 2022
[22]

Unified latents (ul): How to train your latents

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, and Tim Salimans. Unified latents (ul): How to train your latents.arXiv preprint arXiv:2602.17270, 2026

work page arXiv 2026
[23]

Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

work page 2017
[24]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[25]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Dino-tok: Adapting dino for visual tokenizers, 2025

Mingkai Jia, Mingxiao Li, Zhijian Shu, Anlin Zheng, Liaoyuan Fan, Jiaxin Guo, Tianxing Shi, Dongyue Lu, Zeming Li, Xiaoyang Guo, Xiaojuan Qi, Xiao-Xiao Long, Qian Zhang, Ping Tan, and Wei Yin. Dino-tok: Adapting dino for visual tokenizers.arXiv preprint arXiv:2511.20565, 2025

work page arXiv 2025
[27]

arXiv preprint arXiv:2505.02831 , year=

Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, and Jingdong Wang. No other representation component is needed: Diffusion transformers can provide representation guidance by themselves.arXiv preprint arXiv:2505.02831, 2025

work page arXiv 2025
[28]

modded-nanogpt: Speedrunning the nanogpt baseline, 2024

Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024. URLhttps://github.com/KellerJordan/modded-nanogpt

work page 2024
[29]

Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37, 2024

Tero Karras, Miika Aittala, Tuomas Kynkäänniemi, Jaakko Lehtinen, Timo Aila, and Samuli Laine. Guiding a diffusion model with a bad version of itself.Advances in Neural Information Processing Systems, 37, 2024

work page 2024
[30]

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724, 2024

work page Pith review arXiv 2024
[31]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

work page 2024
[32]

Deeply-supervised nets

Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial intelligence and statistics, pages 562–570. Pmlr, 2015

work page 2015
[33]

arXiv preprint arXiv:2504.10483 , year=

Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483, 2025

work page arXiv 2025
[34]

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, et al. Genai-bench: Evaluating and improving compositional text-to-visual generation.arXiv preprint arXiv:2406.13743, 2024

work page Pith review arXiv 2024
[35]

Back to Basics: Let Denoising Generative Models Denoise

Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise.arXiv preprint arXiv:2511.13720, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Improving reconstruction of representation autoencoder.arXiv preprint arXiv:2602.08620, 2026

Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, and Chongyi Li. Improving reconstruction of representation autoencoder.arXiv preprint arXiv:2602.08620, 2026

work page arXiv 2026
[37]

Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. InEuropean Conference on Computer Vision, pages 23–40. Springer, 2024

work page 2024
[38]

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025. 16

work page 2025
[39]

Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research Journal, pages 1–31, 2024

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.Transactions on Machine Learning Research Journal, pages 1–31, 2024

work page 2024
[40]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023

work page 2023
[41]

Würstchen: An efficient architecture for large-scale text-to-image diffusion models

Pablo Pernias, Dominic Rampas, Mats Leon Richter, Christopher Pal, and Marc Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. InThe Twelfth International Conference on Learning Representations, 2023

work page 2023
[42]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021

work page 2021
[43]

Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans.Advances in neural information processing systems, 29, 2016

work page 2016
[44]

Gnm: A general navigation model to drive any robot.arXiv preprint arXiv:2210.03370, 2022

Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, and Sergey Levine. GNM: A general navigation model to drive any robot. InInternational Conference on Robotics and Automation (ICRA), 2023. URL https: //arxiv.org/abs/2210.03370

work page arXiv 2023
[45]

ViNT: A foundation model for visual navigation,

Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachowicz, Kevin Black, Noriaki Hirose, and Sergey Levine. ViNT: A foundation model for visual navigation. InConference on Robot Learning (CoRL), 2023. URL https: //arxiv.org/abs/2306.14846

work page arXiv 2023
[46]

Latent diffusion model without variational autoencoder.arXiv preprint arXiv:2510.15301,

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder.arXiv preprint arXiv:2510.15301, 2025

work page arXiv 2025
[47]

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien...
[48]

Jaskirat Singh, Xingjian Leng, Zongze Wu, Liang Zheng, Richard Zhang, Eli Shechtman, and Saining Xie. What matters for representation alignment: Global information or spatial structure? arXiv preprint arXiv:2512.10794, 2025
[49]

Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey Levine. NoMaD: Goal masked diffusion policies for navigation and exploration. arXiv preprint arXiv:2310.07896, 2023. URL https://arxiv.org/abs/2310.07896
[50]

Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin Wang, and Hongsheng Li. Journeydb: A benchmark for generative image understanding. In Advances in Neural Information Processing Systems, 2023
[51]

Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, and Saining Xie. Scaling text-to-image diffusion transformers with representation autoencoders. arXiv preprint arXiv:2601.16208, 2026
[52]

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018
[53]

Runqian Wang and Kaiming He. Diffuse and disperse: Image generation with representation regularization. arXiv preprint arXiv:2506.09027, 2025
[54]

Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. Ddt: Decoupled diffusion transformer, 2025
[55]

Wikipedia contributors. Dimensionality reduction – Wikipedia, the free encyclopedia, 2026. URL https://en.wikipedia.org/wiki/Dimensionality_reduction. [Online; accessed April 2026]
[56]

Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, Ming-Ming Cheng, and Xiang Li. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025
[57]

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, and Jiang Bian. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982, 2025
[58]

Saining Xie. RIP REPA. X (formerly Twitter), 2025. URL https://x.com/sainingxie/status/1977936727839736189
[59]

Wanghan Xu, Xiaoyu Yue, Zidong Wang, Yao Teng, Wenlong Zhang, Xihui Liu, Luping Zhou, Wanli Ouyang, and Lei Bai. Exploring representation-aligned latent space for better generation. arXiv preprint arXiv:2502.00359, 2025
[60]

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024
[61]

Jiawei Yang, Zhengyang Geng, Xuan Ju, Yonglong Tian, and Yue Wang. Representation fréchet loss for visual generation. arXiv preprint arXiv:2604.28190, 2026
[62]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. arXiv preprint arXiv:2501.01423, 2025
[63]

Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025
[64]

Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024
[65]

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. 2023
[66]

Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018
[67]

Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, and Ping Luo. Both semantics and reconstruction matter: Making representation encoders ready for text-to-image generation and editing. arXiv preprint arXiv:2512.17909, 2025
[68]

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, and Yu Cheng. Videorepa: Learning physics for video generation through relational alignment with foundation models. arXiv preprint arXiv:2505.23656, 2025
[69]

Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025. URL https://arxiv.org/abs/2510.11690
[70]

Xingyu Zhou, Qifan Li, Xiaobin Hu, Hai Chen, and Shuhang Gu. Guiding a diffusion transformer with the internal dynamics of itself. arXiv preprint arXiv:2512.24176, 2025
[71]

Chenchen Zhu, Saksham Suri, Cijo Jose, Maxime Oquab, Marc Szafraniec, Wei Wen, Yunyang Xiong, Patrick Labatut, Piotr Bojanowski, Raghuraman Krishnamoorthi, and Vikas Chandra. Efficient universal perception encoder. arXiv preprint arXiv:2603.22387, 2026
[72]

A white horse in a storm of fire above the ocean
[73]

a tall white chihuahua in the lotus position, draped in saffron robes
[74]

background of a frog and mushroom with hyper realistic detail in watercolor
[75]

Small cute hedgehog, for childrens book, chibi style, lovely style character design, funny cartoon, lovely animation, simple watercolor, white background, artistic watercolor, very detailed, watercolor, white background
[76]

a Fox druid wearing blue colorful robes casting thunder Wave
[77]

A mischievous Monkey riding a Harley Davidson on a desert highway, wearing aviator goggles and a leather jacket, with a trail of dust behind them. intricate 8k, soft lighting, beautifully color graded, Unreal Engine, Cinematic , Color Grading, Photography, Photoshoot,
[78]

Sea turtle swimming with fish and it is very clear, wide view ,also you can see sharks and manta rays in the distance, colored corals are visible on the bottom ,wide angle,Oil painting full body,very detailed,photograph, taken with Hasselblad X1D50c,
[79]

Dynamic action shot of a wet and scruffy lurcher dog, running with determination, splashing water droplets, blurred background to emphasize motion, outdoor setting, overcast day, Nikon D850, 70200mm lens, f2.8, 11000s shutter speed, ISO 800
[80]

photorealistic image of a golden retriever happily running through a green field with a lake in the background with cinematic lighting, high definition, depth of field superresolution, insanely detailed 10.Capture the magic of the nighttime forest with an incredible image of an owl perched in a tree, illuminated by the full moon. Use the Canon EOS1D X Mar...
