pith. machine review for the scientific record.

arxiv: 2604.28190 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Representation Fréchet Loss for Visual Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 05:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords Fréchet Distance · FD-loss · visual generation · one-step generators · ImageNet · FID metric · generative models · representation space

The pith

Fréchet Distance can be optimized as a training loss in representation space to improve visual generators

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the Fréchet Distance becomes usable as a training objective once population-level estimation is decoupled from the smaller batches needed for backpropagation. This FD-loss, applied as a post-training objective in feature spaces such as Inception, raises output quality for both one-step and multi-step generators on ImageNet. The same loss converts multi-step models into competitive one-step models without distillation or adversarial training. Standard Inception FID can misrank models, which motivates a multi-representation metric that better tracks actual visual quality.

Core claim

Fréchet Distance (FD) can be effectively optimized in the representation space by decoupling the population size for FD estimation from the batch size for gradient computation. Post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves 0.72 FID on ImageNet 256×256. The same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training, or per-sample targets. FID can misrank visual quality, motivating a multi-representation metric FDr^k.

What carries the argument

FD-loss: Fréchet Distance between real and generated features from a fixed extractor, with mean and covariance from a large population but gradients computed only on the current generated batch
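A minimal PyTorch sketch of that decoupling, not the paper's implementation: the real statistics come from a large precomputed population, gradients flow only through the current generated batch, and the εI shift mirrors the regularization discussed in the referee exchange below. Names and defaults here are illustrative assumptions.

    import torch

    def psd_sqrt(mat, eps=1e-6):
        # Square root of a symmetric PSD matrix via eigendecomposition; the
        # eps*I shift keeps near-singular covariances well conditioned.
        eye = torch.eye(mat.shape[0], device=mat.device, dtype=mat.dtype)
        vals, vecs = torch.linalg.eigh(mat + eps * eye)
        return (vecs * vals.clamp_min(0).sqrt()) @ vecs.T

    def fd_loss(gen_feats, mu_r, cov_r, cov_r_sqrt, eps=1e-6):
        # Fréchet Distance between fixed real-population statistics and the
        # statistics of the current generated batch; only gen_feats carries
        # gradients back to the generator.
        mu_g = gen_feats.mean(dim=0)
        centered = gen_feats - mu_g
        cov_g = centered.T @ centered / (gen_feats.shape[0] - 1)
        eye = torch.eye(cov_g.shape[0], device=gen_feats.device, dtype=gen_feats.dtype)
        cov_g = cov_g + eps * eye  # regularized surrogate of the sample covariance
        # tr((C_r C_g)^{1/2}) evaluated on the symmetric form
        # C_r^{1/2} C_g C_r^{1/2}, whose eigenvalues are real and
        # non-negative, so eigvalsh remains differentiable.
        inner = cov_r_sqrt @ cov_g @ cov_r_sqrt
        tr_sqrt = torch.linalg.eigvalsh(inner).clamp_min(eps).sqrt().sum()
        mean_term = (mu_r - mu_g).square().sum()
        return mean_term + torch.trace(cov_r) + torch.trace(cov_g) - 2.0 * tr_sqrt

    # Real statistics are computed once from the large population (e.g. 50k
    # features) and reused at every optimization step:
    # mu_r, cov_r = real_feats.mean(dim=0), torch.cov(real_feats.T)
    # cov_r_sqrt = psd_sqrt(cov_r)

The decoupling is visible in the signature: the population statistics are constants of the optimization, so the batch only has to be large enough for a stable covariance estimate, not for FD estimation itself.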

If this is right

  • Post-training with FD-loss in different representation spaces consistently improves visual quality
  • A one-step generator reaches 0.72 FID on ImageNet 256×256 using Inception features
  • Multi-step generators become strong one-step generators using only FD-loss, without distillation or adversarial training
  • Modern representations can yield better samples despite worse Inception FID, supporting the FDr^k multi-representation metric

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other distributional distances computed in learned feature spaces may serve as stable training objectives for generators
  • Precomputing real feature statistics once allows the approach to work with modest batch sizes during optimization
  • The observation that single-representation FID can misrank quality suggests evaluation protocols should routinely combine several modern feature extractors

Load-bearing premise

Gradients of the Fréchet Distance computed on small batches of generated features, using fixed large-population real statistics, still steer the generator toward a better output distribution without artifacts or overfitting to the chosen features

What would settle it

Training a base generator with FD-loss produces no gain or a drop in independent perceptual scores and human preference rankings compared with the unrefined base model

Figures

Figures reproduced from arXiv: 2604.28190 by Jiawei Yang, Xuan Ju, Yonglong Tian, Yue Wang, Zhengyang Geng.

Figure 1: One-step samples on ImageNet 256×256, before and after FD-loss post-training. Left: samples from the base generators. Right: samples after post-training with FD-loss. Top two rows: pMF-H [30], a one-step generator. Bottom: JiT-H [23], a multi-step generator. All models generate in a single network evaluation (1-NFE). FD-loss improves existing one-step models and can repurpose multi-step models into one-ste…
Figure 3: Is ImageNet generation “solved”? Left: FID over time. Recent methods surpass the real validation set images (red dashed) in terms of FID. Right: FDr^6 (Eq. 8), which averages normalized Fréchet Distance ratios across six representation spaces. Under this metric, even the strongest existing methods remain far from the validation images, indicating that FID alone masks significant quality gaps. Each method is…
Figure 4: FD-loss improves visual quality under different representations. Samples from pMF-B/16 [30] post-trained with FD-loss. Darker green: lower FID; darker yellow: lower FDr^6. Post-trained models improve over the base model (left). Inception post-trained model achieves the lowest FID (0.81) yet does not produce the best samples; models post-trained with modern representations achieve lower FDr^6 and show better…
Figure 5: Repurposing a multi-step model into a one-step generator with FD-loss. Samples from the same noise input across the base model and different post-trained models. The naive one-step base model fails to produce sensible images. After post-training, the 1-NFE models generate sensible images, and the strongest variants are visually comparable or superior to the 50-step base model…
Figure 6: Human preference study. Left: Our post-trained 1-NFE models (warm∗) are preferred over their base models. Right: Our pMF-H∗ is the most preferred generator against real ImageNet validation images, but still loses to real. This is consistent with…
Figure 7: Qualitative text-conditioned generation example. We post-train SD3.5 Medium [9] with FD-loss using BLIP3o-GPT4o-60k [3], a curated 60k dataset distilled from GPT-4o with a stylized aesthetic as the reference image distribution. Despite a 56× NFE reduction, the post-trained 1-NFE model can preserve recognizable prompt content while inheriting the stylized look of the reference distribution. More in Appendix G.
read the original abstract

We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population size for FD estimation (e.g., 50k) from the batch size for gradient computation (e.g., 1024). We term this approach FD-loss. Optimizing FD-loss reveals several surprising findings. First, post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves0.72 FID on ImageNet 256x256. Second, the same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training or per-sample targets. Third, FID can misrank visual quality: modern representations can yield better samples despite worse Inception FID. This motivates FDr^k, a multi-representation metric. We hope this work will encourage further exploration of distributional distances in diverse representation spaces as both training objectives and evaluation metrics for generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FD-loss, a training objective that optimizes Fréchet Distance (FD) in representation space by decoupling the population size used to estimate real/generated statistics (e.g., 50k samples) from the smaller batch size used for gradient computation (e.g., 1024). It claims that post-training base generators with FD-loss in spaces such as Inception features yields consistent visual improvements, including a one-step generator reaching 0.72 FID on ImageNet 256×256; the same loss can convert multi-step generators into strong one-step models without distillation, adversarial training, or per-sample targets; and standard FID can misrank quality, motivating the new multi-representation metric FDr^k.

Significance. If the central claims are substantiated, the work would be significant for generative modeling: it would show that a long-considered-impractical distributional distance can serve as a practical training objective in feature spaces, offering a route to high-quality one-step generators and exposing limitations of single-representation FID. The decoupling trick and empirical results on ImageNet would be noteworthy strengths, as would the introduction of FDr^k if it proves more reliable than Inception FID. The significance hinges on whether the reported FID improvements reflect genuine distributional alignment rather than artifacts of the optimization procedure.

major comments (2)
  1. [FD-loss formulation and gradient computation] The FD-loss formulation (described in the abstract and method sections) does not specify how the matrix square root of C_r C_g and its derivative are computed when the sample covariance C_g estimated from 1024 samples in 2048-dimensional Inception space is singular (rank ≤1023). The standard FD expression is undefined or unstable without regularization (e.g., εI) or a pseudo-inverse; no equation or implementation detail addresses this, so the back-propagated gradients may optimize a regularized surrogate rather than true population FD. This directly undermines the claim that FD 'can in fact be effectively optimized' and that the 0.72 FID reflects genuine improvement.
  2. [Experimental results] The reported quantitative results (0.72 FID for the one-step generator under Inception features) lack supporting details on gradient computation through the fixed representation extractor, ablation studies on population/batch sizes, controls for representation choice, or verification that optimization reduces the population-level FD rather than introducing feature-specific artifacts. These omissions make the central empirical claims difficult to verify from the given text.
minor comments (2)
  1. [Abstract] Typo in the abstract: 'achieves0.72 FID' is missing a space.
  2. [Metric definition] The exact definition and computation of the proposed FDr^k metric should be stated explicitly with an equation, rather than only motivated by observed discrepancies.
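
Minor comment 2 can be made concrete. The Figure 3 caption describes FDr^6 as averaging normalized Fréchet Distance ratios across six representation spaces, which suggests a form like the following; the per-space normalizer Z_i is left abstract here because the excerpted text never defines it:

    \mathrm{FDr}^{k} = \frac{1}{k} \sum_{i=1}^{k} \frac{\mathrm{FD}_i(p_{\mathrm{gen}}, p_{\mathrm{real}})}{Z_i}

where FD_i is the Fréchet Distance computed in the i-th representation space (e.g. Inception, DINOv2, CLIP) and Z_i rescales each space so that no single representation dominates the average.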

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and detailed review. We have carefully considered the major comments and revised the manuscript to provide the requested clarifications and additional experiments. Our point-by-point responses are as follows.

read point-by-point responses
  1. Referee: The FD-loss formulation (described in the abstract and method sections) does not specify how the matrix square root of C_r C_g and its derivative are computed when the sample covariance C_g estimated from 1024 samples in 2048-dimensional Inception space is singular (rank ≤1023). The standard FD expression is undefined or unstable without regularization (e.g., εI) or a pseudo-inverse; no equation or implementation detail addresses this, so the back-propagated gradients may optimize a regularized surrogate rather than true population FD. This directly undermines the claim that FD 'can in fact be effectively optimized' and that the 0.72 FID reflects genuine improvement.

    Authors: We agree that the original manuscript lacked sufficient implementation details on handling the matrix square root for potentially singular covariances. In our implementation, we regularize both covariance matrices by adding εI with ε = 1e-6 prior to computing the square root and its derivative, following standard practices for numerical stability in Fréchet Distance calculations. This regularization is applied consistently during both training and evaluation. We have updated the method section with the full mathematical formulation including this regularization term and provided pseudocode in the appendix. While the optimized objective is indeed a regularized FD, empirical results show that it effectively aligns the distributions, as evidenced by improvements in population-level FID computed on large independent sets. We believe this addresses the concern without undermining the core claim. revision: yes

  2. Referee: The reported quantitative results (0.72 FID for the one-step generator under Inception features) lack supporting details on gradient computation through the fixed representation extractor, ablation studies on population/batch sizes, controls for representation choice, or verification that optimization reduces the population-level FD rather than introducing feature-specific artifacts. These omissions make the central empirical claims difficult to verify from the given text.

    Authors: We thank the referee for pointing out these omissions. In the revised version, we have added substantial details and experiments: (i) The representation extractor is fixed and frozen, with gradients flowing directly from the FD-loss through the generated features to the generator parameters; we include a diagram and equations clarifying this. (ii) Ablation studies on population sizes (e.g., 10k vs 50k) and batch sizes (256 to 2048) are now in the appendix, showing robustness. (iii) Controls across multiple representation spaces (Inception, DINO, CLIP) are provided. (iv) We verify population-level FD reduction by computing FD on 50k samples pre- and post-optimization, demonstrating consistent decreases. These additions confirm the improvements reflect genuine distributional improvements rather than artifacts. revision: yes
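
A schematic of the verification described in (iv), reusing fd_loss and the precomputed real statistics from the sketch further up; generator.sample and the loop shape are illustrative assumptions, not the authors' code:

    import torch

    @torch.no_grad()
    def population_fd(generator, extractor, mu_r, cov_r, cov_r_sqrt,
                      n=50_000, batch=500):
        # Held-out check: FD on a large fresh sample, decoupled from the
        # small batches used for gradient steps during post-training.
        feats = []
        for _ in range(n // batch):
            images = generator.sample(batch)   # hypothetical 1-NFE sampling API
            feats.append(extractor(images))    # frozen feature extractor
        return fd_loss(torch.cat(feats), mu_r, cov_r, cov_r_sqrt)

    # Post-training step (sketch): gradients flow through the frozen
    # extractor into the generator; real statistics stay fixed throughout.
    # loss = fd_loss(extractor(generator.sample(1024)), mu_r, cov_r, cov_r_sqrt)
    # loss.backward(); optimizer.step(); optimizer.zero_grad()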

Circularity Check

0 steps flagged

No circularity: FD-loss is an empirical optimization procedure with independent experimental validation

full rationale

The paper introduces FD-loss by decoupling fixed population statistics (e.g., 50k real samples) from per-batch gradient computation (e.g., 1024 generated samples) and then reports the empirical outcomes of applying this loss to generators. The low reported FID values and visual-quality gains are direct results of the optimization process rather than quantities defined into the method by construction. No equations reduce a claimed prediction to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The derivation chain consists of a straightforward engineering proposal followed by experimental measurement; the final metrics are computed on held-out large-sample evaluations and are therefore not tautological with the training objective.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach depends on the mathematical feasibility of estimating and differentiating Fréchet Distance in representation space with chosen sample sizes, plus the introduction of a new multi-representation metric based on internal observations.

free parameters (2)
  • FD population size = e.g., 50k
    Example value of 50k chosen to enable accurate estimation of the full distribution for the loss.
  • gradient batch size = e.g., 1024
    Example value of 1024 chosen to allow practical gradient computation during optimization.
axioms (1)
  • domain assumption Fréchet Distance between two distributions in a fixed representation space can be estimated from finite samples and its gradient with respect to generator parameters can be computed for optimization.
    This underpins the entire FD-loss training procedure described in the abstract.
invented entities (1)
  • FDr^k no independent evidence
    purpose: A multi-representation Fréchet metric intended to provide more reliable evaluation when single-representation FID misranks sample quality.
    Introduced to address the abstract's observation that modern representations can yield better samples despite worse Inception FID.
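
A minimal sketch of how such a multi-representation score could be assembled from per-space FD values; the dictionary keys and normalizers are hypothetical, since the excerpt does not specify how each space is normalized:

    def fdr_k(per_space_fd: dict, normalizers: dict) -> float:
        # Average of normalized Fréchet Distance ratios across representation
        # spaces (e.g. "inception", "dinov2", "clip"); one FD value and one
        # assumed normalizer per space.
        return sum(per_space_fd[k] / normalizers[k] for k in per_space_fd) / len(per_space_fd)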

pith-pipeline@v0.9.0 · 5484 in / 1682 out tokens · 47623 ms · 2026-05-07T05:52:53.851369+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  2. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

Reference graph

Works this paper leans on

61 extracted references · 7 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Claude Code

    Anthropic. Claude Code. https://www.anthropic.com/claude-code, 2025

  2. [2]

    Demystifying MMD GANs

    Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018

  3. [3]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv:2505.09568, 2025

  4. [4]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017

  5. [5]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009

  6. [6]

    Generative Modeling via Drifting

    Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv:2602.04770, 2026

  7. [7]

    Generative modeling using the sliced wasserstein distance

    Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In CVPR, 2018

  8. [8]

    Image generation via minimizing Fréchet distance in discriminator feature space

    Khoa D Doan, Saurav Manchanda, Fengjiao Wang, Sathiya Keerthi, Avradeep Bhowmik, and Chandan K Reddy. Image generation via minimizing Fréchet distance in discriminator feature space. arXiv:2003.11774, 2020

  9. [9]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  10. [10]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In ICML, 2023

  11. [11]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. In NeurIPS, 2025

  12. [12]

    Improved mean flows: On the challenges of fast-forward generative models

    Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fast-forward generative models. In CVPR, 2026

  13. [13]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  14. [14]

    Problems of monetary management: the uk experience

    Charles AE Goodhart. Problems of monetary management: the UK experience. In Monetary Theory and Practice: The UK Experience, 1984

  15. [15]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

  16. [16]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020

  17. [17]

    GANs trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017

  18. [18]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

  19. [19]

    Rethinking FID: Towards a better evaluation metric for image generation

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In CVPR, 2024

  20. [20]

    The role of ImageNet classes in Fréchet Inception Distance

    Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet Inception Distance. In ICLR, 2023

  21. [21]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019

  22. [22]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In ICCV, 2025

  23. [23]

    Back to basics: Let denoising generative models denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. In CVPR, 2026

  24. [24]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024

  25. [25]

    Generative moment matching networks

    Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, 2015

  26. [26]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023

  27. [27]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  29. [29]

    Simplifying, stabilizing and scaling continuous-time consistency models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. In ICLR, 2025

  30. [30]

    One-step Latent-free Image Generation with Pixel Mean Flows

    Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows. arXiv:2601.22158, 2026

  31. [31]

    Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models

    Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models. In NeurIPS, 2023

  32. [32]

    One-step diffusion distillation through score implicit matching

    Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, and Guo-jun Qi. One-step diffusion distillation through score implicit matching. In NeurIPS, 2024

  33. [33]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024

  34. [34]

    Backpropagating through Fréchet Inception Distance

    Alexander Mathiasen and Frederik Hvilshøj. Backpropagating through Fréchet Inception Distance. arXiv:2009.14075, 2020

  35. [35]

    Group diffusion: Enhancing image generation by unlocking cross-sample collaboration

    Sicheng Mo, Thao Nguyen, Richard Zhang, Nick Kolkin, Siddharth Srinivasan Iyer, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, and Yuheng Li. Group diffusion: Enhancing image generation by unlocking cross-sample collaboration. In CVPR, 2026

  36. [36]

    Mcgan: Mean and covariance feature matching gan

    Youssef Mroueh, Tom Sercu, and Vaibhava Goel. Mcgan: Mean and covariance feature matching gan. In ICML, 2017

  37. [37]

    Codex CLI

    OpenAI. Codex CLI. https://github.com/openai/codex, 2025

  38. [38]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

  39. [39]

    Scalable diffusion models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  41. [41]

    Flowar: Scale-wise autoregressive image generation meets flow matching

    Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Flowar: Scale-wise autoregressive image generation meets flow matching. In ICML, 2025

  42. [42]

    Assessing generative models via precision and recall

    Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In NeurIPS, 2018

  43. [43]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016

  44. [44]

    Learning implicit generative models by matching perceptual features

    Cicero Nogueira dos Santos, Youssef Mroueh, Inkit Padhi, and Pierre Dognin. Learning implicit generative models by matching perceptual features. In ICCV, 2019

  45. [45]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023

  46. [46]

    Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

    George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In NeurIPS, 2023

  47. [47]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016

  48. [48]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv:2502.14786, 2025

  49. [49]

    Pixnerd: Pixel neural field diffusion

    Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. In ICLR, 2026

  50. [50]

    DDT: Decoupled diffusion transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. In CVPR, 2026

  51. [51]

    Representation entanglement for generation: Training diffusion transformers is much easier than you think

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. In NeurIPS, 2025

  52. [52]

    Unsupervised feature learning via non-parametric instance discrimination

    Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018

  53. [53]

    Revisiting the evaluation of image synthesis with GANs

    Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai, et al. Revisiting the evaluation of image synthesis with GANs. In NeurIPS, 2023

  54. [54]

    Latent denoising makes good visual tokenizers

    Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent denoising makes good visual tokenizers. In ICLR, 2026

  55. [55]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025

  56. [56]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024

  57. [57]

    Autoregressive image generation with masked bit modeling

    Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, and Xi Chen. Autoregressive image generation with masked bit modeling. arXiv:2602.09024, 2026

  58. [58]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025

  59. [59]

    Diffusion transformers with representation autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In ICLR, 2026

  60. [60]

    Inductive moment matching

    Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. In ICML, 2025

  61. [61]

    Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation

    Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In ICML, 2024