pith. machine review for the scientific record.

arxiv: 2604.28190 · v1 · submitted 2026-04-30 · 💻 cs.CV

Recognition: unknown

Representation Fréchet Loss for Visual Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 05:52 UTC · model grok-4.3

classification 💻 cs.CV
keywords Fréchet Distance · FD-loss · visual generation · one-step generators · ImageNet · FID metric · generative models · representation space

The pith

Fréchet Distance can be optimized as a training loss in representation space to improve visual generators

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the Fréchet Distance becomes usable as a training objective once population-level estimation is decoupled from the smaller batches needed for backpropagation. This FD-loss, applied as a post-training objective in feature spaces such as Inception, raises output quality for both one-step and multi-step generators on ImageNet. The same loss converts multi-step models into competitive one-step models without distillation or adversarial training. Standard Inception FID can misrank models, which motivates a multi-representation metric that better tracks actual visual quality.

Core claim

Fréchet Distance (FD) can be effectively optimized in the representation space by decoupling the population size for FD estimation from the batch size for gradient computation. Post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves 0.72 FID on ImageNet 256×256. The same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training, or per-sample targets. FID can misrank visual quality, motivating a multi-representation metric FDr^k.

What carries the argument

FD-loss: Fréchet Distance between real and generated features from a fixed extractor, with mean and covariance from a large population but gradients computed only on the current generated batch
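A minimal PyTorch sketch of that decoupling, not the paper's implementation: the real statistics come from a large precomputed population, gradients flow only through the current generated batch, and the εI shift mirrors the regularization discussed in the referee exchange below. Names and defaults here are illustrative assumptions.

    import torch

    def psd_sqrt(mat, eps=1e-6):
        # Square root of a symmetric PSD matrix via eigendecomposition; the
        # eps*I shift keeps near-singular covariances well conditioned.
        eye = torch.eye(mat.shape[0], device=mat.device, dtype=mat.dtype)
        vals, vecs = torch.linalg.eigh(mat + eps * eye)
        return (vecs * vals.clamp_min(0).sqrt()) @ vecs.T

    def fd_loss(gen_feats, mu_r, cov_r, cov_r_sqrt, eps=1e-6):
        # Fréchet Distance between fixed real-population statistics and the
        # statistics of the current generated batch; only gen_feats carries
        # gradients back to the generator.
        mu_g = gen_feats.mean(dim=0)
        centered = gen_feats - mu_g
        cov_g = centered.T @ centered / (gen_feats.shape[0] - 1)
        eye = torch.eye(cov_g.shape[0], device=gen_feats.device, dtype=gen_feats.dtype)
        cov_g = cov_g + eps * eye  # regularized surrogate of the sample covariance
        # tr((C_r C_g)^{1/2}) evaluated on the symmetric form
        # C_r^{1/2} C_g C_r^{1/2}, whose eigenvalues are real and
        # non-negative, so eigvalsh remains differentiable.
        inner = cov_r_sqrt @ cov_g @ cov_r_sqrt
        tr_sqrt = torch.linalg.eigvalsh(inner).clamp_min(eps).sqrt().sum()
        mean_term = (mu_r - mu_g).square().sum()
        return mean_term + torch.trace(cov_r) + torch.trace(cov_g) - 2.0 * tr_sqrt

    # Real statistics are computed once from the large population (e.g. 50k
    # features) and reused at every optimization step:
    # mu_r, cov_r = real_feats.mean(dim=0), torch.cov(real_feats.T)
    # cov_r_sqrt = psd_sqrt(cov_r)

The decoupling is visible in the signature: the population statistics are constants of the optimization, so the batch only has to be large enough for a stable covariance estimate, not for FD estimation itself.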

If this is right

  • Post-training with FD-loss in different representation spaces consistently improves visual quality
  • A one-step generator reaches 0.72 FID on ImageNet 256×256 using Inception features
  • Multi-step generators become strong one-step generators using only FD-loss, without distillation or adversarial training
  • Modern representations can yield better samples despite worse Inception FID, supporting the FDr^k multi-representation metric

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other distributional distances computed in learned feature spaces may serve as stable training objectives for generators
  • Precomputing real feature statistics once allows the approach to work with modest batch sizes during optimization
  • The observation that single-representation FID can misrank quality suggests evaluation protocols should routinely combine several modern feature extractors

Load-bearing premise

Gradients of the Fréchet Distance computed on small batches of generated features, using fixed large-population real statistics, still steer the generator toward a better output distribution without artifacts or overfitting to the chosen features

What would settle it

Training a base generator with FD-loss produces no gain or a drop in independent perceptual scores and human preference rankings compared with the unrefined base model

Figures

Figures reproduced from arXiv: 2604.28190 by Jiawei Yang, Xuan Ju, Yonglong Tian, Yue Wang, Zhengyang Geng.

Figure 1: One-step samples on ImageNet 256×256, before and after FD-loss post-training. Left: samples from the base generators. Right: samples after post-training with FD-loss. Top two rows: pMF-H [30], a one-step generator. Bottom: JiT-H [23], a multi-step generator. All models generate in a single network evaluation (1-NFE). FD-loss improves existing one-step models and can repurpose multi-step models into one-ste…
Figure 3: Is ImageNet generation “solved”? Left: FID over time. Recent methods surpass the real validation set images (red dashed) in terms of FID. Right: FDr^6 (Eq. 8), which averages normalized Fréchet Distance ratios across six representation spaces. Under this metric, even the strongest existing methods remain far from the validation images, indicating that FID alone masks significant quality gaps. Each method is…
Figure 4: FD-loss improves visual quality under different representations. Samples from pMF-B/16 [30] post-trained with FD-loss. Darker green: lower FID; darker yellow: lower FDr^6. Post-trained models improve over the base model (left). Inception post-trained model achieves the lowest FID (0.81) yet does not produce the best samples; models post-trained with modern representations achieve lower FDr^6 and show better…
Figure 5: Repurposing a multi-step model into a one-step generator with FD-loss. Samples from the same noise input across the base model and different post-trained models. The naive one-step base model fails to produce sensible images. After post-training, the 1-NFE models generate sensible images, and the strongest variants are visually comparable or superior to the 50-step base model…
Figure 6: Human preference study. Left: Our post-trained 1-NFE models (warm∗) are preferred over their base models. Right: Our pMF-H∗ is the most preferred generator against real ImageNet validation images, but still loses to real. This is consistent with…
Figure 7: Qualitative text-conditioned generation example. We post-train SD3.5 Medium [9] with FD-loss using BLIP3o-GPT4o-60k [3], a curated 60k dataset distilled from GPT-4o with a stylized aesthetic as the reference image distribution. Despite a 56× NFE reduction, the post-trained 1-NFE model can preserve recognizable prompt content while inheriting the stylized look of the reference distribution. More in Appendix G.
read the original abstract

We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population size for FD estimation (e.g., 50k) from the batch size for gradient computation (e.g., 1024). We term this approach FD-loss. Optimizing FD-loss reveals several surprising findings. First, post-training a base generator with FD-loss in different representation spaces consistently improves visual quality. Under the Inception feature space, a one-step generator achieves0.72 FID on ImageNet 256x256. Second, the same FD-loss repurposes multi-step generators into strong one-step generators without teacher distillation, adversarial training or per-sample targets. Third, FID can misrank visual quality: modern representations can yield better samples despite worse Inception FID. This motivates FDr^k, a multi-representation metric. We hope this work will encourage further exploration of distributional distances in diverse representation spaces as both training objectives and evaluation metrics for generative models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FD-loss, a training objective that optimizes Fréchet Distance (FD) in representation space by decoupling the population size used to estimate real/generated statistics (e.g., 50k samples) from the smaller batch size used for gradient computation (e.g., 1024). It claims that post-training base generators with FD-loss in spaces such as Inception features yields consistent visual improvements, including a one-step generator reaching 0.72 FID on ImageNet 256×256; the same loss can convert multi-step generators into strong one-step models without distillation, adversarial training, or per-sample targets; and standard FID can misrank quality, motivating the new multi-representation metric FDr^k.

Significance. If the central claims are substantiated, the work would be significant for generative modeling: it would show that a long-considered-impractical distributional distance can serve as a practical training objective in feature spaces, offering a route to high-quality one-step generators and exposing limitations of single-representation FID. The decoupling trick and empirical results on ImageNet would be noteworthy strengths, as would the introduction of FDr^k if it proves more reliable than Inception FID. The significance hinges on whether the reported FID improvements reflect genuine distributional alignment rather than artifacts of the optimization procedure.

major comments (2)
  1. [FD-loss formulation and gradient computation] The FD-loss formulation (described in the abstract and method sections) does not specify how the matrix square root of C_r C_g and its derivative are computed when the sample covariance C_g estimated from 1024 samples in 2048-dimensional Inception space is singular (rank ≤1023). The standard FD expression is undefined or unstable without regularization (e.g., εI) or a pseudo-inverse; no equation or implementation detail addresses this, so the back-propagated gradients may optimize a regularized surrogate rather than true population FD. This directly undermines the claim that FD 'can in fact be effectively optimized' and that the 0.72 FID reflects genuine improvement.
  2. [Experimental results] The reported quantitative results (0.72 FID for the one-step generator under Inception features) lack supporting details on gradient computation through the fixed representation extractor, ablation studies on population/batch sizes, controls for representation choice, or verification that optimization reduces the population-level FD rather than introducing feature-specific artifacts. These omissions make the central empirical claims difficult to verify from the given text.
minor comments (2)
  1. [Abstract] Typo in the abstract: 'achieves0.72 FID' is missing a space.
  2. [Metric definition] The exact definition and computation of the proposed FDr^k metric should be stated explicitly with an equation, rather than only motivated by observed discrepancies.
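
Minor comment 2 can be made concrete. The Figure 3 caption describes FDr^6 as averaging normalized Fréchet Distance ratios across six representation spaces, which suggests a form like the following; the per-space normalizer Z_i is left abstract here because the excerpted text never defines it:

    \mathrm{FDr}^{k} = \frac{1}{k} \sum_{i=1}^{k} \frac{\mathrm{FD}_i(p_{\mathrm{gen}}, p_{\mathrm{real}})}{Z_i}

where FD_i is the Fréchet Distance computed in the i-th representation space (e.g. Inception, DINOv2, CLIP) and Z_i rescales each space so that no single representation dominates the average.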

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and detailed review. We have carefully considered the major comments and revised the manuscript to provide the requested clarifications and additional experiments. Our point-by-point responses are as follows.

read point-by-point responses
  1. Referee: The FD-loss formulation (described in the abstract and method sections) does not specify how the matrix square root of C_r C_g and its derivative are computed when the sample covariance C_g estimated from 1024 samples in 2048-dimensional Inception space is singular (rank ≤1023). The standard FD expression is undefined or unstable without regularization (e.g., εI) or a pseudo-inverse; no equation or implementation detail addresses this, so the back-propagated gradients may optimize a regularized surrogate rather than true population FD. This directly undermines the claim that FD 'can in fact be effectively optimized' and that the 0.72 FID reflects genuine improvement.

    Authors: We agree that the original manuscript lacked sufficient implementation details on handling the matrix square root for potentially singular covariances. In our implementation, we regularize both covariance matrices by adding εI with ε = 1e-6 prior to computing the square root and its derivative, following standard practices for numerical stability in Fréchet Distance calculations. This regularization is applied consistently during both training and evaluation. We have updated the method section with the full mathematical formulation including this regularization term and provided pseudocode in the appendix. While the optimized objective is indeed a regularized FD, empirical results show that it effectively aligns the distributions, as evidenced by improvements in population-level FID computed on large independent sets. We believe this addresses the concern without undermining the core claim. revision: yes

  2. Referee: The reported quantitative results (0.72 FID for the one-step generator under Inception features) lack supporting details on gradient computation through the fixed representation extractor, ablation studies on population/batch sizes, controls for representation choice, or verification that optimization reduces the population-level FD rather than introducing feature-specific artifacts. These omissions make the central empirical claims difficult to verify from the given text.

    Authors: We thank the referee for pointing out these omissions. In the revised version, we have added substantial details and experiments: (i) The representation extractor is fixed and frozen, with gradients flowing directly from the FD-loss through the generated features to the generator parameters; we include a diagram and equations clarifying this. (ii) Ablation studies on population sizes (e.g., 10k vs 50k) and batch sizes (256 to 2048) are now in the appendix, showing robustness. (iii) Controls across multiple representation spaces (Inception, DINO, CLIP) are provided. (iv) We verify population-level FD reduction by computing FD on 50k samples pre- and post-optimization, demonstrating consistent decreases. These additions confirm the improvements reflect genuine distributional improvements rather than artifacts. revision: yes
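
A schematic of the verification described in (iv), reusing fd_loss and the precomputed real statistics from the sketch further up; generator.sample and the loop shape are illustrative assumptions, not the authors' code:

    import torch

    @torch.no_grad()
    def population_fd(generator, extractor, mu_r, cov_r, cov_r_sqrt,
                      n=50_000, batch=500):
        # Held-out check: FD on a large fresh sample, decoupled from the
        # small batches used for gradient steps during post-training.
        feats = []
        for _ in range(n // batch):
            images = generator.sample(batch)   # hypothetical 1-NFE sampling API
            feats.append(extractor(images))    # frozen feature extractor
        return fd_loss(torch.cat(feats), mu_r, cov_r, cov_r_sqrt)

    # Post-training step (sketch): gradients flow through the frozen
    # extractor into the generator; real statistics stay fixed throughout.
    # loss = fd_loss(extractor(generator.sample(1024)), mu_r, cov_r, cov_r_sqrt)
    # loss.backward(); optimizer.step(); optimizer.zero_grad()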

Circularity Check

0 steps flagged

No circularity: FD-loss is an empirical optimization procedure with independent experimental validation

full rationale

The paper introduces FD-loss by decoupling fixed population statistics (e.g., 50k real samples) from per-batch gradient computation (e.g., 1024 generated samples) and then reports the empirical outcomes of applying this loss to generators. The low reported FID values and visual-quality gains are direct results of the optimization process rather than quantities defined into the method by construction. No equations reduce a claimed prediction to a fitted parameter, no self-citation supplies a load-bearing uniqueness theorem, and no ansatz is smuggled via prior work. The derivation chain consists of a straightforward engineering proposal followed by experimental measurement; the final metrics are computed on held-out large-sample evaluations and are therefore not tautological with the training objective.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The approach depends on the mathematical feasibility of estimating and differentiating Fréchet Distance in representation space with chosen sample sizes, plus the introduction of a new multi-representation metric based on internal observations.

free parameters (2)
  • FD population size = e.g., 50k
    Example value of 50k chosen to enable accurate estimation of the full distribution for the loss.
  • gradient batch size = e.g., 1024
    Example value of 1024 chosen to allow practical gradient computation during optimization.
axioms (1)
  • domain assumption Fréchet Distance between two distributions in a fixed representation space can be estimated from finite samples and its gradient with respect to generator parameters can be computed for optimization.
    This underpins the entire FD-loss training procedure described in the abstract.
invented entities (1)
  • FDr^k no independent evidence
    purpose: A multi-representation Fréchet metric intended to provide more reliable evaluation when single-representation FID misranks sample quality.
    Introduced to address the abstract's observation that modern representations can yield better samples despite worse Inception FID.
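
A minimal sketch of how such a multi-representation score could be assembled from per-space FD values; the dictionary keys and normalizers are hypothetical, since the excerpt does not specify how each space is normalized:

    def fdr_k(per_space_fd: dict, normalizers: dict) -> float:
        # Average of normalized Fréchet Distance ratios across representation
        # spaces (e.g. "inception", "dinov2", "clip"); one FD value and one
        # assumed normalizer per space.
        return sum(per_space_fd[k] / normalizers[k] for k in per_space_fd) / len(per_space_fd)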

pith-pipeline@v0.9.0 · 5484 in / 1682 out tokens · 47623 ms · 2026-05-07T05:52:53.851369+00:00 · methodology

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM uses per-step conditional normalizing flows plus a trajectory-wide predictor to achieve exact-likelihood 4-step sampling that matches or exceeds baselines on text-to-image tasks.

  2. Normalizing Trajectory Models

    cs.CV · 2026-05 · unverdicted · novelty 7.0

    NTM models each generative reverse step as a conditional normalizing flow with a hybrid shallow-deep architecture, enabling exact-likelihood training and strong four-step sampling performance on text-to-image tasks.

Reference graph

Works this paper leans on

61 extracted references · 7 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Claude Code

    Anthropic. Claude Code. https://www.anthropic.com/claude-code, 2025

  2. [2]

    Demystifying MMD GANs

    Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018

  3. [3]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. BLIP3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv:2505.09568, 2025

  4. [4]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NeurIPS, 2017

  5. [5]

    ImageNet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009

  6. [6]

    Generative Modeling via Drifting

    Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv:2602.04770, 2026

  7. [7]

    Generative modeling using the sliced wasserstein distance

    Ishan Deshpande, Ziyu Zhang, and Alexander G Schwing. Generative modeling using the sliced Wasserstein distance. In CVPR, 2018

  8. [8]

    Image generation via minimizing Fréchet distance in discriminator feature space

    Khoa D Doan, Saurav Manchanda, Fengjiao Wang, Sathiya Keerthi, Avradeep Bhowmik, and Chandan K Reddy. Image generation via minimizing Fréchet distance in discriminator feature space. arXiv:2003.11774, 2020

  9. [9]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024

  10. [10]

    Scaling laws for reward model overoptimization

    Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In ICML, 2023

  11. [11]

    Mean flows for one-step generative modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. In NeurIPS, 2025

  12. [12]

    Improved mean flows: On the challenges of fast-forward generative models

    Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fast-forward generative models. In CVPR, 2026

  13. [13]

    Generative adversarial nets

    Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014

  14. [14]

    Problems of monetary management: the uk experience

    Charles AE Goodhart. Problems of monetary management: the UK experience. In Monetary Theory and Practice: The UK Experience, 1984

  15. [15]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022

  16. [16]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020

  17. [17]

    GANs trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017

  18. [18]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015

  19. [19]

    Rethinking FID: Towards a better evaluation metric for image generation

    Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking FID: Towards a better evaluation metric for image generation. In CVPR, 2024

  20. [20]

    The role of ImageNet classes in Fréchet Inception Distance

    Tuomas Kynkäänniemi, Tero Karras, Miika Aittala, Timo Aila, and Jaakko Lehtinen. The role of ImageNet classes in Fréchet Inception Distance. In ICLR, 2023

  21. [21]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In NeurIPS, 2019

  22. [22]

    Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning of latent diffusion transformers. In ICCV, 2025

  23. [23]

    Back to basics: Let denoising generative models denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. In CVPR, 2026

  24. [24]

    Autoregressive image generation without vector quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. In NeurIPS, 2024

  25. [25]

    Generative moment matching networks

    Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, 2015

  26. [26]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In ICLR, 2023

  27. [27]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In CVPR, 2022

  28. [28]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019

  29. [29]

    Simplifying, stabilizing and scaling continuous-time consistency models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. In ICLR, 2025

  30. [30]

    One-step Latent-free Image Generation with Pixel Mean Flows

    Yiyang Lu, Susie Lu, Qiao Sun, Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, Tianhong Li, Zhengyang Geng, and Kaiming He. One-step latent-free image generation with pixel mean flows. arXiv:2601.22158, 2026

  31. [31]

    Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models

    Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-Instruct: A universal approach for transferring knowledge from pre-trained diffusion models. In NeurIPS, 2023

  32. [32]

    One-step diffusion distillation through score implicit matching

    Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, and Guo-jun Qi. One-step diffusion distillation through score implicit matching. In NeurIPS, 2024

  33. [33]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, 2024

  34. [34]

    Backpropagating through Fréchet Inception Distance

    Alexander Mathiasen and Frederik Hvilshøj. Backpropagating through Fréchet Inception Distance. arXiv:2009.14075, 2020

  35. [35]

    Group diffusion: Enhancing image generation by unlocking cross-sample collaboration

    Sicheng Mo, Thao Nguyen, Richard Zhang, Nick Kolkin, Siddharth Srinivasan Iyer, Eli Shechtman, Krishna Kumar Singh, Yong Jae Lee, Bolei Zhou, and Yuheng Li. Group diffusion: Enhancing image generation by unlocking cross-sample collaboration. In CVPR, 2026

  36. [36]

    Mcgan: Mean and covariance feature matching gan

    Youssef Mroueh, Tom Sercu, and Vaibhava Goel. Mcgan: Mean and covariance feature matching gan. In ICML, 2017

  37. [37]

    Codex CLI

    OpenAI. Codex CLI. https://github.com/openai/codex, 2025

  38. [38]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

  39. [39]

    Scalable diffusion models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with Transformers. In ICCV, 2023

  40. [40]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021

  41. [41]

    Flowar: Scale-wise autoregressive image generation meets flow matching

    Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, and Liang-Chieh Chen. Flowar: Scale-wise autoregressive image generation meets flow matching. In ICML, 2025

  42. [42]

    Assessing generative models via precision and recall

    Mehdi SM Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. In NeurIPS, 2018

  43. [43]

    Improved techniques for training GANs

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016

  44. [44]

    Learning implicit generative models by matching perceptual features

    Cicero Nogueira dos Santos, Youssef Mroueh, Inkit Padhi, and Pierre Dognin. Learning implicit generative models by matching perceptual features. In ICCV, 2019

  45. [45]

    Consistency models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023

  46. [46]

    Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

    George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In NeurIPS, 2023

  47. [47]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016

  48. [48]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv:2502.14786, 2025

  49. [49]

    Pixnerd: Pixel neural field diffusion

    Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, and Limin Wang. Pixnerd: Pixel neural field diffusion. In ICLR, 2026

  50. [50]

    DDT: Decoupled diffusion transformer

    Shuai Wang, Zhi Tian, Weilin Huang, and Limin Wang. DDT: Decoupled diffusion transformer. In CVPR, 2026

  51. [51]

    Representation entanglement for generation: Training diffusion transformers is much easier than you think

    Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. In NeurIPS, 2025

  52. [52]

    Unsupervised feature learning via non-parametric instance discrimination

    Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018

  53. [53]

    Revisiting the evaluation of image synthesis with GANs

    Ceyuan Yang, Yichi Zhang, Qingyan Bai, Yujun Shen, Bo Dai, et al. Revisiting the evaluation of image synthesis with GANs. In NeurIPS, 2023

  54. [54]

    Latent denoising makes good visual tokenizers

    Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, and Yue Wang. Latent denoising makes good visual tokenizers. In ICLR, 2026

  55. [55]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, 2025

  56. [56]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, 2024

  57. [57]

    Autoregressive image generation with masked bit modeling

    Qihang Yu, Qihao Liu, Ju He, Xinyang Zhang, Yang Liu, Liang-Chieh Chen, and Xi Chen. Autoregressive image generation with masked bit modeling. arXiv:2602.09024, 2026

  58. [58]

    Representation alignment for generation: Training diffusion transformers is easier than you think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. In ICLR, 2025

  59. [59]

    Diffusion transformers with representation autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. In ICLR, 2026

  60. [60]

    Inductive moment matching

    Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. In ICML, 2025

  61. [61]

    Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation

    Mingyuan Zhou, Huangjie Zheng, Zhendong Wang, Mingzhang Yin, and Hai Huang. Score identity distillation: Exponentially fast distillation of pretrained diffusion models for one-step generation. In ICML, 2024