
arxiv: 2605.06797 · v1 · submitted 2026-05-07 · 💻 cs.LG

MIND: Monge Inception Distance for Generative Models Evaluation

Pith reviewed 2026-05-11 00:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords generative models · evaluation metrics · Fréchet Inception Distance · sliced Wasserstein distance · Inception features · sample efficiency · adversarial robustness

The pith

MIND replaces the Fréchet Inception Distance with a sliced Wasserstein metric that needs far fewer samples to evaluate generative models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MIND as an alternative to FID for assessing how closely a generative model's samples match a reference dataset. Instead of fitting Gaussians to high-dimensional features from an Inception network, it projects the features onto random lines and averages the resulting one-dimensional optimal transport distances. This removes the need to estimate large covariance matrices, which are unstable with limited samples. According to the paper, evaluations become reliable with roughly ten times fewer samples, run about two orders of magnitude faster, and hold up better against moment-matching attacks that can fool FID. The authors report that MIND scores from 5,000 samples track FID scores from 50,000 samples closely while discriminating better between good and bad models.

Core claim

MIND computes the sliced Wasserstein distance between the distributions of Inception-v3 features extracted from real and generated data. By averaging the exact one-dimensional Wasserstein distances obtained through sorting, the metric avoids all covariance estimation and achieves better sample efficiency and computational speed than FID while remaining correlated with it.
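For orientation, the standard sliced Wasserstein distance of order p between two feature distributions can be written as below. The symbols μ, ν, θ, σ and the order p are notation chosen here for illustration; the paper's exact normalization (for example the scaling factor α mentioned in the Figure 5 caption) is not reproduced and may differ.

```latex
% Standard definition: average the 1D Wasserstein cost over directions on the unit sphere.
\mathrm{SW}_p^p(\mu,\nu)
  = \int_{\mathbb{S}^{d-1}} W_p^p\big(\theta_\#\mu,\; \theta_\#\nu\big)\, d\sigma(\theta)

% For two projected samples of equal size n, the 1D cost reduces to sorting:
W_p^p\big(\hat\mu_n, \hat\nu_n\big)
  = \frac{1}{n}\sum_{i=1}^{n} \big| x_{(i)} - y_{(i)} \big|^p
```

Here θ_# μ denotes the distribution of the projection ⟨θ, X⟩ for X drawn from μ, σ is the uniform measure on the unit sphere, and x_(i), y_(i) are the sorted projected samples. In practice the integral is approximated by averaging over a finite number of random directions.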

What carries the argument

The sliced Wasserstein distance applied to Inception-v3 features, which reduces the comparison of two high-dimensional distributions to repeated one-dimensional optimal transport problems solved by sorting.
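The reduction to sorting can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the paper's implementation (its JAX and PyTorch versions appear in Figure 4 and are not reproduced here): the function names, the default of 100 projections, the order p = 2 and the equal-sample-size restriction are all choices made for this sketch, and the features are plain lists rather than real Inception-v3 embeddings.

```python
import math
import random


def wasserstein_1d(xs, ys, p=2):
    """p-th power of the 1D Wasserstein-p distance between equal-size samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal coupling pairs sorted values in order.
    return sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def random_unit_vector(dim, rng):
    """Direction drawn uniformly on the unit sphere (normalized Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def sliced_wasserstein(feats_a, feats_b, num_projections=100, p=2, seed=0):
    """Monte Carlo estimate of the sliced Wasserstein-p distance."""
    rng = random.Random(seed)
    dim = len(feats_a[0])
    total = 0.0
    for _ in range(num_projections):
        u = random_unit_vector(dim, rng)
        # Project every feature vector onto the direction u.
        proj_a = [sum(x * c for x, c in zip(row, u)) for row in feats_a]
        proj_b = [sum(x * c for x, c in zip(row, u)) for row in feats_b]
        total += wasserstein_1d(proj_a, proj_b, p)
    return (total / num_projections) ** (1.0 / p)


# Toy 2D "features": a unit square and a copy shifted by 3 along the first axis.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shifted = [[x + 3.0, y] for x, y in real]
print(sliced_wasserstein(real, real))     # identical samples give 0.0
print(sliced_wasserstein(real, shifted))  # positive, at most the shift length 3
```

No mean or covariance matrix is estimated anywhere; the cost per projection is dominated by the two sorts.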

If this is right

  • MIND with 5k samples achieves the evaluation performance of FID with 50k samples.
  • MIND is about two orders of magnitude faster to compute than FID.
  • It resists moment-matching adversarial attacks that can fool FID.
  • Even 1k or 2k samples yield informative scores for quick model iteration.
  • MIND maintains high correlation with FID on standard benchmarks.
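The moment-matching point in the list above can be illustrated with a one-dimensional toy. This is a hedged sketch, not the attack evaluated in the paper: the two samples are synthetic, the function names are made up for this example, and the Gaussian Fréchet distance is computed in 1D, where it reduces to (m1 - m2)^2 + (s1 - s2)^2. After an affine rescaling that matches mean and standard deviation, the Gaussian-fit distance drops to essentially zero while the sorting-based distance still sees the difference in shape.

```python
import statistics


def gaussian_frechet_1d(xs, ys):
    """Squared Fréchet distance between 1D Gaussians fitted to xs and ys."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return (mx - my) ** 2 + (sx - sy) ** 2


def wasserstein2_1d(xs, ys):
    """Squared 1D Wasserstein-2 distance via sorting (equal sample sizes)."""
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def match_moments(xs, target):
    """Affinely rescale xs so its mean and std equal those of target."""
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    mt, st = statistics.fmean(target), statistics.pstdev(target)
    return [(x - mx) / sx * st + mt for x in xs]


n = 1000
# Synthetic "real" sample: values spread evenly over [-1, 1].
real = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
# Synthetic "generated" sample: two point masses, a very different shape.
fake = [-1.0] * (n // 2) + [1.0] * (n // 2)

attacked = match_moments(fake, real)
fd = gaussian_frechet_1d(real, attacked)  # essentially zero after matching
w2 = wasserstein2_1d(real, attacked)      # stays clearly positive
print(fd, w2)
```

The toy only shows why a metric built on the first two moments can be satisfied by matching them; how strongly this carries over to high-dimensional Inception features is an empirical question addressed by the paper's experiments.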

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If MIND becomes standard, generative model development cycles could shorten because fewer samples are needed for each evaluation round.
  • The robustness property suggests MIND may better reflect true distributional differences rather than superficial statistics.
  • Future work could test whether combining MIND with other feature extractors yields further gains in domains beyond images.

Load-bearing premise

That the sliced Wasserstein distance on Inception-v3 features accurately reflects meaningful differences between real and generated distributions without needing the Gaussian assumption used by FID.

What would settle it

Run MIND and FID on the same set of models and datasets; if MIND with 5k samples fails to rank models in the same order as FID with 50k samples on a broad benchmark, or if human preference studies disagree with MIND rankings, the replacement claim would not hold.
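The ranking comparison described above could be checked with a rank-agreement statistic such as Kendall's tau. The sketch below is illustrative only: the function name is chosen here, ties are not handled, and the score lists are placeholder numbers, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists over the same models (no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1   # both metrics order the pair the same way
        elif sign < 0:
            discordant += 1   # the metrics disagree on this pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder scores for four hypothetical models (NOT from the paper).
mind_5k = [0.9, 0.5, 0.3, 0.2]
fid_50k = [40.0, 22.0, 9.0, 11.0]

tau = kendall_tau(mind_5k, fid_50k)
print(tau)  # 1.0 would mean identical rankings; here one pair is swapped
```

A tau near 1 across many models and datasets would support the replacement claim; a low tau, or disagreement with human preference rankings, would count against it.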

Figures

Figures reproduced from arXiv: 2605.06797 by Clement Crepy, Klaus Greff, Michael Eli Sander, Quentin Berthet, Romuald Elie, Yu-Han Wu.

Figure 1: (Left) MIND metric during a diffusion model training run on ImageNet-64 (log scale), illustrating how MIND5k can be used to replace FID50k, with a larger range; see Section 4.3. (Right) Correlation with number of training steps, which is better for MIND1k and MIND5k than for FID with 50k samples.
Figure 2: General pipeline for evaluating generative model sampling distance to a dataset.
Figure 3: Computation of MIND based on the idea of Sliced Wasserstein, illustrated in 2D with a single projection. (Left) Two samples of synthetic embeddings (orange and blue), along with the unit sphere and a random unit direction u. (Bottom Right) The two histograms of distributions of the projections along u. (Top Right) The associated cumulative distribution functions (cdf), the hatched area is related to 1D Was…
Figure 4: JAX (a) and PyTorch (b) implementation of …
Figure 5: (Right) shows that the MIND remains an affine relation with respect to FID while varying the dimension of the feature space in the log-log plot, justifying the choice of the scaling factor α ∝ d. [Plot panels: MIND and FID on linear axes and on log-log axes; legend D ∈ {384, 512, 768, 1024, 1152, 2048}.]
Figure 6: (Top Left and Middle) Behavior of the MIND and FID metric in n, to distinguish true images from the dataset (base, in blue) from generated images (model, in orange). (Bottom Left and Middle) Histogram of the trials for n = 5,000; a bigger gap is better. (Right) Probability of error defined in Section 4.4.1 for three values of M ∈ {10, 100, 1000}.
Figure 7: Sample complexity measured by the probability of error for the correct order at five different steps of training. Running evaluations during training of a diffusion model, we observe that instead of using FID50k (commonly used post-training because of the cost and time associated with the high sample size), we can use MIND5k (we evaluate their precisions more quantitatively in the rest of this section).
Figure 8: Sample complexity measured by the probability of detecting a small perturbation.
Figure 9: Walltime and peak memory comparison for MIND, MMD, and FID. We use in our evaluation M = 1000 for MIND and 50k to compute the reference mean and covariance for the FID.
Figure 11: Two elements of the batch, all initial images are the same. (…
Original abstract

We propose the Monge Inception Distance (MIND), a metric for evaluating generative models that addresses key limitations of the widely adopted Fréchet Inception Distance (FID). The MIND metric leverages the sliced Wasserstein distance to compare distributions by averaging one-dimensional optimal transport distances, efficiently computed via sorting. This approach circumvents the estimation of high-dimensional means and covariance matrices, which underlie FID's poor sample complexity and vulnerability to adversarial attacks. We empirically demonstrate three primary advantages: (i) it is more sample-efficient by one order of magnitude, (ii) it is faster to compute by two orders of magnitude, (iii) it is more robust to adversarial attacks such as moment-matching. We show that MIND with 5k samples can replace the evaluation performance of FID with 50k samples, providing high correlation with this standard benchmark and superior discriminative performance. We further demonstrate that even smaller sample sizes (e.g., 1k or 2k) remain highly informative for rapid model iteration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MIND, a metric for generative model evaluation that computes the sliced Wasserstein distance on Inception-v3 features instead of the Gaussian moment-matching used by FID. It claims three advantages: one-order-of-magnitude better sample efficiency, two-order-of-magnitude faster computation, and greater robustness to adversarial attacks such as moment matching. The central empirical result is that MIND evaluated on 5k samples achieves correlation and discriminative performance comparable to FID on 50k samples, with even smaller sizes (1k–2k) remaining informative.

Significance. If the reported efficiency and robustness gains hold across broader model classes and datasets, MIND could meaningfully reduce the computational cost of generative-model evaluation and enable faster iteration, especially in regimes where collecting 50k high-quality samples is expensive. The absence of fitted parameters and the direct use of optimal transport on projections are positive features that avoid some of the known instabilities of covariance estimation.

major comments (3)
  1. [§4.3] §4.3 and Figure 4: the claim that MIND at 5k samples 'replaces' FID at 50k samples is supported only by correlation coefficients on standard benchmarks; no statistical test (e.g., bootstrap confidence intervals or paired significance test) is reported for the difference in discriminative power, leaving open whether the observed parity is robust or dataset-specific.
  2. [§3.1] §3.1, Eq. (3): while the sliced Wasserstein distance is correctly noted to be a lower bound on the true Wasserstein distance, the manuscript provides no analysis or ablation showing that the directions missed by random projections do not systematically affect perceptual ranking of generative models; this is load-bearing for the claim that MIND faithfully substitutes for FID.
  3. [§4.4] §4.4: the robustness experiments against moment-matching attacks report single-run results without specifying the number of random projections, the attack magnitude, or variance across seeds; this weakens the generality of the 'more robust' claim relative to the sample-efficiency result.
minor comments (2)
  1. [§3.2] The definition of the number of random projections L is introduced in §3.2 but its default value and sensitivity are only shown in an appendix; moving a brief sensitivity plot to the main text would improve readability.
  2. [Table 1] Table 1 caption states 'correlation with FID' but does not specify whether Pearson or Spearman correlation is used; this should be stated explicitly.
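The last minor comment matters because Pearson and Spearman correlation can differ noticeably on the same data. The sketch below uses a synthetic monotone but nonlinear relation (y = x^3), chosen here purely for illustration; the helper names are made up and ties are not handled.

```python
import math


def pearson(xs, ys):
    """Pearson (linear) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """Rank of each value in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Synthetic data: perfectly monotone, but far from linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [x ** 3 for x in xs]

print(pearson(xs, ys))   # below 1: the relation is not linear
print(spearman(xs, ys))  # equals 1 up to rounding: the ordering is identical
```

Since metric values such as FID and a sliced Wasserstein score need not be linearly related, stating which coefficient is reported changes how a "correlation with FID" figure should be read.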

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the statistical support, empirical validation, and experimental details of the manuscript.

Point-by-point responses
  1. Referee: [§4.3] §4.3 and Figure 4: the claim that MIND at 5k samples 'replaces' FID at 50k samples is supported only by correlation coefficients on standard benchmarks; no statistical test (e.g., bootstrap confidence intervals or paired significance test) is reported for the difference in discriminative power, leaving open whether the observed parity is robust or dataset-specific.

    Authors: We agree that statistical tests are needed to substantiate the replacement claim. In the revised version we will report bootstrap confidence intervals (1000 resamples) on the Pearson and Spearman correlations between MIND@5k and FID@50k across all evaluated datasets. We will also add paired Wilcoxon signed-rank tests on the per-model discriminative scores (e.g., ranking accuracy on held-out model pairs) computed over 10 independent random seeds, thereby quantifying whether the observed parity is statistically significant and consistent across datasets. revision: yes

  2. Referee: [§3.1] §3.1, Eq. (3): while the sliced Wasserstein distance is correctly noted to be a lower bound on the true Wasserstein distance, the manuscript provides no analysis or ablation showing that the directions missed by random projections do not systematically affect perceptual ranking of generative models; this is load-bearing for the claim that MIND faithfully substitutes for FID.

    Authors: We acknowledge the absence of a dedicated ablation on projection coverage. While a complete theoretical characterization of missed directions remains an open question in the sliced optimal transport literature, we will add an empirical ablation that varies the number of random projections from 10 to 500 and measures the stability of model rankings on the same Inception features. We will also report the Kendall-tau distance between rankings obtained with increasing projection counts and show that rankings converge rapidly and remain consistent with FID rankings. This provides concrete evidence that, in practice on standard generative-model benchmarks, the directions captured by a modest number of projections suffice to preserve perceptual ordering. revision: partial

  3. Referee: [§4.4] §4.4: the robustness experiments against moment-matching attacks report single-run results without specifying the number of random projections, the attack magnitude, or variance across seeds; this weakens the generality of the 'more robust' claim relative to the sample-efficiency result.

    Authors: We agree that the robustness section requires additional specification and statistical reporting. In the revision we will explicitly state that all attack experiments use 100 random projections, detail the attack magnitudes (additive perturbations to the first two moments at levels 0.1, 0.5, and 1.0 times the feature standard deviation), and report mean and standard deviation of MIND and FID scores over five independent random seeds. These changes will make the robustness advantage directly comparable in rigor to the sample-efficiency results. revision: yes
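The bootstrap confidence intervals promised in the first response can be sketched as a percentile bootstrap over paired scores. This is a minimal illustration under assumptions made here: the data are synthetic, the function names and the 95% level are choices for this example, and only the count of 1000 resamples is taken from the response above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(xs, ys, stat, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a paired statistic."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    while len(estimates) < num_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:
            continue  # degenerate resample: correlation is undefined
        estimates.append(stat([xs[i] for i in idx], [ys[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * num_resamples)]
    upper = estimates[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# Synthetic paired scores (NOT from the paper): a noisy linear relation.
data_rng = random.Random(1)
metric_a = [float(i) for i in range(1, 21)]
metric_b = [2.0 * x + data_rng.gauss(0.0, 1.0) for x in metric_a]

low, high = bootstrap_ci(metric_a, metric_b, pearson)
print(pearson(metric_a, metric_b), (low, high))
```

Resampling whole (model, score pair) rows keeps the pairing intact; the same routine accepts any paired statistic, so a rank correlation could be passed in place of `pearson`.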

Circularity Check

0 steps flagged

MIND definition is direct and non-circular; empirical claims are external to the construction

full rationale

The paper defines MIND explicitly as the average of 1D optimal transport distances (via sorting) on random projections of Inception-v3 features. This is a straightforward application of the known sliced Wasserstein distance with no fitted parameters, no self-referential predictions, and no reduction of the central quantity to prior author results by construction. The reported sample-efficiency, speed, and robustness advantages are presented as empirical observations on standard benchmarks rather than algebraic identities. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from the authors' earlier work appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard optimal transport theory and the pre-trained Inception feature extractor; no new free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Sliced Wasserstein distance provides a useful approximation to the true Wasserstein distance between high-dimensional distributions.
    Invoked to justify replacing full Wasserstein or covariance-based comparison.
  • domain assumption: Inception-v3 features are a suitable embedding space for comparing image distributions.
    Inherited from FID and prior generative model evaluation literature.

pith-pipeline@v0.9.0 · 5486 in / 1327 out tokens · 62769 ms · 2026-05-11T00:46:19.317881+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Improved Baselines with Representation Autoencoders

    cs.CV 2026-05 conditional novelty 6.0

    RAE v2 reaches gFID 1.06 on ImageNet-256 in 80 epochs by combining multi-layer encoder sums, complementary REPA targets, and free guidance via output reparameterization.
