pith. machine review for the scientific record. sign in

arxiv: 2604.03064 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Gram-MMD: A Texture-Aware Metric for Image Realism Assessment

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:02 UTC · model grok-4.3

classification 💻 cs.CV
keywords image realism assessmentGram-MMDtexture metricgenerative modelsMaximum Mean DiscrepancyGram matricesimage quality evaluationfeature correlations
0
0 comments X

The pith

Gram-MMD judges generated image realism by measuring correlations between feature maps instead of semantic content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gram-MMD, a metric that extracts Gram matrices from intermediate layers of pretrained networks to capture textural and structural patterns in images. Existing methods like FID and CMMD compare distributions at a semantic level and can miss fine details that distinguish real from generated content. By vectorizing the upper triangle of these matrices and applying Maximum Mean Discrepancy against a real-image anchor set, the metric produces scores sensitive to lower-level characteristics. Hyperparameters are chosen through a meta-protocol that applies controlled degradations to MS-COCO images and checks rank correlation with perceived quality. Experiments on KADID-10k, RAISE, and cross-domain driving scenes show the metric preserves correct realism orderings where semantic metrics fail.

Core claim

Gram-MMD computes symmetric Gram matrices from activations at chosen layers of backbones such as DINOv2, VGG19, or Stable Diffusion VAE, extracts their upper-triangular elements, and reports the MMD distance to an anchor distribution of real images, thereby encoding textural and structural correlations at a granularity finer than global embeddings.

What carries the argument

Gram matrices formed from outer products of feature-map vectors at intermediate backbone layers, with MMD computed on their flattened upper-triangular parts.

If this is right

  • GMMD can flag textural artifacts in generated images that FID and CMMD overlook.
  • In domain-shift settings such as real versus synthetic driving scenes, GMMD maintains the expected ranking of realism.
  • The same Gram-MMD formulation works across multiple backbone architectures without retraining.
  • Meta-metric selection on controlled degradations produces hyperparameters that transfer to unseen datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining GMMD scores with a semantic metric could yield a two-axis realism evaluation that separates texture failures from semantic failures.
  • The approach might extend to video or 3D assets where frame-to-frame texture consistency matters.
  • Layer selection could be automated by maximizing the meta-metric correlation rather than fixed by hand.

Load-bearing premise

Gram matrices computed from intermediate activations of pretrained networks reliably encode the textural and structural differences that separate real photographs from generated images.

What would settle it

A controlled test set in which images differ only in fine-grained texture statistics while semantic content and global statistics remain matched, with human preference labels available; if GMMD scores show no correlation with those labels while semantic metrics do, the claim fails.

Figures

Figures reproduced from arXiv: 2604.03064 by Jo\'e Napolitano, Pascal Nguyen.

Figure 1
Figure 1. Figure 1: Overview of the GMMD pipeline. Images from the anchor and evaluation sets are passed through a pretrained [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Spearman ρ vs. layer index for each backbone, top-7 γ values (decreasing opacity) and best configura￾tion (⋆) [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of Spearman’s ρ across all configura￾tions, grouped by backbone. 6 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Spearman’s ρ vs. Kendall’s τ on KADID-10k for each configuration [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Spearman’s ρ (top) and Kendall’s τ (bottom) on KADID-10k across our 13 configurations and CMMD [2]. Figures 6 and 7 show that GMMD consistently outper￾forms CMMD on both Spearman’s ρ and Kendall’s τ . Both correlation coefficients are negative, as expected: as DMOS increases (stronger degradation), the MMD (a positive dis￾tance) decreases, since more degraded images move further from the real anchor distri… view at source ↗
Figure 10
Figure 10. Figure 10: Two examples illustrating MMD behaviour. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗
Figure 9
Figure 9. Figure 9: Linear regression between group-averaged MOS [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: GMMD scores for Virtual KITTI 2 (synthetic) [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Examples of 7 out of 20 degradation types (rows) applied to a MS-COCO reference image at severity levels 1, 3, [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: KADID-10k evaluation. Top: MS-COCO anchor images (left) and sample images from five degradation groups [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: RAISE realism assessment. Top: AI-generated images from groups at ranks 1, 2, 3, . . . , 23, 24, sorted by ascending [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Sample images from the three datasets used in the KITTI experiment: KITTI (real driving scenes), Virtual KITTI 2 [PITH_FULL_IMAGE:figures/full_fig_p013_15.png] view at source ↗
read the original abstract

Evaluating the realism of generated images remains a fundamental challenge in generative modeling. Existing distributional metrics such as the Frechet Inception Distance (FID) and CLIP-MMD (CMMD) compare feature distributions at a semantic level but may overlook fine-grained textural information that can be relevant for distinguishing real from generated images. We introduce Gram-MMD (GMMD), a realism metric that leverages Gram matrices computed from intermediate activations of pretrained backbone networks to capture correlations between feature maps. By extracting the upper-triangular part of these symmetric Gram matrices and measuring the Maximum Mean Discrepancy (MMD) between an anchor distribution of real images and an evaluation distribution, GMMD produces a representation that encodes textural and structural characteristics at a finer granularity than global embeddings. To select the hyperparameters of the metric, we employ a meta-metric protocol based on controlled degradations applied to MS-COCO images, measuring monotonicity via Spearman's rank correlation and Kendall's tau. We conduct experiments on both the KADID-10k database and the RAISE realness assessment dataset using various backbone architectures, including DINOv2, DC-AE, Stable Diffusion's VAE encoder, VGG19, and the AlexNet backbone from LPIPS, among others. We also demonstrate on a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars) that CMMD can incorrectly rank real images as less realistic than synthetic ones due to its semantic bias, while GMMD preserves the correct ordering. Our results suggest that GMMD captures complementary information to existing semantic-level metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gram-MMD (GMMD), a realism metric that extracts upper-triangular Gram matrices from intermediate activations of pretrained backbones (DINOv2, VGG19, LPIPS-AlexNet, etc.) and computes MMD against an anchor real-image distribution. Hyperparameters (backbone, layer, kernel) are chosen via a meta-metric that maximizes Spearman/Kendall monotonicity under controlled degradations on MS-COCO. Experiments on KADID-10k and RAISE report positive trends, and a cross-domain example (KITTI/Virtual KITTI/Stanford Cars) shows GMMD preserving correct real-vs-synthetic ordering where CMMD fails due to semantic bias. The central claim is that GMMD supplies complementary textural information to semantic metrics such as FID and CMMD.

Significance. If the generalization claim holds, GMMD would be a useful addition to the evaluation toolkit for generative models, addressing the known semantic bias of embedding-based distances. The Gram-matrix construction is parameter-light once hyperparameters are fixed and the meta-metric protocol is reproducible in principle, which strengthens the contribution relative to purely empirical metrics.

major comments (3)
  1. [Abstract] Abstract / meta-metric protocol: the hyperparameter search is driven exclusively by monotonicity under a fixed menu of controlled degradations on MS-COCO. No experiment is reported that verifies whether the selected configuration remains monotonic or correlates with human judgments when the distribution shift is produced by actual diffusion or GAN generators (texture repetition, high-frequency artifacts, color shifts). This assumption is load-bearing for the claim that GMMD generalizes beyond the meta-metric training degradations.
  2. [Experiments] Cross-domain experiment: the manuscript states that CMMD incorrectly ranks real images below synthetic ones while GMMD preserves the correct ordering, yet no numerical GMMD or CMMD scores, image counts, or backbone/layer choices are supplied for the KITTI/Virtual KITTI/Stanford Cars sets. Without these values it is impossible to judge the magnitude or statistical reliability of the reported ordering difference.
  3. [Experiments] KADID-10k and RAISE results: the abstract reports “positive experimental trends” but supplies neither error bars, ablation tables across backbones/layers, nor statistical significance tests. The absence of these quantities makes it difficult to determine whether the observed complementarity to CMMD/FID is robust or merely suggestive.
minor comments (2)
  1. [Abstract] The acronym GMMD is introduced in the title and abstract but the text alternates between Gram-MMD and GMMD; a single consistent abbreviation would reduce reader confusion.
  2. Backbone names such as “DC-AE” and “Stable Diffusion’s VAE encoder” appear without citation or architectural reference; adding the original papers or a brief description would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract / meta-metric protocol: the hyperparameter search is driven exclusively by monotonicity under a fixed menu of controlled degradations on MS-COCO. No experiment is reported that verifies whether the selected configuration remains monotonic or correlates with human judgments when the distribution shift is produced by actual diffusion or GAN generators (texture repetition, high-frequency artifacts, color shifts). This assumption is load-bearing for the claim that GMMD generalizes beyond the meta-metric training degradations.

    Authors: We appreciate this observation regarding the meta-metric protocol. The controlled degradations on MS-COCO were selected to emulate common textural artifacts (e.g., blur, noise, compression) that frequently appear in diffusion and GAN outputs. While we did not conduct separate monotonicity verification on actual generator samples for hyperparameter selection, the reported positive trends on KADID-10k and RAISE—which incorporate human perceptual judgments on real-world distortions—provide supporting evidence for applicability beyond the meta-training set. In the revision we will add a dedicated paragraph clarifying the design rationale for the degradations and their relation to generative artifacts. revision: partial

  2. Referee: [Experiments] Cross-domain experiment: the manuscript states that CMMD incorrectly ranks real images below synthetic ones while GMMD preserves the correct ordering, yet no numerical GMMD or CMMD scores, image counts, or backbone/layer choices are supplied for the KITTI/Virtual KITTI/Stanford Cars sets. Without these values it is impossible to judge the magnitude or statistical reliability of the reported ordering difference.

    Authors: We agree that the absence of numerical scores, image counts, and exact hyperparameter choices limits evaluation of the cross-domain results. This was an oversight in the current draft. The revised manuscript will include a table reporting GMMD and CMMD scores for KITTI, Virtual KITTI, and Stanford Cars, the number of images sampled from each set, and the specific backbone/layer used for GMMD. revision: yes

  3. Referee: [Experiments] KADID-10k and RAISE results: the abstract reports “positive experimental trends” but supplies neither error bars, ablation tables across backbones/layers, nor statistical significance tests. The absence of these quantities makes it difficult to determine whether the observed complementarity to CMMD/FID is robust or merely suggestive.

    Authors: We concur that error bars, ablations, and significance testing are needed to substantiate the claims of complementarity. The revised version will add error bars to the KADID-10k and RAISE results, include ablation tables across the tested backbones and layers, and report statistical significance tests (e.g., p-values on rank correlations) to demonstrate robustness. revision: yes

Circularity Check

0 steps flagged

Gram-MMD definition is direct from Gram matrices and MMD; no reduction to inputs by construction

full rationale

The paper defines GMMD explicitly via extraction of upper-triangular Gram matrices from intermediate activations of pretrained backbones followed by MMD computation between real and evaluation distributions. Hyperparameter selection (backbone, layer, kernel) occurs via an external meta-metric protocol measuring monotonicity under controlled degradations on MS-COCO; this choice is made once and does not enter the final metric equation or reduce any reported score to a fitted parameter by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core formulation. The derivation chain remains self-contained against the stated assumptions about what Gram matrices encode.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Gram matrices from standard pretrained backbones encode realism-relevant texture; no new free parameters are introduced beyond hyperparameter tuning via the meta-metric protocol, and no invented entities are postulated.

free parameters (1)
  • hyperparameters of GMMD
    Selected via meta-metric protocol on controlled degradations of MS-COCO images.
axioms (1)
  • domain assumption Gram matrices from intermediate activations capture textural and structural correlations relevant to image realism
    Invoked in the definition of the metric and in the claim of finer granularity than global embeddings.

pith-pipeline@v0.9.0 · 5582 in / 1339 out tokens · 64116 ms · 2026-05-13T20:02:55.060824+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

  1. [1]

    Heusel, H

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017

  2. [2]

    Jayasumana, S

    S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar. Rethinking FID: To- wards a better evaluation metric for image generation. InCVPR, 2024

  3. [3]

    H. Lin, V . Hosu, and D. Saupe. KADID-10k: A large- scale artificially distorted IQA database. InQoMEX, 2019

  4. [4]

    Spearman

    C. Spearman. The proof and measurement of associ- ation between two things.Am. J. Psychol., 15(1):72– 101, 1904

  5. [5]

    Mukherjee, S

    A. Mukherjee, S. Dubey, and S. Paul. RAISE: Real- ness assessment for image synthesis and evaluation. arXiv preprint arXiv:2505.19233, 2025

  6. [6]

    DINOv2: Learning Robust Visual Features without Supervision

    M. Oquab, T. Darcet, T. Moutakanni,et al.DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  7. [7]

    J. Chen, H. Cai, J. Chen, E. Xie, S. Yang, H. Tang, M. Li, Y . Lu, and S. Han. Deep compression autoen- coder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024

  8. [8]

    Rombach, A

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with la- tent diffusion models. InCVPR, 2022

  9. [9]

    Simonyan and A

    K. Simonyan and A. Zisserman. Very deep convolu- tional networks for large-scale image recognition. In ICLR, 2015

  10. [10]

    Goodfellow, J

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Ben- gio. Generative adversarial nets. InNeurIPS, 2014

  11. [11]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

  12. [12]

    Salimans, I

    T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. InNeurIPS, 2016

  13. [13]

    Grisse, A

    T. Grisse, A. Moisand, A. Asséman, K. Brokman, and V . Grégoire. How good are humans at detecting AI- generated images? Learnings from an experiment. arXiv preprint arXiv:2507.18640, 2025. 9

  14. [14]

    Gretton, K

    A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012

  15. [15]

    Radford, J

    A. Radford, J. W. Kim, C. Hallacy,et al.Learning transferable visual models from natural language su- pervision. InICML, 2021

  16. [16]

    L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style.arXiv preprint arXiv:1508.06576, 2015

  17. [17]

    Dang-Nguyen, C

    D.-T. Dang-Nguyen, C. Pasquini, V . Conotter, and G. Boato. RAISE: A raw images dataset for digital im- age forensics. InProc. 6th ACM MMSys, pp. 219–224, 2015

  18. [18]

    Bi ´nkowski, D

    M. Bi ´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. InICLR, 2018

  19. [19]

    Zhang, P

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep fea- tures as a perceptual metric. InCVPR, 2018

  20. [20]

    Jayasumana, R

    S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian mani- fold of symmetric positive definite matrices. InCVPR, 2013

  21. [21]

    A. Han, B. Mishra, P. Jawanpuria, and J. Gao. On Riemannian optimization over positive definite matri- ces with the Bures-Wasserstein geometry. InNeurIPS, 2021

  22. [22]

    Final report from the VQEG on the validation of objective mod- els of video quality assessment, phase II

    Video Quality Experts Group (VQEG). Final report from the VQEG on the validation of objective mod- els of video quality assessment, phase II. Technical report, 2003

  23. [23]

    M. D. Zeiler and R. Fergus. Visualizing and under- standing convolutional networks. InECCV, 2014

  24. [24]

    T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. InECCV, 2014

  25. [25]

    M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938

  26. [26]

    Geiger, P

    A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. InCVPR, 2012

  27. [27]

    Gaidon, Q

    A. Gaidon, Q. Wang, Y . Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016

  28. [28]

    Krause, M

    J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D ob- ject representations for fine-grained categorization. In ICCV Workshops, 2013

  29. [29]

    Raghu, T

    M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy. Do vision transformers see like con- volutional neural networks? InNeurIPS, 2021. 10 Appendix Table 2: 20 synthetic degradation types×10 severity levels applied to the MS-COCO anchor images. Linear interpolation: level 1→minimal parameter, level 10→maximal parameter. # KADID Nom Param. ...