arxiv: 2604.03064 · v1 · submitted 2026-04-03 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

Gram-MMD: A Texture-Aware Metric for Image Realism Assessment

Jo\'e Napolitano , Pascal Nguyen

Authors on Pith no claims yet

Pith reviewed 2026-05-13 20:02 UTC · model grok-4.3

classification 💻 cs.CV

keywords image realism assessmentGram-MMDtexture metricgenerative modelsMaximum Mean DiscrepancyGram matricesimage quality evaluationfeature correlations

0 comments

The pith

Gram-MMD judges generated image realism by measuring correlations between feature maps instead of semantic content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gram-MMD, a metric that extracts Gram matrices from intermediate layers of pretrained networks to capture textural and structural patterns in images. Existing methods like FID and CMMD compare distributions at a semantic level and can miss fine details that distinguish real from generated content. By vectorizing the upper triangle of these matrices and applying Maximum Mean Discrepancy against a real-image anchor set, the metric produces scores sensitive to lower-level characteristics. Hyperparameters are chosen through a meta-protocol that applies controlled degradations to MS-COCO images and checks rank correlation with perceived quality. Experiments on KADID-10k, RAISE, and cross-domain driving scenes show the metric preserves correct realism orderings where semantic metrics fail.

Core claim

Gram-MMD computes symmetric Gram matrices from activations at chosen layers of backbones such as DINOv2, VGG19, or Stable Diffusion VAE, extracts their upper-triangular elements, and reports the MMD distance to an anchor distribution of real images, thereby encoding textural and structural correlations at a granularity finer than global embeddings.

What carries the argument

Gram matrices formed from outer products of feature-map vectors at intermediate backbone layers, with MMD computed on their flattened upper-triangular parts.

If this is right

GMMD can flag textural artifacts in generated images that FID and CMMD overlook.
In domain-shift settings such as real versus synthetic driving scenes, GMMD maintains the expected ranking of realism.
The same Gram-MMD formulation works across multiple backbone architectures without retraining.
Meta-metric selection on controlled degradations produces hyperparameters that transfer to unseen datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Combining GMMD scores with a semantic metric could yield a two-axis realism evaluation that separates texture failures from semantic failures.
The approach might extend to video or 3D assets where frame-to-frame texture consistency matters.
Layer selection could be automated by maximizing the meta-metric correlation rather than fixed by hand.

Load-bearing premise

Gram matrices computed from intermediate activations of pretrained networks reliably encode the textural and structural differences that separate real photographs from generated images.

What would settle it

A controlled test set in which images differ only in fine-grained texture statistics while semantic content and global statistics remain matched, with human preference labels available; if GMMD scores show no correlation with those labels while semantic metrics do, the claim fails.

Figures

Figures reproduced from arXiv: 2604.03064 by Jo\'e Napolitano, Pascal Nguyen.

**Figure 1.** Figure 1: Overview of the GMMD pipeline. Images from the anchor and evaluation sets are passed through a pretrained [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Spearman ρ vs. layer index for each backbone, top-7 γ values (decreasing opacity) and best configuration (⋆) [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 5.** Figure 5: Distribution of Spearman’s ρ across all configurations, grouped by backbone. 6 [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Spearman’s ρ vs. Kendall’s τ on KADID-10k for each configuration [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Spearman’s ρ (top) and Kendall’s τ (bottom) on KADID-10k across our 13 configurations and CMMD [2]. Figures 6 and 7 show that GMMD consistently outperforms CMMD on both Spearman’s ρ and Kendall’s τ . Both correlation coefficients are negative, as expected: as DMOS increases (stronger degradation), the MMD (a positive distance) decreases, since more degraded images move further from the real anchor distri… view at source ↗

**Figure 10.** Figure 10: Two examples illustrating MMD behaviour. [PITH_FULL_IMAGE:figures/full_fig_p008_10.png] view at source ↗

**Figure 9.** Figure 9: Linear regression between group-averaged MOS [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 11.** Figure 11: GMMD scores for Virtual KITTI 2 (synthetic) [PITH_FULL_IMAGE:figures/full_fig_p008_11.png] view at source ↗

**Figure 12.** Figure 12: Examples of 7 out of 20 degradation types (rows) applied to a MS-COCO reference image at severity levels 1, 3, [PITH_FULL_IMAGE:figures/full_fig_p011_12.png] view at source ↗

**Figure 13.** Figure 13: KADID-10k evaluation. Top: MS-COCO anchor images (left) and sample images from five degradation groups [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗

**Figure 14.** Figure 14: RAISE realism assessment. Top: AI-generated images from groups at ranks 1, 2, 3, . . . , 23, 24, sorted by ascending [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗

**Figure 15.** Figure 15: Sample images from the three datasets used in the KITTI experiment: KITTI (real driving scenes), Virtual KITTI 2 [PITH_FULL_IMAGE:figures/full_fig_p013_15.png] view at source ↗

read the original abstract

Evaluating the realism of generated images remains a fundamental challenge in generative modeling. Existing distributional metrics such as the Frechet Inception Distance (FID) and CLIP-MMD (CMMD) compare feature distributions at a semantic level but may overlook fine-grained textural information that can be relevant for distinguishing real from generated images. We introduce Gram-MMD (GMMD), a realism metric that leverages Gram matrices computed from intermediate activations of pretrained backbone networks to capture correlations between feature maps. By extracting the upper-triangular part of these symmetric Gram matrices and measuring the Maximum Mean Discrepancy (MMD) between an anchor distribution of real images and an evaluation distribution, GMMD produces a representation that encodes textural and structural characteristics at a finer granularity than global embeddings. To select the hyperparameters of the metric, we employ a meta-metric protocol based on controlled degradations applied to MS-COCO images, measuring monotonicity via Spearman's rank correlation and Kendall's tau. We conduct experiments on both the KADID-10k database and the RAISE realness assessment dataset using various backbone architectures, including DINOv2, DC-AE, Stable Diffusion's VAE encoder, VGG19, and the AlexNet backbone from LPIPS, among others. We also demonstrate on a cross-domain driving scenario (KITTI / Virtual KITTI / Stanford Cars) that CMMD can incorrectly rank real images as less realistic than synthetic ones due to its semantic bias, while GMMD preserves the correct ordering. Our results suggest that GMMD captures complementary information to existing semantic-level metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Gram-MMD adds a texture-focused complement to FID and CMMD via Gram matrices and MMD, with a solid cross-domain ranking example, but its hyperparameter tuning on controlled degradations may not generalize to actual generator artifacts.

read the letter

Gram-MMD takes Gram matrices from intermediate activations of backbones like DINOv2 or VGG19, keeps the upper triangle, and runs MMD against a real-image anchor. This targets textural correlations that semantic embeddings often skip. The clearest win is the KITTI / Virtual KITTI / Stanford Cars test, where it keeps the expected real-over-synthetic ordering while CMMD reverses it because of semantic bias. That single demonstration shows the metric can supply information the others miss. Experiments on KADID-10k and RAISE with several backbones also indicate it is easy to compute and produces sensible trends under the meta-metric protocol. The construction itself is clean and extends prior distributional work without introducing circularity or heavy fitting. The main soft spot is the hyperparameter search. Selecting backbone, layer, and kernel by Spearman and Kendall correlation on MS-COCO degradations treats those specific degradations as a good proxy for the distribution shift created by diffusion or GAN models. If the dominant artifacts in current generators fall outside that menu, the chosen configuration could be suboptimal or even misaligned with human judgments on real generated images. The abstract gives positive trends but no full tables, error bars, or ablation on actual generative outputs, so the empirical support stays moderate until those checks are in. This paper is aimed at people who build or use realism metrics in generative vision. A reader who wants a practical texture-aware score to run alongside FID or CMMD will find the idea and the ordering test worth their time. The thinking is coherent and the motivation is honest, so the work deserves a serious referee to verify the generalization claim and tighten the validation.

Referee Report

3 major / 2 minor

Summary. The paper introduces Gram-MMD (GMMD), a realism metric that extracts upper-triangular Gram matrices from intermediate activations of pretrained backbones (DINOv2, VGG19, LPIPS-AlexNet, etc.) and computes MMD against an anchor real-image distribution. Hyperparameters (backbone, layer, kernel) are chosen via a meta-metric that maximizes Spearman/Kendall monotonicity under controlled degradations on MS-COCO. Experiments on KADID-10k and RAISE report positive trends, and a cross-domain example (KITTI/Virtual KITTI/Stanford Cars) shows GMMD preserving correct real-vs-synthetic ordering where CMMD fails due to semantic bias. The central claim is that GMMD supplies complementary textural information to semantic metrics such as FID and CMMD.

Significance. If the generalization claim holds, GMMD would be a useful addition to the evaluation toolkit for generative models, addressing the known semantic bias of embedding-based distances. The Gram-matrix construction is parameter-light once hyperparameters are fixed and the meta-metric protocol is reproducible in principle, which strengthens the contribution relative to purely empirical metrics.

major comments (3)

[Abstract] Abstract / meta-metric protocol: the hyperparameter search is driven exclusively by monotonicity under a fixed menu of controlled degradations on MS-COCO. No experiment is reported that verifies whether the selected configuration remains monotonic or correlates with human judgments when the distribution shift is produced by actual diffusion or GAN generators (texture repetition, high-frequency artifacts, color shifts). This assumption is load-bearing for the claim that GMMD generalizes beyond the meta-metric training degradations.
[Experiments] Cross-domain experiment: the manuscript states that CMMD incorrectly ranks real images below synthetic ones while GMMD preserves the correct ordering, yet no numerical GMMD or CMMD scores, image counts, or backbone/layer choices are supplied for the KITTI/Virtual KITTI/Stanford Cars sets. Without these values it is impossible to judge the magnitude or statistical reliability of the reported ordering difference.
[Experiments] KADID-10k and RAISE results: the abstract reports “positive experimental trends” but supplies neither error bars, ablation tables across backbones/layers, nor statistical significance tests. The absence of these quantities makes it difficult to determine whether the observed complementarity to CMMD/FID is robust or merely suggestive.

minor comments (2)

[Abstract] The acronym GMMD is introduced in the title and abstract but the text alternates between Gram-MMD and GMMD; a single consistent abbreviation would reduce reader confusion.
Backbone names such as “DC-AE” and “Stable Diffusion’s VAE encoder” appear without citation or architectural reference; adding the original papers or a brief description would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract / meta-metric protocol: the hyperparameter search is driven exclusively by monotonicity under a fixed menu of controlled degradations on MS-COCO. No experiment is reported that verifies whether the selected configuration remains monotonic or correlates with human judgments when the distribution shift is produced by actual diffusion or GAN generators (texture repetition, high-frequency artifacts, color shifts). This assumption is load-bearing for the claim that GMMD generalizes beyond the meta-metric training degradations.

Authors: We appreciate this observation regarding the meta-metric protocol. The controlled degradations on MS-COCO were selected to emulate common textural artifacts (e.g., blur, noise, compression) that frequently appear in diffusion and GAN outputs. While we did not conduct separate monotonicity verification on actual generator samples for hyperparameter selection, the reported positive trends on KADID-10k and RAISE—which incorporate human perceptual judgments on real-world distortions—provide supporting evidence for applicability beyond the meta-training set. In the revision we will add a dedicated paragraph clarifying the design rationale for the degradations and their relation to generative artifacts. revision: partial
Referee: [Experiments] Cross-domain experiment: the manuscript states that CMMD incorrectly ranks real images below synthetic ones while GMMD preserves the correct ordering, yet no numerical GMMD or CMMD scores, image counts, or backbone/layer choices are supplied for the KITTI/Virtual KITTI/Stanford Cars sets. Without these values it is impossible to judge the magnitude or statistical reliability of the reported ordering difference.

Authors: We agree that the absence of numerical scores, image counts, and exact hyperparameter choices limits evaluation of the cross-domain results. This was an oversight in the current draft. The revised manuscript will include a table reporting GMMD and CMMD scores for KITTI, Virtual KITTI, and Stanford Cars, the number of images sampled from each set, and the specific backbone/layer used for GMMD. revision: yes
Referee: [Experiments] KADID-10k and RAISE results: the abstract reports “positive experimental trends” but supplies neither error bars, ablation tables across backbones/layers, nor statistical significance tests. The absence of these quantities makes it difficult to determine whether the observed complementarity to CMMD/FID is robust or merely suggestive.

Authors: We concur that error bars, ablations, and significance testing are needed to substantiate the claims of complementarity. The revised version will add error bars to the KADID-10k and RAISE results, include ablation tables across the tested backbones and layers, and report statistical significance tests (e.g., p-values on rank correlations) to demonstrate robustness. revision: yes

Circularity Check

0 steps flagged

Gram-MMD definition is direct from Gram matrices and MMD; no reduction to inputs by construction

full rationale

The paper defines GMMD explicitly via extraction of upper-triangular Gram matrices from intermediate activations of pretrained backbones followed by MMD computation between real and evaluation distributions. Hyperparameter selection (backbone, layer, kernel) occurs via an external meta-metric protocol measuring monotonicity under controlled degradations on MS-COCO; this choice is made once and does not enter the final metric equation or reduce any reported score to a fitted parameter by construction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core formulation. The derivation chain remains self-contained against the stated assumptions about what Gram matrices encode.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Gram matrices from standard pretrained backbones encode realism-relevant texture; no new free parameters are introduced beyond hyperparameter tuning via the meta-metric protocol, and no invented entities are postulated.

free parameters (1)

hyperparameters of GMMD
Selected via meta-metric protocol on controlled degradations of MS-COCO images.

axioms (1)

domain assumption Gram matrices from intermediate activations capture textural and structural correlations relevant to image realism
Invoked in the definition of the metric and in the claim of finer granularity than global embeddings.

pith-pipeline@v0.9.0 · 5582 in / 1339 out tokens · 64116 ms · 2026-05-13T20:02:55.060824+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 1 internal anchor

[1]

Heusel, H

M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017

work page 2017
[2]

Jayasumana, S

S. Jayasumana, S. Ramalingam, A. Veit, D. Glasner, A. Chakrabarti, and S. Kumar. Rethinking FID: To- wards a better evaluation metric for image generation. InCVPR, 2024

work page 2024
[3]

H. Lin, V . Hosu, and D. Saupe. KADID-10k: A large- scale artificially distorted IQA database. InQoMEX, 2019

work page 2019
[4]

Spearman

C. Spearman. The proof and measurement of associ- ation between two things.Am. J. Psychol., 15(1):72– 101, 1904

work page 1904
[5]

Mukherjee, S

A. Mukherjee, S. Dubey, and S. Paul. RAISE: Real- ness assessment for image synthesis and evaluation. arXiv preprint arXiv:2505.19233, 2025

work page arXiv 2025
[6]

DINOv2: Learning Robust Visual Features without Supervision

M. Oquab, T. Darcet, T. Moutakanni,et al.DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

J. Chen, H. Cai, J. Chen, E. Xie, S. Yang, H. Tang, M. Li, Y . Lu, and S. Han. Deep compression autoen- coder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024

work page arXiv 2024
[8]

Rombach, A

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with la- tent diffusion models. InCVPR, 2022

work page 2022
[9]

Simonyan and A

K. Simonyan and A. Zisserman. Very deep convolu- tional networks for large-scale image recognition. In ICLR, 2015

work page 2015
[10]

Goodfellow, J

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Ben- gio. Generative adversarial nets. InNeurIPS, 2014

work page 2014
[11]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InNeurIPS, 2020

work page 2020
[12]

Salimans, I

T. Salimans, I. Goodfellow, W. Zaremba, V . Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. InNeurIPS, 2016

work page 2016
[13]

Grisse, A

T. Grisse, A. Moisand, A. Asséman, K. Brokman, and V . Grégoire. How good are humans at detecting AI- generated images? Learnings from an experiment. arXiv preprint arXiv:2507.18640, 2025. 9

work page arXiv 2025
[14]

Gretton, K

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012

work page 2012
[15]

Radford, J

A. Radford, J. W. Kim, C. Hallacy,et al.Learning transferable visual models from natural language su- pervision. InICML, 2021

work page 2021
[16]

L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style.arXiv preprint arXiv:1508.06576, 2015

work page Pith review arXiv 2015
[17]

Dang-Nguyen, C

D.-T. Dang-Nguyen, C. Pasquini, V . Conotter, and G. Boato. RAISE: A raw images dataset for digital im- age forensics. InProc. 6th ACM MMSys, pp. 219–224, 2015

work page 2015
[18]

Bi ´nkowski, D

M. Bi ´nkowski, D. J. Sutherland, M. Arbel, and A. Gretton. Demystifying MMD GANs. InICLR, 2018

work page 2018
[19]

Zhang, P

R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep fea- tures as a perceptual metric. InCVPR, 2018

work page 2018
[20]

Jayasumana, R

S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the Riemannian mani- fold of symmetric positive definite matrices. InCVPR, 2013

work page 2013
[21]

A. Han, B. Mishra, P. Jawanpuria, and J. Gao. On Riemannian optimization over positive definite matri- ces with the Bures-Wasserstein geometry. InNeurIPS, 2021

work page 2021
[22]

Final report from the VQEG on the validation of objective mod- els of video quality assessment, phase II

Video Quality Experts Group (VQEG). Final report from the VQEG on the validation of objective mod- els of video quality assessment, phase II. Technical report, 2003

work page 2003
[23]

M. D. Zeiler and R. Fergus. Visualizing and under- standing convolutional networks. InECCV, 2014

work page 2014
[24]

T.-Y . Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. InECCV, 2014

work page 2014
[25]

M. G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938

work page 1938
[26]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. InCVPR, 2012

work page 2012
[27]

Gaidon, Q

A. Gaidon, Q. Wang, Y . Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016

work page 2016
[28]

Krause, M

J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3D ob- ject representations for fine-grained categorization. In ICCV Workshops, 2013

work page 2013
[29]

Raghu, T

M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovitskiy. Do vision transformers see like con- volutional neural networks? InNeurIPS, 2021. 10 Appendix Table 2: 20 synthetic degradation types×10 severity levels applied to the MS-COCO anchor images. Linear interpolation: level 1→minimal parameter, level 10→maximal parameter. # KADID Nom Param. ...

work page 2021