Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments
Pith reviewed 2026-05-10 18:40 UTC · model grok-4.3
The pith
Deep neural networks that predict human image authenticity ratings produce attribution maps that disagree across architectures
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Several pretrained vision models fitted with lightweight regression heads predict human authenticity ratings of images at levels reaching about 80 percent of the noise ceiling. Attribution maps from Grad-CAM, LIME, and multiscale pixel masking remain stable within a given architecture across random seeds but show weak agreement across different architectures, even when predictive performance is comparable. VGG-based models mainly track overall image quality rather than authenticity-specific features. Ensembles of models raise prediction accuracy and support image-level analysis via pixel masking, yet the cross-architecture disagreement in maps persists. The authors therefore conclude that the networks do not produce identifiable explanations for the judgments they predict.
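As a concrete, though assumed, illustration of this setup: a frozen backbone with a lightweight regression head might be fitted as follows. The backbone choice (ResNet-50), pooled features, and ridge penalty are illustrative, not the authors' exact configuration.

```python
# Minimal sketch: frozen pretrained backbone + lightweight regression head.
# Backbone choice, pooled features, and the ridge penalty are illustrative
# assumptions, not the paper's exact configuration.
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose pooled features instead of class logits
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, H, W), already resized and normalized for the backbone."""
    return backbone(images).cpu().numpy()

def fit_and_score(train_imgs, train_ratings, test_imgs, test_ratings):
    head = Ridge(alpha=1.0).fit(extract_features(train_imgs), train_ratings)
    preds = head.predict(extract_features(test_imgs))
    r, _ = pearsonr(preds, test_ratings)
    return head, r  # r is then compared against the noise ceiling
```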
What carries the argument
Comparison of attribution maps (visualizations of image regions most influential to each model's rating) across multiple network architectures that achieve similar accuracy in predicting human judgments. The comparison serves as the test for whether any map can be identified as reflecting the cues humans actually use.
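The paper's appendix figures report Spearman correlations between prototype Grad-CAM maps and intersection-over-union of the top 5% (and 25%) of pixels for pairs of architectures. A minimal sketch of both agreement measures follows; the 5% threshold is illustrative.

```python
# Sketch of two agreement measures between attribution maps from different
# architectures: Spearman rank correlation over pixels and intersection-over-
# union of the top-k% most important pixels.
import numpy as np
from scipy.stats import spearmanr

def map_agreement(map_a: np.ndarray, map_b: np.ndarray, top_frac: float = 0.05):
    a, b = map_a.ravel(), map_b.ravel()
    rho, _ = spearmanr(a, b)                       # rank agreement over all pixels
    k = max(1, int(top_frac * a.size))
    top_a = set(np.argsort(a)[-k:])                # indices of the k most important pixels
    top_b = set(np.argsort(b)[-k:])
    iou = len(top_a & top_b) / len(top_a | top_b)  # overlap of "important" regions
    return rho, iou
```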
If this is right
- Attribution maps are more consistent within an architecture for images humans rate as highly authentic.
- VGG models base ratings primarily on general image quality rather than authenticity cues.
- Ensembles of models improve accuracy in matching human authenticity judgments.
- Post-hoc explanations from models that predict behavior well supply only weak evidence for the underlying cognitive mechanisms.
Where Pith is reading between the lines
- The same pattern of predictive success without cross-model agreement on explanations may appear in models of other human visual judgments such as emotion detection or scene categorization.
- Direct experiments that alter specific image features and measure changes in human ratings could isolate the actual cues without depending on model maps.
- Adding constraints that force models to match not only final ratings but also human reaction times or eye-movement patterns might increase consistency across architectures.
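The last point could, for example, be operationalized as a multi-task head trained to match both ratings and reaction times. Everything in the sketch below (head structure, loss weights, the availability of reaction-time data) is hypothetical, not something the paper implements.

```python
# Hypothetical multi-task objective: a head that must match both mean
# authenticity ratings and per-image reaction times. Weights are illustrative.
import torch
import torch.nn as nn

class TwoTargetHead(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.rating = nn.Linear(feat_dim, 1)  # predicts the authenticity rating
        self.rt = nn.Linear(feat_dim, 1)      # predicts the reaction time

    def forward(self, feats):
        return self.rating(feats), self.rt(feats)

def multitask_loss(pred_rating, pred_rt, true_rating, true_rt, w_rt=0.5):
    mse = nn.functional.mse_loss
    return mse(pred_rating.squeeze(-1), true_rating) + w_rt * mse(pred_rt.squeeze(-1), true_rt)
```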
Load-bearing premise
That agreement between the important regions highlighted by different model architectures is necessary to confirm that those regions reflect the cues humans use when judging authenticity.
What would settle it
If two architectures with comparable accuracy in predicting human ratings produce nearly identical attribution maps on the same images, that result would undermine the claim that explanations are non-identifiable.
Original abstract
Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments. More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.
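The abstract's "multiscale pixel masking" is an occlusion-style attribution method; a minimal sketch is given below. Patch sizes, stride, and the fill value are assumptions, not the authors' settings.

```python
# Occlusion-style attribution sketch: mask square patches at several scales and
# record how much the predicted authenticity rating changes when each region is hidden.
import numpy as np

def occlusion_map(image: np.ndarray, predict, patch_sizes=(8, 16, 32), fill=0.5):
    """image: (H, W, C) in [0, 1]; predict: callable mapping an image to a scalar rating."""
    H, W, _ = image.shape
    baseline = predict(image)
    heat = np.zeros((H, W))
    counts = np.zeros((H, W))
    for p in patch_sizes:
        for y in range(0, H - p + 1, p):
            for x in range(0, W - p + 1, p):
                masked = image.copy()
                masked[y:y + p, x:x + p, :] = fill
                drop = baseline - predict(masked)  # rating change when this region is hidden
                heat[y:y + p, x:x + p] += drop
                counts[y:y + p, x:x + p] += 1
    return heat / np.maximum(counts, 1)            # average importance across scales
```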
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that deep neural networks can accurately predict human authenticity judgments of images (reaching ~80% of the noise ceiling) but that post-hoc attribution methods (Grad-CAM, LIME, multiscale pixel masking) do not produce consistent, identifiable explanations across architectures. After fitting lightweight regression heads to frozen pretrained vision models, the authors report within-architecture stability (especially for EfficientNetB3 and Barlow Twins) but weak cross-architecture agreement, note that VGG models track image quality rather than authenticity cues, and show that ensembles improve both prediction and image-level attribution. They conclude that successful behavioral models yield only weak evidence for cognitive mechanisms.
Significance. If the core empirical pattern holds, the work usefully cautions against treating post-hoc attributions from high-performing DNNs as direct evidence of human-like cues in perceptual judgment tasks. It is strengthened by the systematic multi-architecture, multi-method design and the constructive use of ensembles. The findings are relevant to both interpretability research in computer vision and cognitive modeling of authenticity perception.
major comments (2)
- [Cross-architecture attribution results] Abstract and corresponding results section: The inference that weak agreement across architectures demonstrates non-identifiable explanations treats cross-architecture consistency as a necessary condition for attributions to reflect underlying human cues. The manuscript demonstrates within-architecture stability but does not test the alternative that divergent maps could still be veridical if architectures encode the same cues via different internal representations (e.g., texture statistics versus edge patterns). This alternative is not ruled out by the reported controls and therefore weakens the central non-identifiability claim.
- [Methods and VGG analysis] Results on model performance: The exclusion of VGG models on the grounds that they track image quality rather than authenticity-specific variance is load-bearing for the remaining cross-architecture comparisons, yet the manuscript provides insufficient detail on how this reliance was quantified (e.g., specific quality metrics or correlation thresholds) and whether analogous checks were applied to the other architectures retained in the analysis.
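Purely as an illustration of how the requested check might look: regress a generic quality score out of both the model predictions and the human ratings, and ask how much agreement remains. The quality metric is left abstract here, and this is not the authors' procedure.

```python
# Illustrative check: does a model's predicted rating track authenticity beyond
# generic image quality? Quality scores are assumed to come from some
# no-reference quality metric; the partial-correlation recipe itself is standard.
import numpy as np
from scipy.stats import pearsonr

def partial_corr(pred, human, quality):
    """Correlation between predictions and human ratings after regressing out quality."""
    def residualize(y, x):
        design = np.column_stack([np.ones_like(x), x])
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        return y - design @ beta
    q = np.asarray(quality, float)
    r, _ = pearsonr(residualize(np.asarray(pred, float), q),
                    residualize(np.asarray(human, float), q))
    return r
```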
minor comments (3)
- [Experimental setup] Experimental details: Information on data splits, exact image counts, computation of the noise ceiling, and statistical tests for attribution-map agreement measures is missing or too brief to allow full evaluation of the reported predictive performance and consistency results.
- [Figures] Visualization: Additional attribution-map examples stratified by authenticity rating level, together with side-by-side within- versus across-architecture panels, would improve clarity of the consistency claims.
- [Introduction/Discussion] Related work: The positioning would benefit from explicit citations to prior studies on the robustness and identifiability of post-hoc explanations in perceptual DNN models.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have prompted us to clarify several aspects of our analysis and strengthen the manuscript. We address each major comment below.
Point-by-point responses
Referee: Cross-architecture attribution results (abstract and corresponding results section): The inference that weak agreement across architectures demonstrates non-identifiable explanations treats cross-architecture consistency as a necessary condition for attributions to reflect underlying human cues. The manuscript demonstrates within-architecture stability but does not test the alternative that divergent maps could still be veridical if architectures encode the same cues via different internal representations (e.g., texture statistics versus edge patterns). This alternative is not ruled out by the reported controls and therefore weakens the central non-identifiability claim.
Authors: We agree that our manuscript does not explicitly rule out the possibility that different architectures could encode the same cues through different representations. Our argument for non-identifiability rests on the observation that multiple high-performing models produce divergent attributions despite similar predictive accuracy, making it difficult to identify a unique set of cues from the model behavior. We will add a paragraph in the discussion section acknowledging this alternative explanation and discussing its implications for the interpretability of post-hoc attributions. revision: partial
Referee: Methods and VGG analysis (results on model performance): The exclusion of VGG models on the grounds that they track image quality rather than authenticity-specific variance is load-bearing for the remaining cross-architecture comparisons, yet the manuscript provides insufficient detail on how this reliance was quantified (e.g., specific quality metrics or correlation thresholds) and whether analogous checks were applied to the other architectures retained in the analysis.
Authors: We acknowledge that the manuscript lacks sufficient detail on the quantification of VGG models' reliance on image quality. We will revise the Methods section to provide the specific metrics and procedures used to determine this, as well as the results of analogous checks on the other architectures. revision: yes
Circularity Check
No circularity: purely empirical comparisons with independent results
Full rationale
The paper performs an empirical analysis by fitting lightweight regression heads to frozen pretrained vision models, generating attribution maps via Grad-CAM, LIME, and pixel masking, and measuring within- and cross-architecture consistency. Predictive performance reaches ~80% of the noise ceiling for several models, and the central finding of weak cross-architecture agreement in attributions is reported as a direct observational result. No derivation, equation, or claim reduces to a fitted parameter by construction, no self-citation chain bears the load of the identifiability conclusion, and no ansatz or uniqueness theorem is smuggled in. The work is self-contained against external benchmarks of model performance and attribution stability.
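For context on the "~80% of the noise ceiling" figure: a common way to estimate such a ceiling from repeated human ratings is split-half reliability with a Spearman-Brown correction. The sketch below follows that recipe; the paper's exact procedure may differ.

```python
# Sketch of a split-half noise-ceiling estimate from per-rater ratings
# (array of shape raters x images). A common recipe, not necessarily the paper's.
import numpy as np
from scipy.stats import pearsonr

def split_half_ceiling(ratings: np.ndarray, n_splits: int = 100, seed: int = 0):
    rng = np.random.default_rng(seed)
    n_raters = ratings.shape[0]
    rs = []
    for _ in range(n_splits):
        perm = rng.permutation(n_raters)
        half_a = ratings[perm[: n_raters // 2]].mean(axis=0)
        half_b = ratings[perm[n_raters // 2 :]].mean(axis=0)
        r, _ = pearsonr(half_a, half_b)
        rs.append(2 * r / (1 + r))  # Spearman-Brown correction to the full sample
    return float(np.mean(rs))
```

Under this recipe, reaching about 80% of the noise ceiling would mean the model-human correlation is roughly 0.8 times the corrected split-half estimate.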
Axiom & Free-Parameter Ledger