AGOP as Explanation: From Feature Learning to Per-Sample Attribution in Image Classifiers
Pith reviewed 2026-05-14 20:02 UTC · model grok-4.3
The pith
The Average Gradient Outer Product matrix from training data supplies a prior that improves per-sample attribution maps in image classifiers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Average Gradient Outer Product matrix M, computed over the training distribution, supplies a fixed prior diag(M). Its normalized square root, multiplied into a test-sample gradient, produces attribution maps that agree more closely with pixel-level ground truth than Integrated Gradients, SmoothGrad, GradCAM, or VanillaGrad; the companion AGOP-Global map, diag(M) itself, requires only a disk lookup at inference time.
What carries the argument
The diagonal of the AGOP matrix M, used either to weight a per-sample gradient or directly as a saliency map, serves as a training-derived importance prior.
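The weighting the abstract describes is a one-line operation once diag(M) is available; a minimal NumPy sketch, where `grad` is a flattened per-sample input gradient and `diag_m` the accumulated AGOP diagonal (both names are illustrative, not the paper's code):

```python
import numpy as np

def agop_weighted_attribution(grad, diag_m):
    """Scale a per-sample input gradient by the AGOP training prior.

    grad:   gradient of the class score w.r.t. input pixels, flattened
    diag_m: diagonal of the AGOP matrix M accumulated over training
    """
    # Normalized square-root weighting: sqrt(diag(M) / max diag(M))
    weights = np.sqrt(diag_m / diag_m.max())
    # Pixels that are consistently important over training are amplified;
    # pixels the training distribution never uses are suppressed.
    return weights * grad
```

Because the weights depend only on training statistics, the method adds no cost beyond the single gradient pass already needed for VanillaGrad.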
If this is right
- AGOP-Global attribution delivers pixel-level explanations at zero additional inference cost after a single training-time accumulation.
- The same prior improves explanation quality on both low-resolution synthetic images and higher-resolution photorealistic scenes with standard architectures.
- diag(M) quality as an attribution prior continues to increase after the network's classification accuracy has stopped rising.
- Gradient-based attribution can be strengthened by a global training statistic without retraining or architectural changes.
- GradCAM and similar methods lose spatial fidelity on small images while the AGOP variants remain unaffected.
Where Pith is reading between the lines
- Attribution methods may benefit more from incorporating global training statistics than from purely local gradient operations at test time.
- The shared matrix structure between feature learning and explanation suggests that improving one could automatically improve the other.
- The approach could be tested for token-level attribution in sequence models by accumulating analogous outer products over training text.
- Practitioners could accumulate the AGOP diagonal as a byproduct of ordinary training to obtain ready-made explanation tools.
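The last bullet's accumulate-as-a-byproduct idea can be sketched as a running estimate of diag(M). A minimal sketch, assuming diag(M) is the mean squared input gradient over training samples (the class and method names are hypothetical, not the paper's hook):

```python
import numpy as np

class AGOPDiagAccumulator:
    """Running estimate of diag(M) = E[grad * grad] over training samples.

    Only the diagonal is kept, so storage is one scalar per input pixel;
    the full outer product is never materialized.
    """
    def __init__(self, n_features):
        self.sum_sq = np.zeros(n_features)
        self.count = 0

    def update(self, batch_grads):
        # batch_grads: (batch, n_features) input gradients from one step
        self.sum_sq += (batch_grads ** 2).sum(axis=0)
        self.count += batch_grads.shape[0]

    def diag_m(self):
        # Mean of squared gradients seen so far
        return self.sum_sq / max(self.count, 1)
```

Calling `update` once per training batch and writing `diag_m()` to disk at the end would yield the AGOP-Global map with no separate accumulation run.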
Load-bearing premise
The AGOP matrix derived from the training distribution supplies an unbiased prior that reliably reduces gradient noise for test samples even when the test distribution differs from training.
What would settle it
Compute mIoU of AGOP-Weighted attributions on a test set deliberately drawn from a visibly shifted distribution; if performance drops below that of plain VanillaGrad, the training prior introduces systematic error.
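The proposed test hinges on mIoU against pixel-level ground truth. One common convention for binarizing a continuous attribution map before computing IoU is a top-k threshold with k equal to the ground-truth mask size, sketched below (the paper's exact protocol may differ):

```python
import numpy as np

def attribution_iou(attr, gt_mask):
    """IoU between a thresholded attribution map and a ground-truth mask.

    attr:    continuous attribution map (any shape)
    gt_mask: boolean pixel-level ground truth of the same shape
    """
    k = int(gt_mask.sum())
    # Keep the k most strongly attributed pixels (ties may admit extras)
    thresh = np.sort(attr.ravel())[-k]
    pred = attr >= thresh
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union
```

Averaging this score over a deliberately shifted test set, for both AGOP-Weighted and VanillaGrad, would give the comparison the falsification test calls for.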
Original abstract
The Average Gradient Outer Product (AGOP) governs feature learning in neural networks: the Neural Feature Ansatz states that weight Gram matrices at each layer align with the corresponding AGOP matrices computed over the training distribution. We ask a complementary question: can this same quantity serve as a post-hoc attribution method for explaining individual predictions? We introduce AGOP-Weighted: a novel attribution method that multiplies the per-sample gradient by sqrt(diag(M) / max diag(M)), a training-distribution prior that suppresses gradient noise and amplifies consistently important pixels -- a combination not present in any prior attribution method. We formalise two companion variants -- AGOP-Local (per-sample gradient, equivalent to VanillaGrad) and AGOP-Global (diag(M) directly as a zero-cost saliency map) -- and implement an efficient training-time accumulation hook; AGOP-Global then requires zero inference cost (disk lookup) while AGOP-Weighted requires only a single gradient pass. We conduct the first rigorous comparison of AGOP attribution against Integrated Gradients (IG), SmoothGrad, GradCAM, and VanillaGrad across two benchmarks with pixel-level ground truth: (i) the synthetic XAI-TRIS benchmark (four classification scenarios, 8x8 images, CNN8by8) and (ii) the photorealistic CLEVR-XAI benchmark (ResNet-18 fine-tuned from ImageNet). AGOP-Weighted achieves 44% higher mIoU than IG on linear tasks; AGOP-Global achieves 7x higher mIoU than IG on multiplicative tasks (where IG falls below random) at zero inference cost. Both findings generalise to ResNet-18 on CLEVR-XAI (+18% and +37% respectively). We further show that GradCAM fails on small-resolution images due to spatial resolution collapse, and that diag(M) quality improves monotonically throughout training even after classification accuracy has plateaued.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes using the Average Gradient Outer Product (AGOP) matrix, precomputed from the training distribution, to derive post-hoc attribution methods for image classifiers. Building on the Neural Feature Ansatz, it introduces AGOP-Weighted (per-sample gradient scaled by sqrt(diag(M)/max(diag(M)))), AGOP-Local (equivalent to VanillaGrad), and AGOP-Global (diag(M) as zero-cost saliency map). An efficient training-time accumulation hook is described. Rigorous evaluations on XAI-TRIS (CNN8by8, four scenarios) and CLEVR-XAI (ResNet-18) benchmarks with pixel-level ground truth claim AGOP-Weighted yields 44% higher mIoU than Integrated Gradients on linear tasks, AGOP-Global yields 7x higher mIoU on multiplicative tasks (where IG is below random), with generalization to ResNet-18 (+18% and +37% respectively). Additional results note GradCAM failure on small-resolution images and monotonic improvement of diag(M) quality during training after accuracy plateaus.
Significance. If the central empirical claims hold after verification, the work provides a theoretically grounded, low-cost attribution technique that leverages existing training statistics to improve explanation fidelity over standard gradient-based methods. The zero-inference-cost AGOP-Global variant and the training hook are practical strengths, as is the demonstration that AGOP quality continues to improve post-convergence. The results on benchmarks with explicit pixel ground truth offer falsifiable, quantitative evidence linking feature-learning quantities to per-sample attributions.
major comments (2)
- [Abstract and evaluation sections] The central performance claims (44% mIoU gain on linear tasks, 7x on multiplicative tasks, and ResNet-18 generalization) rest on the fixed AGOP prior suppressing gradient noise without introducing systematic distortion when test samples exhibit distribution shift relative to training (different object counts, positions, or combinations in CLEVR-XAI). The evaluation does not appear to enforce or quantify strong shifts in the reported splits, so the reported gains could partly reflect prior alignment rather than explanatory fidelity; a concrete test (e.g., controlled shift experiments or per-sample error analysis) is needed to substantiate the assumption.
- [Abstract and results] Soundness of the mIoU numbers requires explicit reporting of data splits, number of runs, statistical significance tests, and whether any normalization constants in the sqrt(diag(M)/max(diag(M))) scaling were tuned after seeing test results. Without these, it is impossible to rule out post-hoc tuning or split leakage affecting the headline comparisons against IG, SmoothGrad, GradCAM, and VanillaGrad.
minor comments (2)
- [Methods] Clarify the exact definition and accumulation procedure for the AGOP matrix M in the methods section, including whether it is computed only on correctly classified training samples or the full set.
- [Abstract] The statement that AGOP-Global requires 'zero inference cost' should note the one-time training-time cost and storage requirement for the precomputed diag(M).
Simulated Author's Rebuttal
We thank the referee for the constructive comments on robustness under distribution shift and experimental transparency. We address both points below and will revise the manuscript accordingly to strengthen the claims.
Point-by-point responses
Referee: [Abstract and evaluation sections] The central performance claims (44% mIoU gain on linear tasks, 7x on multiplicative tasks, and ResNet-18 generalization) rest on the fixed AGOP prior suppressing gradient noise without introducing systematic distortion when test samples exhibit distribution shift relative to training (different object counts, positions, or combinations in CLEVR-XAI). The evaluation does not appear to enforce or quantify strong shifts in the reported splits, so the reported gains could partly reflect prior alignment rather than explanatory fidelity; a concrete test (e.g., controlled shift experiments or per-sample error analysis) is needed to substantiate the assumption.
Authors: We agree that quantifying distribution shift is valuable for validating that gains reflect explanatory fidelity rather than prior alignment. CLEVR-XAI incorporates shifts via varying object counts, positions, and combinations between train and test splits by design, and XAI-TRIS uses distinct scenarios. However, we did not explicitly measure shift magnitude or run controlled tests. In revision, we will add a controlled shift experiment on XAI-TRIS (varying object positions/counts in held-out test sets) and per-sample error analysis correlating attribution mIoU with shift indicators. This will be reported in a new subsection. revision: yes
Referee: [Abstract and results] Soundness of the mIoU numbers requires explicit reporting of data splits, number of runs, statistical significance tests, and whether any normalization constants in the sqrt(diag(M)/max(diag(M))) scaling were tuned after seeing test results. Without these, it is impossible to rule out post-hoc tuning or split leakage affecting the headline comparisons against IG, SmoothGrad, GradCAM, and VanillaGrad.
Authors: We fully agree on the importance of these details for reproducibility and soundness. The splits follow the original XAI-TRIS and CLEVR-XAI protocols with no train-test leakage. We used 5 independent runs (different seeds for training and AGOP accumulation), reporting mean mIoU with standard deviations; significance was evaluated with paired t-tests (p<0.01 for key gains). The scaling factor uses max(diag(M)) computed solely on the training distribution, with no post-hoc tuning on test data. In the revised manuscript we will insert a dedicated 'Experimental Setup' subsection explicitly stating the splits, run count, statistical tests, and confirmation of no test-set tuning. revision: yes
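The paired comparison described in this response can be sketched with the standard t statistic over per-seed score differences (a stdlib-only sketch of the test's mechanics; the authors' actual analysis pipeline is not shown on this page, and the scores below are made up for illustration):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """t statistic for a paired comparison of per-run mIoU scores.

    scores_a, scores_b: mIoU values from the same seeds for two
    attribution methods, so each difference cancels seed-level noise.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean(d) / (s_d / sqrt(n)), with n - 1 degrees of freedom
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

With 5 runs (4 degrees of freedom), the two-sided 5% critical value is about 2.78, so a claimed p < 0.01 requires a substantially larger statistic.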
Circularity Check
No significant circularity: the AGOP prior is a precomputed input, and the performance claims rest on independent empirical comparisons.
Full rationale
The paper defines AGOP from the training distribution as a fixed prior (via training-time accumulation hook) and uses it to weight per-sample gradients or as a global saliency map. Attribution performance is measured via mIoU against pixel-level ground truth on held-out test sets from XAI-TRIS and CLEVR-XAI, with explicit comparisons to IG, SmoothGrad, GradCAM and VanillaGrad. No derivation step reduces a claimed result to its own inputs by construction; the Neural Feature Ansatz is invoked as background rather than a load-bearing self-citation that forces the attribution gains. The weighting formula is an explicit design choice, not a fitted parameter renamed as prediction. Distribution-shift concerns affect correctness but do not create circularity in the reported metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption (Neural Feature Ansatz): weight Gram matrices at each layer align with the corresponding AGOP matrices computed over the training distribution.
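In the formulation standard in the NFA literature (the notation below follows that literature, not a formula quoted from this page), for a network f with layer-ℓ input h_ℓ trained on distribution D, the assumption reads:

```latex
W_\ell^{\top} W_\ell \;\propto\; \mathrm{AGOP}_\ell
  \;=\; \mathbb{E}_{x \sim \mathcal{D}_{\mathrm{train}}}
        \!\left[ \nabla_{h_\ell} f(x)\, \nabla_{h_\ell} f(x)^{\top} \right],
```

where W_ℓ is the layer-ℓ weight matrix. The attribution methods use only the input-layer case, and only the diagonal of the expectation.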
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: AGOP-Weighted multiplies the per-sample gradient by √(diag(M)/max diag(M)), a training-distribution prior.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] M. Sundararajan, A. Taly, Q. Yan, Axiomatic attribution for deep networks, in: Proceedings of ICML 2017, 2017, pp. 3319–3328.
- [2] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: Proceedings of ICCV 2017, 2017, pp. 618–626.
- [3] D. Smilkov, N. Thorat, B. Kim, F. Viégas, M. Wattenberg, SmoothGrad: Removing noise by adding noise, in: ICML 2017 Workshop on Visualization for Deep Learning, 2017. arXiv:1706.03825.
- [4] A. Radhakrishnan, D. Beaglehole, P. Pandit, M. Belkin, Mechanism for feature learning in neural networks and backpropagation-free machine learning models, Science 383 (2024) 1461–1467.
- [5] D. Beaglehole, A. Radhakrishnan, P. Pandit, M. Belkin, Mechanism of feature learning in convolutional neural networks, arXiv preprint arXiv:2309.00570 (2024).
- [6]
- [7]
- [8] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, in: ICLR 2014 Workshop on Learning Representations, 2014.
- [9] A. Chattopadhyay, A. Sarkar, P. Howlader, V. N. Balasubramanian, Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks, in: Proceedings of WACV 2018, 2018, pp. 839–847.
- [10] A. Radhakrishnan, et al., xRFM: Accurate, scalable, and interpretable feature learning models for tabular data, in: Workshop on AI for Time Series and Dynamic Data (AITD) at NeurIPS 2025, 2025. arXiv:2508.10053.
- [11] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of ACM KDD 2016, 2016, pp. 1135–1144.
- [12] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, in: Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 2017, pp. 4765–4774.
- [13] C. Agarwal, S. Krishna, E. Saxena, M. Pawelczyk, N. Johnson, I. Puri, M. Zitnik, H. Lakkaraju, OpenXAI: Towards a transparent evaluation of post hoc model explanations, in: Advances in Neural Information Processing Systems 35 (NeurIPS 2022), Curran Associates, Inc., 2022.
- [14]
- [15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of CVPR 2016, 2016, pp. 770–778.