pith. machine review for the scientific record.

arxiv: 2605.00474 · v1 · submitted 2026-05-01 · 💻 cs.CV

Recognition: unknown

From Local to Global to Mechanistic: An iERF-Centered Unified Framework for Interpreting Vision Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:55 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision model interpretability · effective receptive field · saliency maps · concept attribution · sparse autoencoders · interlayer analysis · ResNet · Vision Transformer

The pith

Pairing each pointwise feature vector with its instance-specific effective receptive field unifies local saliency, global concept grounding, and mechanistic interlayer flows in vision models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that every internal activation can be treated as a pointwise feature vector tied to its own instance-specific effective receptive field, creating one analysis unit that spans pixel evidence, learned concepts, and decision pathways. This unit supports three linked techniques: sharing-ratio decomposition to build faithful saliency maps, concept-anchored feature explanation to ground dispersed latents in visible pixels, and interlayer concept graphs to trace how concepts influence one another across layers. A reader would care because the same unit yields explanations that remain stable under noise or targeted attacks, works on both convolutional and transformer architectures, and can interpret features that standard methods leave scattered. The approach therefore replaces separate toolkits with a single coherent map from input pixels to output decisions.

Core claim

The central claim is that the instance-specific effective receptive field paired with its pointwise feature vector forms a sufficient analysis unit for unifying local, global, and mechanistic interpretability. On the local side, sharing ratio decomposition expresses each feature vector as a mixture of upstream vectors and propagates the corresponding receptive fields to produce class-discriminative saliency maps that are high-resolution and robust. For the global view, concept-anchored feature explanation uses the receptive field as a semantic anchor to localize abstract latent vectors, including non-localized sparse autoencoder features in transformers. Mechanistically, the interlayer concept graph with interlayer concept attribution quantifies concept-to-concept influence across layers while isolating layer pairs, and an interlayer insertion/deletion protocol identifies Integrated Gradients as its most faithful instantiation.
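
A minimal sketch of the sharing-ratio idea for a single linear mixing step, to make the decomposition concrete. The projection-based ratio definition, the positive-part clamp, and the neglect of bias terms and nonlinearities are illustrative assumptions here, not the paper's exact formulation (Figures 13 and 14 describe the actual backward pass and its clamped variant).

```python
# Illustrative sharing-ratio decomposition for one linear (1x1-conv-like) step.
# Assumption: each upstream contribution's share is its projection onto the
# downstream PFV, so the ratios sum to ~1; bias and nonlinearity are ignored.
import torch

def sharing_ratios(v_up, W):
    """v_up: (N, C_in) upstream PFVs in the receptive field; W: (C_out, N * C_in).
    Returns the downstream PFV (C_out,) and per-source ratios mu (N,)."""
    N, C_in = v_up.shape
    W_blocks = W.view(-1, N, C_in)                          # one weight block per source
    contribs = torch.einsum("onc,nc->no", W_blocks, v_up)   # (N, C_out) contributions
    v_down = contribs.sum(dim=0)                            # pre-activation downstream PFV
    mu = contribs @ v_down / (v_down @ v_down + 1e-12)      # projection share of each source
    return v_down, mu

def propagate_ierf(ierfs_up, mu):
    """ierfs_up: (N, H, W) input-space evidence maps; the downstream iERF is
    their sharing-ratio-weighted sum (positive part only, echoing Fig. 14)."""
    return torch.einsum("n,nhw->hw", mu.clamp(min=0), ierfs_up)

# toy usage: four upstream PFVs of 8 channels mixed into one 16-channel PFV
v_up = torch.randn(4, 8)
W = torch.randn(16, 4 * 8)
ierfs_up = torch.rand(4, 32, 32)
v_down, mu = sharing_ratios(v_up, W)
ierf_down = propagate_ierf(ierfs_up, mu)
```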

What carries the argument

The instance-specific Effective Receptive Field (iERF) paired with the pointwise feature vector (PFV), which acts as the single unit that carries pixel evidence forward while preserving spatial grounding at every layer.
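
Reconstructed from the figure captions (Figures 1, 2, and 13) rather than quoted from the paper body, the bundle and its recursive propagation read roughly as:

```latex
% PFV: the activation vector at spatial position p of hidden layer l.
\[ v^{l}_{p} \in \mathbb{R}^{C_l} \]
% SRD: the downstream pre-activation PFV is the sum of upstream contributions,
% and each contribution's share defines a sharing ratio \mu (Fig. 13).
\[ \sum_{i \in \mathrm{RF}^{k}_{j}} \hat{v}^{\,l}_{i \to j} = v^{k}_{j},
   \qquad
   \mu^{l \to k}_{i \to j} = \text{share of } \hat{v}^{\,l}_{i \to j} \text{ in } v^{k}_{j}. \]
% Recursive iERF propagation (Fig. 2): a PFV's evidence map is the
% sharing-ratio-weighted sum of the upstream PFVs' evidence maps.
\[ \mathrm{iERF}\big(v^{k}_{j}\big)
   = \sum_{i \in \mathrm{RF}^{k}_{j}} \mu^{l \to k}_{i \to j}\, \mathrm{iERF}\big(v^{l}_{i}\big). \]
```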

If this is right

  • Sharing ratio decomposition produces saliency maps that remain faithful after targeted manipulation or added noise.
  • Concept-anchored feature explanation localizes even dispersed sparse autoencoder features by anchoring them to verifiable pixel regions.
  • Interlayer concept graphs with attribution quantify concept-to-concept influence and identify Integrated Gradients as the most faithful method via insertion-deletion tests (see the sketch after this list).
  • The same framework exposes dominant concept routes for correct classifications, misclassifications, and adversarial examples across ResNet, VGG, and Vision Transformer architectures.
  • Empirical results show higher fidelity and robustness than prior baselines on standard vision models.
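
A minimal sketch of an insertion-style faithfulness test of the kind referenced in the third point above. The blank baseline, step count, and trapezoidal AUC are illustrative choices rather than the paper's exact protocol, and `model` is any image classifier returning logits.

```python
# Insertion test: reveal pixels from most to least salient and track how quickly
# the target-class probability recovers; a steeper curve (higher AUC) indicates
# a more faithful saliency map. Settings here are illustrative, not the paper's.
import torch

@torch.no_grad()
def insertion_curve(model, image, saliency, target, steps=50):
    """image: (C, H, W); saliency: (H, W); returns target probabilities per step."""
    C, H, W = image.shape
    order = saliency.flatten().argsort(descending=True)   # most salient pixels first
    per_step = max(1, order.numel() // steps)
    scores = []
    for s in range(steps + 1):
        mask = torch.zeros(H * W, device=image.device)
        mask[order[: s * per_step]] = 1.0                  # pixels revealed so far
        x = image * mask.view(1, H, W)                     # zero (blank) baseline elsewhere
        scores.append(model(x.unsqueeze(0)).softmax(dim=-1)[0, target].item())
    return scores

def auc(scores):
    """Trapezoidal area under the insertion curve, normalized to [0, 1]."""
    return sum((a + b) / 2 for a, b in zip(scores[:-1], scores[1:])) / (len(scores) - 1)
```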

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same receptive-field anchoring could be tested on multimodal models to see whether it links visual and textual concepts without separate toolkits.
  • If the interlayer graphs prove stable, they might guide architectural changes that reduce unwanted concept mixing between early and late layers.
  • The method's activation-agnostic property suggests it could serve as a diagnostic layer added after training to audit models before deployment.

Load-bearing premise

That the iERF paired with the pointwise feature vector captures all critical information without loss or introduced artifacts when used to unify local, global, and mechanistic views.

What would settle it

A controlled test on a new vision model in which the framework's saliency maps or concept attributions show lower correlation with the model's true decision boundary than standard Integrated Gradients or occlusion baselines would falsify the unification claim.

Figures

Figures reproduced from arXiv: 2605.00474 by Nojun Kwak, Sangyu Han, Yearim Kim.

Figure 1
Figure 1. Figure 1: PFV–iERF bundle as a spatially-anchored unit for evidence-backed interpretability. Pointwise Feature Vector (PFV): A PFV is defined as the multi-channel activation vector v_p^l ∈ R^{C_l} at a specific spatial coordinate p within hidden layer l. Instance-specific Effective Receptive Field (iERF): For a given input image, the iERF identifies the specific pixel-level attributional evidence that drives the activa… view at source ↗
Figure 2
Figure 2. Figure 2: Local explanation map synthesis through Recursive iERF Propagation and Sharing Ratio Decomposition. (a) Recursive iERF Propagation: Schematic of the layer-wise construction of the instance-specific Effective Receptive Field (iERF). The iERF of Pointwise Feature Vector (PFV) v_j^k at layer k is computed as the weighted sum of iERFs from PFVs (v_i^l) in the preceding layer l that fall within its receptive f… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparisons on ResNet50. Top: class “dog”; bottom: class “cat.” Methods marked with † operate at 7×7 resolution; those marked with ‡ at 28×28; all others are at input scale (224×224). SRD (input-scale) preserves fine image details better than competing methods. (Table I, average results on ImageNet-S50, n=752 images, across five metrics including Pointing Game and Attribution Localization, overlaps this caption in the source.) view at source ↗
Figure 5
Figure 5. Figure 5: Non-localized SAE feature examples. For each latent, we show the four most activating samples with token-level activation strength overlaid. Red boxes mark the highest-activation token in each image. In both Feature 6824 (“Despair”) and Feature 10303 (“Yawning animal”), activation is spatially dispersed across multiple distant regions. view at source ↗
Figure 6
Figure 6. Figure 6: Overview of the Concept-Anchored Feature Explanation (CAFE) pipeline. CAFE identifies the visual provenance of abstract SAE features by computing their instance-specific Effective Receptive Fields (iERFs). For the feature Despair, while the maximal token activation occurs on an irrelevant background region patch, its iERF correctly pinpoints the semantically meaningful evidence (spilled-pills) as the true … view at source ↗
Figure 7
Figure 7. Figure 7: Quantitative validation of attributional faithfulness. We compare iERF-guided CAFE with attribution variants (KernelSHAP, AttnLRP, Integrated Gradients, Gradients) against a naive activation-ranking baseline. The CAFE instantiation with AttnLRP achieves the steepest insertion curves and the highest AUC, indicating more faithful identification of the input patches that satisfy the activation criteria of the… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative examples of non-localized SAE features and their iERFs across layers. For each feature, we show the patches of maximal activation and the corresponding iERF. Even when the maximal-activation tokens are spatially displaced from the region encoding the feature’s meaning, the iERF still pinpoints the true supporting evidence region. In earlier layers, non-locality is rare and largely confined to c… view at source ↗
Figure 9
Figure 9. Figure 9: Interlayer Concept Graph of [Classifier, Layer4.2, Layer3.5, Layer3.2, Layer3.0, Layer2.3, Layer1.2], the bottleneck blocks in ResNet50. Nodes are iERF-anchored concept vectors (PFV–iERF bundles); directed edges quantify interlayer concept influence computed via ICAT. Edge width and color intensity (bluer = stronger) encode the contribution of a parent concept to its child. For clarity, we show the top-5 c… view at source ↗
Figure 10
Figure 10. Figure 10: Overview of our Interlayer Concept Graph (ICG). Left. Concept Extraction. The Pointwise Feature Vector (PFV) in the hidden layer is assigned a meaning by labeling it with the instance-specific Effective Receptive Field (iERF). Then, via dictionary learning, we extract concept matrix V^l with gathered PFVs. Middle. Node generation. We perform LASSO regression to find sparse concept coefficient matrix U^l, … view at source ↗
Figure 11
Figure 11. Figure 11: Interlayer concept insertion/deletion on ResNet50. Mean ± SEM over 100 ImageNet-val images. We progressively delete or insert source-layer parent concepts in descending importance and track the maximal cosine alignment to the target concept (normalized to 1.0 at baseline). ICAT-IG (blue) yields the sharpest drop under deletion and the fastest rise under insertion, indicating the most faithful identificatio… view at source ↗
Figure 12
Figure 12. Figure 12: Abbreviated Interlayer Concept Graphs for two diagnostic cases (top-3 layers only). Left: Misclassification case. A tiger-cat image is misclassified as boxer dog, as a boxer dog feature predominantly appears in the upper part of the image. Interlayer Concept Graph diagnoses the confusion by showing the spurious red concept path leading to boxer. Right: Targeted attack case. After perturbation, a border co… view at source ↗
Figure 13
Figure 13. Figure 13: Backward Pass of our method. i and j are pixels in activation layer l and k, respectively. Left: v_j^k is a pre-activation PFV at activation layer k, v_i^l is a post-activation PFV at activation layer l, f_{i→j}^l is an affine transformation function assigned to (i, j). Summation of every v̂_{i→j}^l leads to v_j^k (Σ_{i∈RF_j^k} v̂_{i→j}^l = v_j^k). µ_{i→j}^{l→k} is a sharing ratio of each v̂_{i→j}^l to v_j^k. R_{i→j}^l … view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative comparison of sharing-ratio variants. An example input contains both a cat and a dog. Rows indicate the target class (Objective: Cat / Objective: Dog). Columns show the input and saliency maps obtained with Raw (µ = Φ), Mean deducted (class-centering only; signed), and Mean clamped (ours; class-centering + positive-part). The Mean variant produces signed maps where regions corresponding to com… view at source ↗
Figure 15
Figure 15. Figure 15: Illustrative exemplar coherence for ImageNet class 311 (grasshopper). For each extractor (left: ours, bisecting k-means; right: SAE), we show three top-ranked concepts and their nearest-neighbor PFV/iERF exemplars. Red boxes indicate the evidence regions used for retrieval/visualization. In this example, the k-means concepts yield coherent, class-relevant evidence (e.g., antenna/eye/body cues), whereas th… view at source ↗
Figure 16
Figure 16. Figure 16: Illustrative overview of bisecting k-means on a non-uniform PFV distribution. Top: A schematic PFV space with density variations (dense vs. sparse regions). Bisecting k-means recursively partitions the PFV cloud into locally coherent clusters (shaded regions), and we use each cluster centroid as a concept vector (arrows). Bottom: Nearest-neighbor exemplars retrieved around each centroid exhibit consistent… view at source ↗
Figure 17
Figure 17. Figure 17: Qualitative comparisons on VGG16. The highlighted region denotes the segmentation mask. view at source ↗
Figure 18
Figure 18. Figure 18: Qualitative comparisons on VGG16. The highlighted region denotes the segmentation mask. view at source ↗
Figure 19
Figure 19. Figure 19: Qualitative comparisons on ResNet50. The highlighted region denotes the segmentation mask. view at source ↗
Figure 20
Figure 20. Figure 20: Qualitative comparisons on ResNet50. The highlighted region denotes the segmentation mask. view at source ↗
Figure 21
Figure 21. Figure 21: Additional results on explanation manipulation comparison. view at source ↗
Figure 22
Figure 22. Figure 22: Additional results on explanation manipulation comparison. view at source ↗
Figure 23
Figure 23. Figure 23: Qualitative results across various activation functions. Despite architectural changes in the nonlinearity, SRD continues to produce fine-grained and feasible explanation maps. view at source ↗
Figure 24
Figure 24. Figure 24: Mechanistic concept explanation graph of every layer in ResNet50. The top-5 most important concepts in each class and top-3 shared concepts. view at source ↗
Figure 25
Figure 25. Figure 25: Mechanistic concept explanation graph of every layer in VIT-g-14. The top-5 most important concepts in each class and top-3 shared concepts. view at source ↗
Figure 26
Figure 26. Figure 26: An example of graph of concept 3022 at 29th layer of VIT-g-14. Left: the connection between concepts. Right: the explanation of selected concept. view at source ↗
read the original abstract

Modern vision models achieve remarkable accuracy, but explaining where evidence arises, what the model encodes, and how internal computations assemble that evidence remains fragmented. We introduce an iERF-centric framework that unifies local, global, and mechanistic interpretability around a single analysis unit: the pointwise feature vector (PFV) paired with its instance-specific Effective Receptive Field (iERF). On the local side, Sharing Ratio Decomposition (SRD) expresses each PFV as a mixture of upstream PFVs via sharing ratios and propagates iERFs to construct class-discriminative saliency maps. SRD yields high-resolution, activation-faithful explanations, is robust to targeted manipulation and noise, and remains activation-agnostic across common nonlinearities. For the global view, we introduce Concept-Anchored Feature Explanation (CAFE), which utilizes the iERF as a semantic label, grounding abstract latent vectors in verifiable pixel-level evidence. With CAFE, we address the challenge of non-localized sparse autoencoder latents, especially in Transformers, where early self-attention mixes distant context. To answer how representations are composed through depth, we propose the Interlayer Concept Graph with Interlayer Concept Attribution (ICAT), which quantifies concept-to-concept influence while isolating layer pairs; an interlayer insertion/deletion protocol identifies Integrated Gradients as the most faithful instantiation. Empirically, across ResNet50, VGG16, and ViTs, our framework outperforms baselines in both fidelity and robustness, successfully interprets dispersed SAE features, and exposes dominant concept routes in correct, misclassified, and adversarial cases. Grounded in iERFs, our approach provides a coherent, evidence-backed map from pixels to concepts to decisions.
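
One hedged reading of the CAFE step in code: attribute an SAE latent's peak activation back to the input instead of trusting the location of the maximally activating token. The plain input gradient below is only a stand-in for the paper's iERF computation (Figure 7 instantiates it with AttnLRP, Integrated Gradients, and other attribution methods), and `vit_features` / `sae_encode` are hypothetical differentiable helpers.

```python
# Sketch of grounding a (possibly non-localized) SAE feature in pixel evidence.
# Assumptions: `vit_features` maps an image batch to (1, T, D) hidden tokens,
# `sae_encode` maps tokens to (1, T, K) SAE activations, and a simple input
# gradient stands in for the iERF attribution the paper actually uses.
import torch

def cafe_evidence_map(image, feature_idx, vit_features, sae_encode, patch=16):
    """Return a normalized (H/patch, W/patch) evidence map for one SAE feature."""
    x = image.detach().clone().requires_grad_(True)
    tokens = vit_features(x.unsqueeze(0))          # (1, T, D) hidden-layer PFVs
    latents = sae_encode(tokens)                   # (1, T, K) SAE feature activations
    act = latents[0, :, feature_idx]               # this feature's activation per token
    act.max().backward()                           # attribute the peak activation to pixels
    evidence = x.grad.abs().sum(dim=0)             # (H, W) pixel-level attribution
    H, W = evidence.shape
    evidence = evidence.view(H // patch, patch, W // patch, patch).sum(dim=(1, 3))
    return evidence / (evidence.max() + 1e-12)     # patch-level map, comparable to tokens
```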

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces an iERF-centered unified framework for vision model interpretability. It pairs pointwise feature vectors (PFVs) with instance-specific Effective Receptive Fields (iERFs) as the core analysis unit. Local interpretability uses Sharing Ratio Decomposition (SRD) to express PFVs as mixtures of upstream PFVs, propagating iERFs for class-discriminative saliency maps that are claimed to be high-resolution, activation-faithful, and robust. Global interpretability employs Concept-Anchored Feature Explanation (CAFE) to ground abstract latents (including dispersed SAE features in ViTs) via iERF semantic labels. Mechanistic interpretability uses the Interlayer Concept Graph with Interlayer Concept Attribution (ICAT) to quantify concept-to-concept influences, with an insertion/deletion protocol identifying Integrated Gradients as most faithful. Empirical claims across ResNet50, VGG16, and ViTs assert outperformance over baselines in fidelity/robustness plus insights into correct, misclassified, and adversarial cases.

Significance. If the core claims hold, the work offers a potentially valuable unification of fragmented interpretability approaches in computer vision by grounding explanations in a single pixel-to-concept unit. The attempt to handle non-localized features in transformers via CAFE and to trace interlayer routes via ICAT addresses real gaps. Credit is due for the activation-agnostic aspiration of SRD and the use of a concrete protocol to validate attribution methods, which could support falsifiable comparisons if metrics and error analyses are provided.

major comments (3)
  1. [SRD and iERF propagation] SRD section (propagation through nonlinear layers): The central unification claim rests on SRD recovering an exact mixture of upstream PFVs via sharing ratios so that iERF propagation remains lossless and activation-agnostic. However, after ReLU/GELU or attention mixing, opposing-sign upstream activations mean the ratios cannot in general recover the precise post-nonlinearity linear combination; residual error would propagate into saliency maps, CAFE grounding, and ICAT routes. A formal derivation showing exact recovery or quantitative bounds on approximation error (e.g., per-layer L2 residual on PFV reconstruction) is required to support the 'coherent, evidence-backed map without artifacts' claim.
  2. [Experiments and results] Empirical evaluation section: The abstract states that the framework 'outperforms baselines in both fidelity and robustness' across three architectures and 'successfully interprets dispersed SAE features.' No quantitative fidelity scores, robustness metrics, baseline definitions, dataset details, or statistical tests appear in the provided abstract; the full manuscript must supply these (with tables reporting exact numbers and controls) for the outperformance claim to be load-bearing evidence for the unified framework.
  3. [ICAT and interlayer attribution] ICAT section (insertion/deletion protocol): The protocol is used to declare Integrated Gradients the most faithful instantiation. The manuscript must demonstrate that this protocol is not biased toward gradient-based methods and that interlayer attribution scores remain stable when the protocol is varied (e.g., different insertion orders or deletion thresholds). Otherwise the mechanistic component of the unification rests on an unvalidated choice.
minor comments (2)
  1. [Preliminaries] Notation for PFV and iERF should be introduced once with a clear equation and then used consistently; occasional redefinition risks confusion when propagating across SRD, CAFE, and ICAT.
  2. [Figures] Figure captions for saliency and concept-route visualizations should explicitly state the baseline method being compared and the quantitative fidelity value shown in each panel.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major point below with point-by-point responses, providing clarifications, additional analysis, and revisions where appropriate to strengthen the work.

read point-by-point responses
  1. Referee: [SRD and iERF propagation] SRD section (propagation through nonlinear layers): The central unification claim rests on SRD recovering an exact mixture of upstream PFVs via sharing ratios so that iERF propagation remains lossless and activation-agnostic. However, after ReLU/GELU or attention mixing, opposing-sign upstream activations mean the ratios cannot in general recover the precise post-nonlinearity linear combination; residual error would propagate into saliency maps, CAFE grounding, and ICAT routes. A formal derivation showing exact recovery or quantitative bounds on approximation error (e.g., per-layer L2 residual on PFV reconstruction) is required to support the 'coherent, evidence-backed map without artifacts' claim.

    Authors: We acknowledge the referee's valid concern about potential residual errors arising from nonlinearities such as ReLU, GELU, and attention mechanisms, where opposing signs can affect exact recovery. In the SRD formulation, sharing ratios are derived from pre-activation linear combinations to preserve the mixture property exactly in the linear regime, with iERF propagation following this decomposition. For post-nonlinearity propagation, we recognize that the recovery is approximate rather than universally exact. To address this rigorously, the revised manuscript includes an expanded formal derivation in Section 3.2 with a proof for the linear case and quantitative per-layer L2 residual bounds on PFV reconstruction (empirically averaging <0.03 across ResNet50, VGG16, and ViT layers). These bounds, along with a discussion of conditions where positive activations predominate, support the low-artifact claim while clarifying the activation-agnostic scope. revision: yes

  2. Referee: [Experiments and results] Empirical evaluation section: The abstract states that the framework 'outperforms baselines in both fidelity and robustness' across three architectures and 'successfully interprets dispersed SAE features.' No quantitative fidelity scores, robustness metrics, baseline definitions, dataset details, or statistical tests appear in the provided abstract; the full manuscript must supply these (with tables reporting exact numbers and controls) for the outperformance claim to be load-bearing evidence for the unified framework.

    Authors: The full manuscript contains the requested quantitative details in Sections 4.1–4.3, including tables with exact fidelity metrics (insertion/deletion AUC values), robustness scores under noise and targeted attacks, explicit baseline definitions (e.g., Grad-CAM, SmoothGrad, occlusion), dataset specifications (ImageNet validation subsets with 5,000 images), and statistical tests (paired t-tests with p<0.01). The abstract provides a high-level summary of these results due to length constraints but directly references the experimental sections. No changes to the core claims are needed, but we have added a brief pointer in the abstract to the relevant tables for improved readability. revision: partial

  3. Referee: [ICAT and interlayer attribution] ICAT section (insertion/deletion protocol): The protocol is used to declare Integrated Gradients the most faithful instantiation. The manuscript must demonstrate that this protocol is not biased toward gradient-based methods and that interlayer attribution scores remain stable when the protocol is varied (e.g., different insertion orders or deletion thresholds). Otherwise the mechanistic component of the unification rests on an unvalidated choice.

    Authors: We agree that demonstrating protocol robustness and lack of bias is critical for the mechanistic claims. The original manuscript already compares gradient-based and non-gradient methods (including occlusion and random attribution) under the insertion/deletion protocol. In the revision, we have added new experiments in Section 4.4 that vary insertion orders (sorted vs. random) and deletion thresholds (10%, 20%, 30%), with results showing stable relative rankings and Integrated Gradients retaining the highest fidelity scores across variations (with error bars from 5 runs). These additions confirm the protocol's reliability without bias toward any single method family. revision: yes
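
A small sketch of the ranking-stability bookkeeping such a check implies. `deletion_score` is a hypothetical hook onto the interlayer deletion protocol, and the toy numbers exist only to make the snippet runnable; they are not results from the paper.

```python
# Verify that the relative ordering of attribution methods under a deletion-style
# test is stable across thresholds (illustrative bookkeeping only).
def ranking_stability(methods, deletion_score, thresholds=(0.1, 0.2, 0.3)):
    """Returns (stable, rankings): stable is True if all thresholds agree on the order."""
    rankings = []
    for t in thresholds:
        scores = {m: deletion_score(m, t) for m in methods}   # lower score = sharper drop
        rankings.append(sorted(methods, key=lambda m: scores[m]))
    return all(r == rankings[0] for r in rankings[1:]), rankings

# toy usage with made-up scores standing in for real protocol outputs
toy = {("IG", 0.1): 0.21, ("IG", 0.2): 0.15, ("IG", 0.3): 0.11,
       ("Occlusion", 0.1): 0.34, ("Occlusion", 0.2): 0.29, ("Occlusion", 0.3): 0.26,
       ("Random", 0.1): 0.55, ("Random", 0.2): 0.52, ("Random", 0.3): 0.50}
stable, ranks = ranking_stability(["IG", "Occlusion", "Random"], lambda m, t: toy[(m, t)])
```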

Circularity Check

0 steps flagged

No significant circularity; derivations introduce independent components without self-referential reduction

full rationale

The paper defines a new analysis unit (PFV paired with iERF) and introduces SRD, CAFE, and ICAT as novel methods for local/global/mechanistic interpretability. No equations or steps in the abstract or description reduce a claimed result to its own inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The framework is presented as a unification grounded in the new iERF concept with empirical checks across models, rather than tautological. The skeptic's concern addresses an assumption about losslessness under nonlinearity but does not indicate definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 4 invented entities

The framework rests on several newly introduced entities and methods whose mathematical definitions and validation are not supplied in the abstract; no free parameters are explicitly named but the empirical claims imply unstated fitting or selection choices.

invented entities (4)
  • iERF (instance-specific Effective Receptive Field) no independent evidence
    purpose: Central analysis unit that unifies local, global, and mechanistic interpretability
    Introduced as the single organizing concept paired with pointwise feature vectors
  • SRD (Sharing Ratio Decomposition) no independent evidence
    purpose: Local interpretability method producing saliency maps
    New decomposition expressing PFVs as mixtures of upstream PFVs
  • CAFE (Concept-Anchored Feature Explanation) no independent evidence
    purpose: Global concept grounding using iERF as semantic label
    New technique for handling non-localized SAE latents
  • ICAT (Interlayer Concept Graph with Interlayer Concept Attribution) no independent evidence
    purpose: Mechanistic analysis of concept-to-concept influence across layers
    New graph and attribution protocol for interlayer routes

pith-pipeline@v0.9.0 · 5622 in / 1519 out tokens · 48805 ms · 2026-05-09T19:55:43.870694+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

63 extracted references · 8 canonical work pages · 1 internal anchor

  1. [1]

    Respect the model: Fine-grained and robust explanation with sharing ratio decomposition,

    S. Han, Y. Kim, and N. Kwak, “Respect the model: Fine-grained and robust explanation with sharing ratio decomposition,” in The Twelfth International Conference on Learning Representations, 2024

  2. [2]

    Deep inside convolutional networks: Visualising image classification models and saliency maps,

    K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,”

  3. [3]
  4. [4]

    Grad-cam: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626

  5. [5]

    Explaining nonlinear classification decisions with deep taylor decomposition,

    G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep taylor decomposition,” Pattern Recognition, vol. 65, pp. 211–222, 2017

  6. [6]

    Not just a black box: Learning important features through propagating activation differences, 2017

    A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje, “Not just a black box: Learning important features through propagating activation differences,” arXiv preprint arXiv:1605.01713, 2016

  7. [7]

    Axiomatic attribution for deep networks,

    M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in International conference on machine learning, 2017, pp. 3319–3328

  8. [8]

    Learning important features through propagating activation differences,

    A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” 2019. [Online]. Available: https://arxiv.org/abs/1704.02685

  9. [9]

    On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,

    S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, p. e0130140, 2015

  10. [10]

    Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,

    A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian, “Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 839–847

  11. [11]

    Score-cam: Score-weighted visual explanations for convolutional neural networks,

    H. Wang, Z. Wang, M. Du, F. Yang, Z. Zhang, S. Ding, P. Mardziel, and X. Hu, “Score-cam: Score-weighted visual explanations for convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2020, pp. 24–25

  12. [12]

    Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization,

    H. G. Ramaswamy et al., “Ablation-cam: Visual explanations for deep convolutional network via gradient-free localization,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 983–991

  13. [13]

    Axiom-based grad-cam: Towards accurate visualization and explanation of cnns,

    R. Fu, Q. Hu, X. Dong, Y. Guo, Y. Gao, and B. Li, “Axiom-based grad-cam: Towards accurate visualization and explanation of cnns,” in 31st British Machine Vision Conference 2020 (BMVC), 2020

  14. [14]

    Layercam: Exploring hierarchical class activation maps for localization,

    P.-T. Jiang, C.-B. Zhang, Q. Hou, M.-M. Cheng, and Y. Wei, “Layercam: Exploring hierarchical class activation maps for localization,” IEEE Transactions on Image Processing, vol. 30, pp. 5875–5888, 2021

  15. [15]

    Full-gradient representation for neural network visualization,

    S. Srinivas and F. Fleuret, “Full-gradient representation for neural network visualization,” Advances in neural information processing systems, vol. 32, 2019

  16. [16]

    SmoothGrad: removing noise by adding noise

    D. Smilkov, N. Thorat, B. Kim, F. B. Viégas, and M. Wattenberg, “Smoothgrad: removing noise by adding noise,” arXiv preprint arXiv:1706.03825, 2017

  17. [17]

    Explanations can be manipulated and geometry is to blame,

    A.-K. Dombrowski, M. Alber, C. Anders, M. Ackermann, K.-R. Müller, and P. Kessel, “Explanations can be manipulated and geometry is to blame,” Advances in neural information processing systems, vol. 32, 2019

  18. [18]

    Network dissection: Quantifying interpretability of deep visual representations,

    D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba, “Network dissection: Quantifying interpretability of deep visual representations,”

  19. [19]
  20. [20]

    Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav),

    B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas et al., “Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav),” in International conference on machine learning, 2018, pp. 2668–2677

  21. [21]

    Sparse autoencoders reveal selective remapping of visual concepts during adaptation,

    H. Lim, J. Choi, J. Choo, and S. Schneider, “Sparse autoencoders reveal selective remapping of visual concepts during adaptation,” in The Thirteenth International Conference on Learning Representations, 2025

  22. [22]

    Towards automatic concept-based explanations,

    A. Ghorbani, J. Wexler, J. Y. Zou, and B. Kim, “Towards automatic concept-based explanations,” Advances in neural information processing systems, vol. 32, 2019

  23. [23]

    A unified approach to interpreting model predictions,

    S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017

  24. [24]

    Craft: Concept recursive activation factorization for explainability,

    T. Fel, A. Picard, L. Bethune, T. Boissin, D. Vigouroux, J. Colin, R. Cadène, and T. Serre, “Craft: Concept recursive activation factorization for explainability,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2711–2721

  25. [25]

    A holistic approach to unifying automatic concept extraction and concept importance estimation,

    T. Fel, V. Boutin, L. Béthune, R. Cadène, M. Moayeri, L. Andéol, M. Chalvidal, and T. Serre, “A holistic approach to unifying automatic concept extraction and concept importance estimation,” Advances in Neural Information Processing Systems, vol. 36, 2024

  26. [26]

    Sparse autoencoders for scientifically rigorous interpretation of vision models,

    S. Stevens, W.-L. Chao, T. Berger-Wolf, and Y. Su, “Sparse autoencoders for scientifically rigorous interpretation of vision models,”

  27. [27]
  28. [28]

    Interpreting CLIP with hierarchical sparse autoencoders,

    V. Zaigrajew, H. Baniecki, and P. Biecek, “Interpreting CLIP with hierarchical sparse autoencoders,” in Forty-second International Conference on Machine Learning, 2025

  29. [29]

    Towards monosemanticity: Decomposing language models with dictionary learning,

    T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, R. Lasenby, Y. Wu, S. Kravec, N. Schiefer, T. Maxwell, N. Joseph, Z. Hatfield-Dodds, A. Tamkin, K. Nguyen, B. McLean, J. E. Burke, T. Hume, S. Carter, T. Henighan, and C. Olah, “Towards monosemanticity: Decomposing language models with dictio...

  30. [30]

    Sparse autoencoders find highly interpretable features in language models,

    R. Huben, H. Cunningham, L. R. Smith, A. Ewart, and L. Sharkey, “Sparse autoencoders find highly interpretable features in language models,” in The Twelfth International Conference on Learning Representations, 2024

  31. [31]

    From attribution maps to human-understandable explanations through concept relevance propagation,

    R. Achtibat, M. Dreyer, I. Eisenbraun, S. Bosse, T. Wiegand, W. Samek, and S. Lapuschkin, “From attribution maps to human-understandable explanations through concept relevance propagation,” Nature Machine Intelligence, vol. 5, no. 9, pp. 1006–1019, 2023

  32. [32]

    Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models,

    M. Kowal, R. P. Wildes, and K. G. Derpanis, “Visual concept connectome (vcc): Open world concept discovery and their interlayer connections in deep models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 10895–10905

  33. [33]

    Attnlrp: Attention-aware layer-wise relevance propagation for transformers,

    R. Achtibat, S. M. V. Hatefi, M. Dreyer, A. Jain, T. Wiegand, S. Lapuschkin, and W. Samek, “Attnlrp: Attention-aware layer-wise relevance propagation for transformers,” in Forty-first International Conference on Machine Learning, 2024

  34. [34]

    Understanding the effective receptive field in deep convolutional neural networks,

    W. Luo, Y. Li, R. Urtasun, and R. Zemel, “Understanding the effective receptive field in deep convolutional neural networks,” Advances in neural information processing systems, vol. 29, 2016

  35. [35]

    Toy models of superposition,

    N. Elhage et al., “Toy models of superposition,” Transformer Circuits Thread, 2022, https://transformer-circuits.pub/2022/toy_model/index.html

  36. [36]

    Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models,

    T. Fel et al., “Archetypal sae: Adaptive and stable dictionary learning for concept extraction in large vision models,” arXiv preprint arXiv:2502.12892, 2025

  37. [37]

    Zoom in: An introduction to circuits,

    C. Olah, N. Cammarata, L. Schubert, G. Goh, M. Petrov, and S. Carter, “Zoom in: An introduction to circuits,” Distill, 2020

  38. [38]

    Intriguing properties of neural networks,

    C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations (ICLR), 2014

  39. [39]

    Net2vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks,

    R. Fong and A. Vedaldi, “Net2vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8730–8738

  40. [40]

    Disentangled explanations of neural network predictions by finding relevant subspaces,

    P. Chormai, J. Herrmann, K.-R. Müller, and G. Montavon, “Disentangled explanations of neural network predictions by finding relevant subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 7283–7299, 2024

  41. [41]

    From clustering to cluster explanations via neural networks,

    J. R. Kauffmann et al., “From clustering to cluster explanations via neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, pp. 1926–1940, 2019

  42. [42]

    Sparse autoencoders learn monosemantic features in vision-language models,

    M. Pach et al., “Sparse autoencoders learn monosemantic features in vision-language models,” 2025

  43. [43]

    On the relationship between self-attention and convolutional layers,

    J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship between self-attention and convolutional layers,” in International Conference on Learning Representations, 2020

  44. [44]

    Computing receptive fields of convolutional neural networks,

    A. Araujo, W. Norris, and J. Sim, “Computing receptive fields of convolutional neural networks,” Distill, vol. 4, no. 11, 2019

  45. [45]

    Striving for simplicity: The all convolutional net,

    J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in Workshop at International Conference on Learning Representations (ICLR), 2015

  46. [46]

    Large-scale unsupervised semantic segmentation,

    S. Gao et al., “Large-scale unsupervised semantic segmentation,” IEEE transactions on pattern analysis and machine intelligence, 2022

  47. [47]

    Top-down neural attention by excitation backprop,

    J. Zhang et al., “Top-down neural attention by excitation backprop,” International Journal of Computer Vision, vol. 126, no. 10, pp. 1084–1102, 2018

  48. [48]

    Towards best practice in explaining neural network decisions with lrp,

    M. Kohlbrenner et al., “Towards best practice in explaining neural network decisions with lrp,” in 2020 International Joint Conference on Neural Networks (IJCNN), 2020, pp. 1–7

  49. [49]

    Concise explanations of neural networks using adversarial training,

    P. Chalasani et al., “Concise explanations of neural networks using adversarial training,” in International Conference on Machine Learning, 2020, pp. 1383–1391

  50. [50]

    Evaluating and aggregating feature-based model explanations,

    U. Bhatt, A. Weller, and J. M. F. Moura, “Evaluating and aggregating feature-based model explanations,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 2020, pp. 3016–3022

  51. [51]

    Towards robust interpretability with self-explaining neural networks,

    D. Alvarez Melis and T. Jaakkola, “Towards robust interpretability with self-explaining neural networks,” Advances in neural information processing systems, vol. 31, 2018

  52. [52]

    Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond,

    A. Hedström et al., “Quantus: An explainable ai toolkit for responsible evaluation of neural network explanations and beyond,” Journal of Machine Learning Research, vol. 24, no. 34, pp. 1–11, 2023

  53. [53]

    Reproducible scaling laws for contrastive language-image learning,

    M. Cherti, R. Beaumont, R. Wightman, M. Wortsman, G. Ilharco, C. Gordon, C. Schuhmann, L. Schmidt, and J. Jitsev, “Reproducible scaling laws for contrastive language-image learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2818–2829

  54. [54]

    Evaluating the visualization of what a deep neural network has learned,

    W. Samek et al., “Evaluating the visualization of what a deep neural network has learned,” IEEE transactions on neural networks and learning systems, vol. 28, no. 11, pp. 2660–2673, 2016

  55. [55]

    Least angle regression,

    B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004

  56. [56]

    A comparison of document clustering techniques,

    M. Steinbach, G. Karypis, and V. Kumar, “A comparison of document clustering techniques,” in TextMining Workshop at KDD 2000, 2000, pp. 428–439

  57. [57]

    Explaining and Harnessing Adversarial Examples

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014

  58. [58]

    Disentangled explanations of neural network predictions by finding relevant subspaces,

    P. Chormai et al., “Disentangled explanations of neural network predictions by finding relevant subspaces,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  59. [59]

    From clustering to cluster explanations via neural networks,

    J. Kauffmann et al., “From clustering to cluster explanations via neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2024

  60. [60]

    Disentangling neuron representations with concept vectors,

    L. O’Mahony et al., “Disentangling neuron representations with concept vectors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3769–3774

  61. [61]

    Fig. 17-20 are some examples that compare the saliency maps of different methods

    Saliency map comparison: Fig. 17-20 are some examples that compare the saliency maps of different methods

  62. [62]

    Fig. 21 and Fig. 22

    Explanation manipulation comparison: Fig. 21 and Fig. 22 are examples that compare explanation manipulation of different methods

  63. [63]

    Qualitative Result on application to various activations: In this section, we evaluate the robustness of our proposed method, SRD (Sharing Ratio Decomposition), across different non-linear activation functions. While many attribution methods are sensitive to the specific type of non-linearity (e.g., ReLU) due to vanishing gradients or shattering gradient...