From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks

Denia Kanellopoulou; Thodoris Lymperopoulos

arxiv: 2605.15328 · v1 · pith:6GXPO7TYnew · submitted 2026-05-14 · 💻 cs.LG


Pith reviewed 2026-05-19 16:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords feature attribution · neural network interpretability · weight perturbation · occlusion method · explainable AI · fully connected networks · model explanations


The pith

Perturbing weights attached to input features produces reliable attributions that avoid bias and out-of-distribution problems in occlusion methods for fully connected neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a new way to measure feature importance in fully connected neural networks by changing the weights linked to each feature rather than changing the feature values. This leads to two methods named XWP and XWP_c that aim to fix problems common in occlusion approaches, such as adding bias or creating unrealistic inputs. The authors show these methods perform at least as well as established techniques when tested on identifying key parts of images using standard evaluation measures. A reader would care because clearer and less flawed explanations could make it easier to trust what simple neural networks are doing.

Core claim

Applying perturbation to the features' attached weights instead of their values leads to novel attribution methods XWP and XWP_c that mitigate common limitations in Occlusion techniques such as Added Bias and Out-of-Distribution data and achieve competitive performance in identifying image signals for simple DNNs on standard baseline metrics.

What carries the argument

Weight perturbation for attribution, the process of measuring a feature's importance by altering the weights connected to that feature while keeping input values fixed.
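As a rough illustration of this idea (not the authors' XWP or XWP_c rule, which this page does not spell out), the sketch below scores each input feature of a tiny two-layer ReLU network by scaling the first-layer weights attached to that feature and recording the change in the target logit. The network, the function names `forward` and `weight_perturbation_scores`, and the `scale` parameter are all assumptions made for the example.

```python
import numpy as np


def forward(x, W1, b1, W2, b2):
    """Two-layer fully connected network with a ReLU hidden layer."""
    hidden = np.maximum(0.0, W1 @ x + b1)
    return W2 @ hidden + b2


def weight_perturbation_scores(x, W1, b1, W2, b2, target, scale=0.0):
    """Hypothetical weight-perturbation attribution (illustrative only).

    For each feature j, multiply column j of W1 (the weights attached to
    feature j) by `scale`, leave the input x untouched, and record how much
    the target logit drops.
    """
    base = forward(x, W1, b1, W2, b2)[target]
    scores = np.zeros(x.shape[0])
    for j in range(x.shape[0]):
        W1_pert = W1.copy()
        W1_pert[:, j] *= scale  # perturb the weights, not the input value
        scores[j] = base - forward(x, W1_pert, b1, W2, b2)[target]
    return scores


# Toy network with random parameters (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 6, 4, 3
W1 = rng.normal(size=(n_hidden, n_in))
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(n_out, n_hidden))
b2 = rng.normal(size=n_out)
x = rng.normal(size=n_in)

scores = weight_perturbation_scores(x, W1, b1, W2, b2, target=0)
print(scores.shape)  # (6,)
```

Calling the function with `scale=1.0` leaves the network unchanged, so every score should be zero, which makes a quick sanity check for the sketch.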

If this is right

XWP and XWP_c can generate explanations for fully connected network predictions without the bias that value occlusion often adds.
The methods reach performance levels comparable to leading attribution techniques on standard image signal detection metrics.
Simple rule-based perturbation of weights offers one path to more stable interpretability in basic deep networks.
This approach contributes a framework for reducing long-standing weaknesses in occlusion-based explainability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same weight perturbation idea could be tried on other model types where changing input values risks creating unrealistic samples.
It may reduce reliance on separate validation steps that current attribution methods often need.
Similar weight-focused changes might help explain decisions in models beyond image tasks.

Load-bearing premise

That perturbing weights attached to features produces a valid and unbiased measure of feature importance that directly addresses the added bias and out-of-distribution problems of value perturbation without introducing new artifacts or requiring additional validation on the specific network architecture.

What would settle it

A direct comparison on image classification tasks with fully connected networks where XWP attributions fail to highlight the same input regions that human-labeled ground truth or multiple other attribution methods consistently identify.
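One way such a comparison could be scored is the overlap between the highest-attributed pixels and a ground-truth mask. The sketch below is an assumption about how agreement might be quantified, not a metric reported on this page; the helper `topk_iou` and the toy arrays are hypothetical.

```python
import numpy as np


def topk_iou(attribution, mask):
    """IoU between the k highest-scoring pixels and a boolean ground-truth
    mask, where k is the number of pixels in the mask."""
    attribution = np.asarray(attribution, dtype=float).ravel()
    mask = np.asarray(mask, dtype=bool).ravel()
    k = int(mask.sum())
    if k == 0:
        raise ValueError("ground-truth mask is empty")
    top = np.zeros_like(mask)
    # Stable sort on negated scores: highest attribution first.
    top[np.argsort(-attribution, kind="stable")[:k]] = True
    inter = np.logical_and(top, mask).sum()
    union = np.logical_or(top, mask).sum()
    return inter / union


mask = np.array([[0, 1, 1], [0, 1, 0], [0, 0, 0]], dtype=bool)
good = np.array([[0.0, 0.9, 0.8], [0.1, 0.7, 0.0], [0.0, 0.0, 0.0]])
bad = np.array([[0.9, 0.0, 0.0], [0.0, 0.0, 0.0], [0.8, 0.7, 0.0]])

print(topk_iou(good, mask))  # 1.0
print(topk_iou(bad, mask))   # 0.0
```

A score near 0 for XWP on inputs where other methods score near 1 would be the kind of failure described above.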

Figures

Figures reproduced from arXiv: 2605.15328 by Denia Kanellopoulou, Thodoris Lymperopoulos.

**Figure 1.** The produced attribution maps for our proposed methods, using weight perturbation for estimating feature attribution.

**Figure 2.** A visualization of first layer weights of a trained FCNN on TMNIST, for specific neurons of the first layer. Distinct ...

**Figure 3.** The importance scores of different attribution methods for the Typeface MNIST and Fashion MNIST datasets. A visual ...

**Figure 4.** The evolution of Deletion AUC for different attribution methods. Shapley values and Integrated Gradients experience ...

**Figure 5.** The importance scores of different attribution methods for the Typeface MNIST datasets. XWP ...

**Figure 6.** The importance scores of different attribution methods for the Fashion MNIST datasets. Both XWP and XWP ...
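Figure 4 refers to Deletion AUC. For readers unfamiliar with the metric, the sketch below shows a common form of it: features are removed in order of decreasing attribution, the target score is recorded after each step, and the area under that curve is reported (lower suggests a more faithful ranking). The baseline value, the number of steps, and the use of a raw score rather than a probability are assumptions; this page does not state the paper's exact protocol, and `deletion_auc` and the toy linear model are hypothetical.

```python
import numpy as np


def deletion_auc(predict, x, attribution, target, baseline=0.0, steps=10):
    """Area under the deletion curve for a 1-D input.

    predict: callable mapping a 1-D input to a vector of class scores.
    Features are replaced by `baseline` in order of decreasing attribution.
    """
    order = np.argsort(-np.asarray(attribution, dtype=float), kind="stable")
    x_cur = np.asarray(x, dtype=float).copy()
    fractions = [0.0]
    curve = [float(predict(x_cur)[target])]
    removed = 0
    for chunk in np.array_split(order, steps):
        x_cur[chunk] = baseline
        removed += len(chunk)
        fractions.append(removed / x_cur.size)
        curve.append(float(predict(x_cur)[target]))
    fractions = np.array(fractions)
    curve = np.array(curve)
    # Trapezoidal rule over the fraction of features removed.
    return float(np.sum((curve[1:] + curve[:-1]) / 2.0 * np.diff(fractions)))


# Toy linear "model": class 0 score is w . v, class 1 score is constant.
w = np.array([3.0, 2.0, 1.0, 0.0])
x = np.ones(4)


def predict(v):
    return np.array([v @ w, 0.0])


print(deletion_auc(predict, x, w * x, target=0, steps=4))   # 1.75
print(deletion_auc(predict, x, -w * x, target=0, steps=4))  # 4.25
```

In the toy case the correct ranking (`w * x`) gives the smaller area, while the reversed ranking gives the larger one.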

Original abstract

Fully Connected Neural Networks (FCNNs) are often regarded as simple and intuitive architectures, yet they serve as the foundation for more complex models. Nonetheless, the lack of consensus on their interpretability continues to pose challenges, underscoring the enduring relevance of simpler, attribution-based approaches for understanding even the most advanced neural architectures. In this regard, we explore a novel idea for estimating feature attribution, by applying perturbation to the features' attached weights instead of their values. This method offers a fresh perspective aimed at mitigating common limitations in Occlusion techniques, such as Added Bias and Out-of-Distribution data. The application of this rule leads to the formation of a pair of novel attribution methods we call XWP and XWP_c. Founded on simple rules, our methods achieve competitive performance in identifying image signals for simple DNNs, competing with the most established attribution methods on standard baseline metrics. Our work thus contributes to the field of Explainability by introducing a robust framework that paves the way for addressing these long-standing vulnerabilities, and leads to more reliable and interpretable model explanations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes perturbing weights attached to features rather than input values for attribution in FCNNs, but the abstract supplies no numbers or derivations to back the claim that this fixes occlusion problems.


The main takeaway is that this work suggests a weight-perturbation route to feature attribution for fully connected networks, framing it as a way to dodge the added bias and out-of-distribution issues that come with standard occlusion. They name the resulting methods XWP and XWP_c and say these rest on simple rules while matching established techniques on baseline metrics for image signals in simple DNNs.

Referee Report

3 major / 1 minor

Summary. The paper claims that perturbing the weights attached to input features (rather than their values) in fully connected neural networks yields two new attribution methods, XWP and XWP_c, that mitigate the added-bias and out-of-distribution problems of occlusion techniques while achieving competitive performance on standard baseline metrics for identifying image signals in simple DNNs.

Significance. If the weight-perturbation approach can be shown to isolate marginal feature contributions without introducing new global-model or interaction artifacts, and if the competitive performance is confirmed by quantitative experiments with proper controls, the work would supply a simple, parameter-light alternative to value-based occlusion that could improve reliability of explanations for FCNNs and their descendants.

major comments (3)

[Abstract] Abstract: the claim of 'competitive performance ... on standard baseline metrics' is stated without any quantitative results, error bars, or experimental-setup details, leaving the central empirical claim unsupported by verifiable evidence.
[Method description] Description of XWP and XWP_c: no derivation (via chain rule, marginal contribution, or Shapley-style decomposition) is supplied showing that the output delta obtained by perturbing a feature's attached weight equals the marginal contribution of that feature or remains unbiased once non-linearities and downstream layers are present; the mitigation of occlusion artifacts is therefore asserted rather than demonstrated.
[Experimental evaluation] Experimental evaluation: the absence of any reported numbers, ablation studies, or comparisons with established methods (e.g., occlusion, gradient-based) prevents assessment of whether the proposed scores actually avoid the confounding the skeptic note identifies.
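The gap named in the second major comment can be made concrete for the first layer alone. What follows is a standard linear-algebra identity, not a derivation taken from the paper; the symbols $W$, $b$, $x$ and the scaling factor $\alpha$ are introduced here for illustration, and the paper's actual perturbation rule may differ.

```latex
\begin{aligned}
z_i &= \sum_{k} W_{ik}\, x_k + b_i \\
W^{(j,\alpha)}_{ik} &=
  \begin{cases}
    \alpha\, W_{ik} & k = j \\
    W_{ik} & k \neq j
  \end{cases} \\
z^{(j,\alpha)}_i &= \sum_{k} W^{(j,\alpha)}_{ik}\, x_k + b_i
  = z_i - (1 - \alpha)\, W_{ij}\, x_j
\end{aligned}
```

Under this assumed multiplicative rule, scaling the weights attached to feature $j$ by $\alpha$ produces the same first-layer pre-activation as scaling $x_j$ by $\alpha$ with the weights unchanged, and $\alpha = 0$ coincides with occlusion using a zero baseline. Any advantage over value occlusion would therefore have to come from a perturbation rule that differs from plain scaling, which is what the requested derivation would need to show.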

minor comments (1)

[Abstract] The subscript 'c' in XWP_c is introduced without an immediate explanation of how the variant differs from the base XWP method.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments that identify key areas for strengthening the manuscript. We address each major comment below and will make substantial revisions to provide the requested empirical support, derivations, and comparisons.


Referee: [Abstract] Abstract: the claim of 'competitive performance ... on standard baseline metrics' is stated without any quantitative results, error bars, or experimental-setup details, leaving the central empirical claim unsupported by verifiable evidence.

Authors: We agree that the abstract's claim requires concrete support. In the revised manuscript we will update the abstract to report specific quantitative metrics (e.g., AUC or accuracy on baseline tests) together with error bars and a concise description of the experimental protocol. revision: yes
Referee: [Method description] Description of XWP and XWP_c: no derivation (via chain rule, marginal contribution, or Shapley-style decomposition) is supplied showing that the output delta obtained by perturbing a feature's attached weight equals the marginal contribution of that feature or remains unbiased once non-linearities and downstream layers are present; the mitigation of occlusion artifacts is therefore asserted rather than demonstrated.

Authors: The referee correctly notes the absence of a formal derivation. We will add a dedicated subsection that derives the attribution scores from a marginal-contribution perspective and analyzes the effect of non-linear activations and subsequent layers to clarify when the weight-perturbation delta remains unbiased. revision: yes
Referee: [Experimental evaluation] Experimental evaluation: the absence of any reported numbers, ablation studies, or comparisons with established methods (e.g., occlusion, gradient-based) prevents assessment of whether the proposed scores actually avoid the confounding the skeptic note identifies.

Authors: We acknowledge that the current manuscript lacks the quantitative evidence needed for rigorous assessment. The revision will include tabulated numerical results, ablation studies on perturbation magnitude and choice of baseline, and head-to-head comparisons against occlusion and gradient-based methods on the same datasets and metrics. revision: yes

Circularity Check

0 steps flagged

No circularity: method defined by direct rule application, evaluated empirically


The paper proposes XWP and XWP_c by directly defining feature attribution via weight perturbation on attached parameters rather than input values. No equations, derivations, or self-citations are presented that reduce the attribution scores to fitted parameters, prior self-referential results, or inputs by construction. Performance is assessed via standard baseline metrics on image signals for simple DNNs, making the central contribution an empirical proposal rather than a closed-loop derivation. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that weight perturbation avoids the documented flaws of input perturbation; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)

domain assumption: Perturbing weights attached to features yields a faithful attribution without introducing bias or distribution shift.
This premise underpins the formation of XWP and XWP_c and the claim of mitigation of occlusion limitations.

pith-pipeline@v0.9.0 · 5719 in / 1197 out tokens · 41886 ms · 2026-05-19T16:01:48.329692+00:00 · methodology


