Fidelity of interpretability methods and perturbation artifacts in neural networks. arXiv preprint arXiv:2203.02928, 2022

Lennart Brocki, Neo Christopher Chung · 2022 · arXiv 2203.02928

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

AIM is a new saliency-guided adversarial feature replacement method to evaluate faithfulness of saliency maps and reliability of masking operators on image, audio, and EEG tasks.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

citing papers explorer

Showing 2 of 2 citing papers.

AIM: Adversarial Information Masking for Faithfulness Evaluation of Saliency Maps cs.LG · 2026-05-16 · unverdicted · none · ref 26
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 293
