From Attribution to Action: A Human-Centered Application of Activation Steering
Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3
The pith
Activation steering paired with attribution lets practitioners test hypotheses about model behavior through direct interventions instead of passive inspection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that activation steering renders interpretability actionable by enabling intervention-based hypothesis testing. As evidence, eight experts used the workflow on CLIP to debug concept usage: they grounded their trust primarily in observed changes to model outputs, adopted suppression-dominated strategies, and identified practical limits, including ripple effects and poor generalization of instance-level fixes.
What carries the argument
The interactive web-based workflow that combines SAE-based attribution with activation steering to support instance-level concept analysis and targeted interventions in vision models.
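The attribution half of that workflow can be sketched in a few lines: encode an activation with the SAE, then score each latent by its linear contribution to a class read-out. Everything below is a toy stand-in (random weights, an assumed linear read-out, a gradient-times-activation-style score), not the paper's actual SAE or CLIP internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins; all names and shapes are illustrative assumptions.
d_model, d_sae = 16, 64
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)   # SAE encoder
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)     # SAE decoder
w_class = rng.normal(size=d_model)   # linear read-out for one class
a = rng.normal(size=d_model)         # one instance's activation

z = np.maximum(W_enc @ a, 0.0)       # sparse latent code (ReLU)

# Linear attribution: how much each active latent contributes to the
# class logit through the decoder (a gradient-times-activation style score).
scores = z * (w_class @ W_dec)

# Candidate components for targeted steering, ranked by |contribution|.
top = np.argsort(-np.abs(scores))[:5]
```

Inactive latents (z = 0) get zero attribution by construction, which is what makes the ranking instance-level rather than global.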
If this is right
- Users can verify attributions by steering components and directly observing prediction changes.
- Debugging workflows become dominated by targeted suppression of identified components.
- Trust in the method rests more on empirical model responses than on the initial attribution quality.
- Instance-level steering corrections may not transfer reliably to other inputs or models.
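The first bullet's verify-by-intervention loop can be sketched as: suppress a latent, re-run the forward pass, and compare logits. The SAE, classification head, and shapes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-ins: a frozen SAE and a linear head (assumptions).
d_model, d_sae, n_cls = 16, 64, 3
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)
head = rng.normal(size=(n_cls, d_model))
a = rng.normal(size=d_model)

def logits_with_suppression(a, suppressed=()):
    """Encode, zero the chosen latents, decode, then classify."""
    z = np.maximum(W_enc @ a, 0.0)
    z[list(suppressed)] = 0.0
    return head @ (W_dec @ z)

base = logits_with_suppression(a)

# Hypothesis test: suppress the strongest latent and watch the logits move.
z = np.maximum(W_enc @ a, 0.0)
target = int(np.argmax(z))
steered = logits_with_suppression(a, suppressed=[target])
delta = steered - base   # the directly observed effect of the intervention
```

If `delta` is negligible, the attribution was not load-bearing for this prediction; a large shift is the kind of empirical response participants reported trusting.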
Where Pith is reading between the lines
- The workflow could support iterative model refinement loops where users steer, observe, and repeat until behavior stabilizes.
- Risks like ripple effects suggest the need for batch-level steering checks before deploying instance fixes.
- Extending the approach beyond vision to language or multimodal models would require new attribution-steering interfaces.
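The batch-level check suggested in the second bullet might look like the following sketch: apply the instance-level suppression across a held-out batch and measure how many unrelated predictions flip. The weights and the 5% deployment threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random stand-ins for the model, SAE, and data (all assumptions).
d_model, d_sae, n_cls, n_batch = 16, 64, 3, 200
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)
head = rng.normal(size=(n_cls, d_model))
batch = rng.normal(size=(n_batch, d_model))
suppressed_latent = 7   # hypothetical instance-level "fix"

def predict(acts, suppress=None):
    """Batched encode -> (optional suppression) -> decode -> class labels."""
    z = np.maximum(acts @ W_enc.T, 0.0)
    if suppress is not None:
        z[:, suppress] = 0.0
    return np.argmax(z @ W_dec.T @ head.T, axis=1)

before = predict(batch)
after = predict(batch, suppress=suppressed_latent)

# Ripple effect: fraction of held-out predictions the fix flips.
ripple_rate = float(np.mean(before != after))
safe_to_deploy = ripple_rate < 0.05   # assumed tolerance, not from the paper
```

A gate like this turns the qualitative "ripple effects" risk into a measurable pre-deployment criterion.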
Load-bearing premise
Insights drawn from eight experts performing specific debugging tasks on CLIP reflect how practitioners would generally reason about and apply activation steering in other settings.
What would settle it
A larger study with more diverse practitioners and models. If most users there continued to ground trust in explanation plausibility rather than shifting to intervention-based testing, the paper's core claim would be undermined.
Original abstract
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an interactive web-based workflow that integrates Sparse Autoencoder (SAE) attribution with activation steering for instance-level analysis of concept usage in vision models such as CLIP. The workflow is evaluated via semi-structured expert interviews (N=8) on debugging tasks, yielding the claims that steering shifts all participants (8/8) from inspection to intervention-based hypothesis testing, that most (6/8) ground trust in observed model outputs rather than explanation plausibility, that component suppression dominates strategies (7/8), and that users identify risks including ripple effects and limited generalization of instance-level corrections.
Significance. If the qualitative patterns are robust, the work provides timely human-centered evidence that activation steering can convert passive XAI attributions into actionable interventions. The explicit reporting of user strategies and risks (e.g., ripple effects) adds practical value often missing from purely technical steering papers. The implementation of a concrete tool and the focus on real debugging tasks strengthen the bridge from attribution to action, offering design implications for future interpretability systems.
Major comments (2)
- [Section 4] Section 4 (User Study / Methodology): The description of the semi-structured interview protocol, participant selection criteria, exact debugging tasks, interview guide, qualitative coding process, and any bias-mitigation steps (e.g., inter-rater reliability) is insufficiently detailed. Because the central claims consist of specific fractions (8/8, 6/8, 7/8) derived from these interviews, the absence of this information prevents verification that the reported patterns are not artifacts of task framing or analysis choices.
- [Findings and Discussion] Findings and Discussion sections: The extrapolation from N=8 experts performing CLIP-specific debugging to general practitioner reasoning about activation steering is load-bearing for the paper's broader utility claim, yet the manuscript provides no additional validation (e.g., larger sample, different models, or quantitative measures) and only limited caveats regarding selection bias and task specificity. This weakens the link between the observed behaviors and the asserted shift toward actionable interpretability.
Minor comments (2)
- [Abstract and Section 3] Abstract and Section 3: SAE is introduced without an initial expansion or reference to its standard definition (Sparse Autoencoder), which may hinder readability for readers outside the immediate subfield.
- The paper would benefit from a table summarizing participant demographics, task completion times, or strategy frequencies to make the N=8 results more transparent at a glance.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The feedback identifies two key areas for improvement: greater methodological transparency in the user study and stronger caveats around generalizability. We agree on both points and will revise the manuscript accordingly. Below we respond point-by-point to the major comments, indicating the changes we will make.
Point-by-point responses
-
Referee: [Section 4] Section 4 (User Study / Methodology): The description of the semi-structured interview protocol, participant selection criteria, exact debugging tasks, interview guide, qualitative coding process, and any bias-mitigation steps (e.g., inter-rater reliability) is insufficiently detailed. Because the central claims consist of specific fractions (8/8, 6/8, 7/8) derived from these interviews, the absence of this information prevents verification that the reported patterns are not artifacts of task framing or analysis choices.
Authors: We agree that the current description of the methodology is insufficient for independent verification. In the revised manuscript we will substantially expand Section 4 to include: (1) the complete semi-structured interview protocol and the full interview guide with example questions; (2) explicit participant selection criteria, including recruitment channels, years of experience in ML interpretability, and prior exposure to vision-language models; (3) precise descriptions of the three debugging tasks, including the images, target concepts, and success criteria given to participants; (4) the qualitative analysis procedure, specifying the thematic coding approach, how the 8/8, 6/8, and 7/8 counts were derived, and any inter-rater reliability assessment (we will report Cohen’s kappa or describe the consensus process used); and (5) bias-mitigation steps such as pilot interviews, question ordering to avoid leading participants, and steps taken to reduce confirmation bias during coding. These additions will allow readers to evaluate whether the observed patterns could be artifacts of task framing or analysis choices. revision: yes
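For the promised inter-rater reliability figure, Cohen's kappa over two coders' labels on the same interview excerpts is straightforward to compute. A minimal sketch; the rater labels in the usage note are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' codes over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

For example, with hypothetical codes `["suppress", "suppress", "amplify", "suppress"]` and `["suppress", "amplify", "amplify", "suppress"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.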
-
Referee: [Findings and Discussion] Findings and Discussion sections: The extrapolation from N=8 experts performing CLIP-specific debugging to general practitioner reasoning about activation steering is load-bearing for the paper's broader utility claim, yet the manuscript provides no additional validation (e.g., larger sample, different models, or quantitative measures) and only limited caveats regarding selection bias and task specificity. This weakens the link between the observed behaviors and the asserted shift toward actionable interpretability.
Authors: We acknowledge that the manuscript’s language occasionally implies broader applicability than the N=8, CLIP-specific evidence strictly supports. While qualitative studies of this size are common in human-centered XAI for generating initial insights, we agree that stronger caveats and clearer scoping are needed. In the revision we will: (1) expand the Limitations subsection to explicitly address selection bias (participants were drawn from a convenience sample of interpretability researchers), task specificity (debugging tasks were limited to CLIP on natural-image classification), and the absence of quantitative or cross-model validation; (2) revise the Findings and Discussion sections to use more qualified language (e.g., “in this study, all participants…” rather than generalizing to “practitioners”); and (3) add a forward-looking paragraph outlining concrete next steps for larger-scale or quantitative follow-up studies. We maintain that the observed shift from inspection to intervention-based reasoning is a substantive finding for the studied setting and provides actionable design implications, but we will no longer present it as a general claim without further evidence. revision: partial
Circularity Check
No circularity: empirical interview study with no derivations or self-referential predictions
Full rationale
The paper reports qualitative findings from semi-structured interviews (N=8) on a specific SAE+steering workflow for CLIP debugging tasks. It contains no mathematical equations, fitted parameters, predictions, or derivation chains. All central claims (e.g., 8/8 participants shifting to intervention-based testing, 7/8 using component suppression) are directly grounded in new interview data rather than reducing to prior author work, self-definitions, or fitted inputs. Self-citations, if present, are not load-bearing for the reported observations. The study is self-contained against external benchmarks as a human-centered empirical investigation.