Mechanistic Interpretability for AI Safety -- A Review
Pith reviewed 2026-05-22 14:17 UTC · model grok-4.3
The pith
Reverse engineering neural networks into human-understandable algorithms can provide the causal understanding needed to make advanced AI systems safe.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mechanistic interpretability involves reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts, providing a granular, causal understanding of model behaviors that is critical for ensuring value alignment and safety in AI systems.
What carries the argument
Mechanistic interpretability, defined as reverse-engineering neural network computations and representations into human-understandable algorithms and concepts.
If this is right
- Granular causal understanding enables targeted interventions to align model behaviors with human values.
- Dissection of computations supports proactive control to reduce risks of unintended or harmful actions.
- Identification of safety-relevant mechanisms helps manage dual-use concerns in interpretability tools.
- Expansion to new domains like vision and reinforcement learning extends safety benefits beyond language models.
Where Pith is reading between the lines
- If successful at scale, these methods could enable ongoing audits of deployed systems to catch emergent unsafe behaviors before they cause harm.
- Success in one architecture might generalize to others, allowing shared safety insights across different model families.
- Combining mechanistic insights with other safety techniques could create layered defenses against catastrophic outcomes.
Load-bearing premise
Reverse-engineering neural network computations into human-understandable algorithms and concepts is feasible at scale for complex models and behaviors.
What would settle it
A demonstration that a large language model exhibits a critical safety failure, such as generating harmful outputs from an unknown internal circuit, that cannot be explained or mitigated through any identified mechanistic features or interventions.
Original abstract
Understanding AI systems' inner workings is critical for ensuring value alignment and safety. This review explores mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding. We establish foundational concepts such as features encoding knowledge within neural activations and hypotheses about their representation and computation. We survey methodologies for causally dissecting model behaviors and assess the relevance of mechanistic interpretability to AI safety. We examine benefits in understanding, control, alignment, and risks such as capability gains and dual-use concerns. We investigate challenges surrounding scalability, automation, and comprehensive interpretation. We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors and expand to domains such as vision and reinforcement learning. Mechanistic interpretability could help prevent catastrophic outcomes as AI systems become more powerful and inscrutable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This review synthesizes mechanistic interpretability research, defining it as reverse-engineering neural network computations and representations into human-understandable algorithms and concepts to enable granular causal understanding. It covers foundational ideas such as features in activations and hypotheses about representation and computation; surveys causal dissection methods; weighs benefits for understanding, control, and alignment against risks such as capability enhancement and dual use; examines challenges in scalability, automation, and comprehensive coverage; and advocates clarifying concepts, setting standards, and scaling techniques to complex models and to domains including vision and reinforcement learning. The paper concludes that mechanistic interpretability could help prevent catastrophic outcomes as AI systems grow more powerful and inscrutable.
Significance. If the literature synthesis is accurate, the review offers a structured overview that connects mechanistic interpretability techniques to AI safety goals while acknowledging practical limits. It gives credit to progress on toy models and circuit-level analyses, which provides a foundation for discussing safety applications, though the safety relevance remains framed as a forward-looking possibility rather than a demonstrated outcome.
Major comments (2)
- [Abstract and Relevance to AI Safety section] Abstract and section assessing relevance to AI safety: the claim that mechanistic interpretability 'could help prevent catastrophic outcomes' rests on the feasibility of scaling reverse-engineering to frontier models, yet the review notes intractability of circuit enumeration for large systems without synthesizing specific evidence or extensions from the cited literature showing how current methods (e.g., activation patching or causal interventions) generalize beyond toy scales.
- [Challenges section] Section investigating challenges surrounding scalability, automation, and comprehensive interpretation: the advocacy for 'scaling techniques' and 'standards' is presented as a solution path, but lacks concrete discussion of how automation or partial interpretation approaches address the acknowledged gap between current capabilities on small models and the requirements for 100B+ parameter systems.
Minor comments (2)
- [Abstract] The abstract packs multiple topics into a single paragraph; splitting the summary of methods from the safety assessment and advocacy would improve readability.
- [Methodologies survey] Some methodological descriptions could include brief pointers to key example papers or figures to help readers trace the surveyed techniques.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our review. We have carefully considered the major comments regarding the strength of claims about AI safety benefits and the concreteness of proposed solutions in the challenges section. Below we respond point by point and indicate revisions that will be incorporated into the next version of the manuscript.
Point-by-point responses
-
Referee: [Abstract and Relevance to AI Safety section] Abstract and section assessing relevance to AI safety: the claim that mechanistic interpretability 'could help prevent catastrophic outcomes' rests on the feasibility of scaling reverse-engineering to frontier models, yet the review notes intractability of circuit enumeration for large systems without synthesizing specific evidence or extensions from the cited literature showing how current methods (e.g., activation patching or causal interventions) generalize beyond toy scales.
Authors: We agree that the safety relevance is prospective and that the manuscript must more explicitly connect the acknowledged scalability limits to cited evidence of progress. The review already notes the intractability of exhaustive circuit enumeration at frontier scales and frames benefits as forward-looking. To address this comment directly, we will revise the abstract and the relevance section to synthesize specific extensions from the literature, including recent applications of activation patching and causal interventions to models beyond toy scales (e.g., work on larger transformers and multimodal systems). This will clarify the basis for cautious optimism without overstating current capabilities.
Revision: yes
-
Referee: [Challenges section] Section investigating challenges surrounding scalability, automation, and comprehensive interpretation: the advocacy for 'scaling techniques' and 'standards' is presented as a solution path, but lacks concrete discussion of how automation or partial interpretation approaches address the acknowledged gap between current capabilities on small models and the requirements for 100B+ parameter systems.
Authors: The referee correctly identifies an opportunity to make the discussion of solutions more concrete. The manuscript discusses automation and partial interpretation as necessary directions but does not provide sufficient detail on specific techniques or their demonstrated reach. In the revised version we will expand the challenges section with concrete examples drawn from the surveyed literature, including automated circuit discovery pipelines, sparse autoencoder-based feature extraction, and partial interpretation methods that have been applied to models in the 1B–10B parameter range. We will also explicitly discuss the remaining gap to 100B+ systems and the role of standards in guiding future work.
Revision: yes
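Activation patching, named in the exchange above, can be illustrated on a toy network. The following is a minimal sketch, not a method taken from the reviewed paper: the two-layer network, its weights, the inputs, and every function name are hypothetical choices made only for illustration. The idea is to run the model on a "corrupted" input, overwrite one hidden activation with the value it takes on a "clean" input, and measure how far the output moves back toward the clean result.

```python
# Toy activation patching on a hand-built two-layer network (pure Python).
# All weights, inputs, and names are hypothetical illustrations.

def relu(x):
    return max(0.0, x)

W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units, 2 inputs
W2 = [2.0, 0.0, -1.0]                      # hidden units -> scalar output

def hidden(x):
    """Hidden-layer activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Scalar output computed from hidden activations h."""
    return sum(w * hi for w, hi in zip(W2, h))

def patched_output(corrupt_x, clean_x, unit):
    """Run on corrupt_x, but overwrite one hidden unit with its clean value."""
    h = hidden(corrupt_x)
    h[unit] = hidden(clean_x)[unit]
    return output(h)

clean_x, corrupt_x = [1.0, 0.0], [0.0, 1.0]
clean_out = output(hidden(clean_x))
corrupt_out = output(hidden(corrupt_x))

# Effect of patching each hidden unit: change in output relative to the corrupted run.
effects = [patched_output(corrupt_x, clean_x, u) - corrupt_out for u in range(3)]
print(clean_out, corrupt_out, effects)
```

In this toy setup, patching hidden unit 0 alone restores the clean output, while patching units 1 or 2 changes nothing, so unit 0 carries the difference between the two inputs. Whether such per-component attributions remain tractable and faithful at frontier scale is exactly the open question the referee raises.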
Circularity Check
Review paper presents no derivations or predictions that reduce to inputs
Full rationale
This is a literature review summarizing external concepts, methods, challenges, and safety relevance of mechanistic interpretability without any original equations, fitted parameters, predictions, or first-principles derivations. The abstract and body survey prior work, note scalability issues, and advocate for future standards and scaling, but make no claims that reduce by construction to self-referential definitions or self-citations. The safety relevance is framed as a potential outcome based on reviewed literature rather than a derived result internal to the paper. No load-bearing steps exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Mechanistic interpretability: reverse engineering the computational mechanisms and representations learned by neural networks into human-understandable algorithms and concepts to provide a granular, causal understanding.
-
IndisputableMonolith.Foundation.PhiForcing · phi_equation (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Superposition hypothesis: neural networks represent more features than they have neurons by encoding features in overlapping combinations of neurons.
-
IndisputableMonolith.Foundation.DimensionForcing · dimension_forced (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We advocate for clarifying concepts, setting standards, and scaling techniques to handle complex models and behaviors.
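The superposition hypothesis quoted in the list above can be made concrete with a small numerical sketch. This is an illustration under stated assumptions, not a construction from the paper: three features are packed into two "neurons" using hypothetical unit directions spaced 120 degrees apart, and each feature is read back by projecting onto its direction and applying a ReLU to suppress interference.

```python
import math

# Three hypothetical feature directions in a 2-neuron space, 120 degrees apart.
directions = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(3)
]

def encode(features):
    """Map 3 feature values to 2 neuron activations by summing weighted directions."""
    return tuple(
        sum(f * d[i] for f, d in zip(features, directions)) for i in range(2)
    )

def decode(neurons):
    """Project onto each feature direction; ReLU removes negative interference."""
    return [
        max(0.0, sum(n * di for n, di in zip(neurons, d))) for d in directions
    ]

sparse = decode(encode([1.0, 0.0, 0.0]))  # one active feature
dense = decode(encode([1.0, 1.0, 0.0]))   # two active features at once
print(sparse, dense)
```

With a single active feature, the readout recovers it cleanly because the overlap with the other directions is negative and the ReLU zeroes it out. With two features active at once, each readout drops to roughly half its true value, which shows why this kind of overlapping encoding depends on features being sparse.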
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 26 Pith papers
-
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
-
Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning
A framework using sparse autoencoders decomposes concept-level forgetting in supervised continual learning into apparent deletion, recoverability, and decodability, showing substantial recoverability under linearity a...
-
Data-driven Circuit Discovery for Interpretability of Language Models
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
-
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
SoftSAE introduces a dynamic top-k selection mechanism in sparse autoencoders that learns an input-dependent sparsity level via a differentiable soft top-k operator.
-
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks
Multi-hop graph analysis of RNNs reveals temporal information routing and motivates resolvent regularization that outperforms L1 by enforcing pathway-level sparsity aligned with task structure.
-
Unifying Dynamical Systems and Graph Theory to Mechanistically Understand Computation in Neural Networks
RNN computation is recovered from multi-hop graph pathways, and constraining these pathways via resolvent regularization yields improved temporal sparsity and task performance over standard L1.
-
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisone...
-
Task complexity shapes internal representations and robustness in neural networks
Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-prec...
-
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.
-
From Weight Perturbation to Feature Attribution for Explaining Fully Connected Neural Networks
XWP and XWP_c are novel attribution methods for FCNNs that estimate feature importance by perturbing attached weights to avoid added bias and out-of-distribution issues in occlusion approaches.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
-
SoftSAE: Dynamic Top-K Selection for Adaptive Sparse Autoencoders
SoftSAE replaces fixed-K sparsity in autoencoders with a learned, input-dependent number of active features via a soft top-k operator.
-
Confidence Estimation in Automatic Short Answer Grading with LLMs
A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.
-
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Gaussian probing infers harmful model specialization from parameter perturbations and internal representation responses to Gaussian latent ensembles rather than from generated outputs.
-
When AI reviews science: Can we trust the referee?
AI peer review systems are vulnerable to prompt injections, prestige biases, assertion strength effects, and contextual poisoning, as demonstrated by a new attack taxonomy and causal experiments on real conference sub...
-
What Physics do Data-Driven MoCap-to-Radar Models Learn?
Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.
-
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
-
Quantifying Trust: Financial Risk Management for Trustworthy AI Agents
The paper introduces the Agentic Risk Standard (ARS) as a payment settlement framework that delivers predefined compensation for AI agent execution failures, misalignment, or unintended outcomes.
-
On the definition and importance of interpretability in scientific machine learning
Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.
-
Superposition Yields Robust Neural Scaling
Strong superposition causes neural loss to scale as the inverse of model dimension due to geometric feature overlaps, explaining scaling laws for broad frequency distributions.
-
Confidence Estimation in Automatic Short Answer Grading with LLMs
A hybrid confidence framework for LLM-based automatic short answer grading integrates model-based signals with aleatoric uncertainty from semantic clustering of responses and yields more reliable estimates than single...
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
-
Do Activation Verbalization Methods Convey Privileged Information?
Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
-
Mechanistic Interpretability Needs Philosophy
The paper claims that mechanistic interpretability needs philosophy as a partner to clarify concepts, refine methods, and navigate epistemic and ethical complexities in AI systems.
-
LLM-Safety Evaluations Lack Robustness
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
-
Enhancing Adversarial Robustness in Network Intrusion Detection: A Layer-wise Adaptive Regularization Approach
LARAR enhances adversarial robustness in network intrusion detection by using layer-wise adaptive regularization and auxiliary classifiers, achieving 95.01% clean accuracy and improved defense against FGSM, PGD, and t...