Radical AI Interpretability
Pith reviewed 2026-06-26 05:24 UTC · model grok-4.3
The pith
Interpreting AI systems requires solving holistically for beliefs, desires, and meanings from computational facts rather than attributing them piecemeal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given computational facts about an AI system, radical interpretation solves for its beliefs, desires, and meanings by treating the system as a rational agent. Criteria are proposed for representationalist and interpretationist approaches and tied to tests that current mechanistic interpretability tools can carry out. The key claim is that these attributions cannot be made piecemeal: beliefs, desires, and propositional structure are jointly constrained, so a method that fixes one while measuring the others inherits whatever distortions that introduces. The holism is pressing when the system may not share the interpreter's concepts, but it also provides leverage: the system's attitudes constrain its propositional structure, that structure constrains which attitudes can be attributed, and mechanistic interpretability can help measure both.
What carries the argument
Radical interpretation applied to AI, which solves for an agent's beliefs and desires from computational and behavioral facts under rationality assumptions while enforcing joint constraints across attitudes and propositional structure.
If this is right
- Interpretability methods must be evaluated against criteria that check the entire system of attitudes together rather than isolating individual beliefs or desires.
- Any approach that measures one attitude while holding others fixed will carry over distortions from the fixed components into the measured ones.
- A system's attitudes and its propositional structure mutually constrain each other, allowing mechanistic measurements of either to narrow the possibilities for the other.
- When an AI may not share the interpreter's concepts, the joint constraints still limit which overall assignments remain consistent with the observed computational facts.
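The distortion and mutual-constraint points above can be illustrated with a toy decision-theoretic case. The sketch below is an illustration under stated assumptions, not something taken from the paper: it supposes an expected-utility agent choosing between a sure payoff and a gamble, and every name in it (`certainty_equivalent`, `infer_belief`, the square-root utility, the payoffs) is hypothetical. It shows an interpreter who fixes the agent's desires as risk-neutral and then reads a belief off the agent's indifference point, recovering a distorted credence, while a second observation separates the two candidate belief-desire pairs.

```python
import math


def certainty_equivalent(p, utility, inverse_utility, win, lose):
    """Sure payoff the agent values exactly as much as the gamble."""
    expected = p * utility(win) + (1 - p) * utility(lose)
    return inverse_utility(expected)


def infer_belief(sure_amount, utility, win, lose):
    """Solve p*u(win) + (1-p)*u(lose) = u(sure_amount) for p,
    given an ASSUMED utility function."""
    return (utility(sure_amount) - utility(lose)) / (utility(win) - utility(lose))


# Hypothetical agent: credence 0.5 in winning, concave (square-root) utility.
true_p = 0.5
sqrt_u, sqrt_inv = math.sqrt, (lambda v: v ** 2)
linear_u = linear_inv = (lambda x: x)

# Observation 1: indifference point for a gamble paying 100 or 0.
obs1 = certainty_equivalent(true_p, sqrt_u, sqrt_inv, win=100.0, lose=0.0)

# Piecemeal reading: fix desires as risk-neutral, then measure the belief.
piecemeal_p = infer_belief(obs1, linear_u, win=100.0, lose=0.0)

# Reading with the agent's actual utility recovers the actual credence.
joint_p = infer_belief(obs1, sqrt_u, win=100.0, lose=0.0)

# Observation 2: a gamble paying 100 or 25 separates the two candidate pairs.
obs2 = certainty_equivalent(true_p, sqrt_u, sqrt_inv, win=100.0, lose=25.0)
piecemeal_prediction = certainty_equivalent(
    piecemeal_p, linear_u, linear_inv, win=100.0, lose=25.0
)

print(obs1, piecemeal_p, joint_p)      # 25.0 0.25 0.5
print(obs2, piecemeal_prediction)      # 56.25 43.75
```

On the first observation alone, the pair (credence 0.25, linear utility) fits as well as the pair (credence 0.5, square-root utility); the wrong assumption about desires shows up only as a wrong belief. The second observation is what rules the piecemeal pair out, which is one simple sense in which additional joint constraints narrow the space of consistent assignments.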
Where Pith is reading between the lines
- The proposed criteria could be used to design new tests that check whether existing tools produce attributions consistent across multiple attitudes at once.
- The framework suggests prioritizing development of interpretability techniques that measure consistency between attributed beliefs, desires, and structure rather than accuracy on isolated tasks.
- Safety evaluations might incorporate checks for whether a system's behavior remains stable under small changes to one attributed attitude while holding the holistic constraints fixed.
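One way to picture a test that checks attributions "across multiple attitudes at once" is a coherence check over a set of attributed credences. The following is a minimal sketch under assumptions that are not in the paper: it supposes some interpretability tool has already produced a credence for each labelled proposition, it uses probabilistic coherence as the consistency standard, and the proposition labels, the `coherence_violations` helper, and the tolerance are all hypothetical.

```python
def coherence_violations(credences, tol=0.05):
    """Return coherence failures among attributed credences.

    `credences` maps proposition labels to attributed probabilities.
    Labels of the form "not X" and "X and Y" are parsed naively;
    this is an illustrative convention, not a real tool's output format.
    """
    violations = []

    # A proposition and its negation should receive credences summing to 1.
    for prop, p in credences.items():
        if prop.startswith("not "):
            continue
        negation = "not " + prop
        if negation in credences and abs(p + credences[negation] - 1.0) > tol:
            violations.append(("negation", prop))

    # A conjunction should not be more credible than either conjunct.
    for prop, p in credences.items():
        if " and " in prop:
            left, right = prop.split(" and ", 1)
            for part in (left, right):
                if part in credences and p > credences[part] + tol:
                    violations.append(("conjunction", prop, part))

    return violations


# Hypothetical probe outputs for four related propositions.
attributed = {
    "rain": 0.8,
    "not rain": 0.5,
    "cold": 0.3,
    "rain and cold": 0.6,
}

print(coherence_violations(attributed))
# [('negation', 'rain'), ('conjunction', 'rain and cold', 'cold')]
```

Each attribution here might look acceptable in isolation; the failures only appear when the attributions are compared with one another, which is the kind of check the bullets above gesture at.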
Load-bearing premise
The philosophical tradition of radical interpretation can be applied directly to AI systems to determine their beliefs and desires from internal computational states, with mechanistic interpretability tools supplying the necessary measurements.
What would settle it
An experiment in which a piecemeal attribution produced by current interpretability tools accurately predicts an AI system's behavior in a novel scenario where a holistic joint-constraint check predicts different behavior.
Original abstract
We develop a framework for interpreting AI systems as agents, drawing on the philosophical tradition of radical interpretation and the tools of mechanistic interpretability. The core question is: given the computational facts about a system, how do we solve for its beliefs, desires, and meanings? This matters increasingly for safety. We want to be able to trust the systems we deploy, whether by understanding their goals or, more modestly, by reliably detecting deception. Interpretability researchers are building tools to read beliefs and desires off a model's internals, but there is no settled account of when such a tool has succeeded. This book supplies one. We propose criteria on both representationalist and interpretationist approaches, and tie each to tests current interpretability methods can carry out. A central lesson is that these attributions cannot be made piecemeal. Beliefs, desires, and the propositional structure they presuppose are jointly constrained, and a method that fixes one while measuring the others inherits whatever distortions that introduces. This holism becomes pressing for AI systems, which may not share the interpreter's concepts. However, it also provides leverage: a system's attitudes constrain its propositional structure, that structure constrains which attitudes can be attributed, and mechanistic interpretability can help us measure both.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a conceptual framework for attributing beliefs, desires, and meanings to AI systems by integrating the philosophical tradition of radical interpretation with tools from mechanistic interpretability. It proposes criteria applicable to both representationalist and interpretationist approaches, links these criteria to tests performable by existing interpretability methods, and argues that such attributions must be holistic: beliefs, desires, and the propositional structures they presuppose are jointly constrained rather than attributable piecemeal. The framework is motivated by AI safety concerns, particularly the need to detect deception or understand goals from computational facts alone.
Significance. If the proposed criteria prove operationalizable, the work could bridge philosophy and AI interpretability by supplying a principled, non-piecemeal account of when an interpretability tool has succeeded in reading off internal states. The holism insight directly challenges current practices that fix one attitude while measuring others independently. No machine-checked proofs, reproducible code, or quantitative predictions are present, but the explicit tie between philosophical criteria and existing mechanistic tools is a strength that could guide future empirical work in AI safety.
Major comments (1)
- [Abstract] The central claim that attributions 'cannot be made piecemeal' because beliefs, desires, and propositional structure are 'jointly constrained' risks circularity. The criteria are said to be tied to tests that current methods can carry out, yet the holism argument presupposes the very joint constraint it seeks to enforce; without a concrete demonstration of how a specific test (e.g., a circuit-level measurement) would be revised under the framework, it is unclear whether the proposal adds non-circular constraints or merely restates the philosophical stance.
Minor comments (1)
- [Abstract] The abstract refers to 'this book' but the manuscript appears to be a paper; clarify the intended format and scope.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for highlighting the risk of circularity in the holism argument. We address this point directly below.
Point-by-point responses
- Referee: [Abstract] The central claim that attributions 'cannot be made piecemeal' because beliefs, desires, and propositional structure are 'jointly constrained' risks circularity. The criteria are said to be tied to tests that current methods can carry out, yet the holism argument presupposes the very joint constraint it seeks to enforce; without a concrete demonstration of how a specific test (e.g., a circuit-level measurement) would be revised under the framework, it is unclear whether the proposal adds non-circular constraints or merely restates the philosophical stance.
  Authors: The holism follows from the radical-interpretation setup itself: given only computational facts, the interpreter must solve simultaneously for beliefs, desires, and the propositional structure that makes those attitudes intelligible, rather than assuming any one in advance. The proposed success criteria are not derived from this holism but are instead anchored in the independent requirements of existing interpretability methods (e.g., consistency under activation patching or circuit interventions). We agree that the abstract would benefit from an explicit illustration of how a concrete test would change under the framework; we will therefore revise the abstract for clarity and insert a short worked example in the main text showing how a circuit-level measurement is re-evaluated when all three elements are constrained jointly.
  Revision: partial
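For readers unfamiliar with the activation patching mentioned in the rebuttal, the sketch below shows what such an intervention computes on a toy network. It is not the worked example the authors promise, and nothing in it comes from the manuscript: the two-unit network, its weights, the inputs, and the names `forward` and `patch_effect` are all hypothetical, and the network is written in plain Python rather than with any real interpretability library.

```python
def relu(x):
    return max(0.0, x)


def forward(x, w1, w2, patch=None):
    """Run a tiny two-layer network.

    `patch` optionally maps a hidden-unit index to an activation value
    that overwrites whatever the network computed for that unit.
    """
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    output = sum(w * h for w, h in zip(w2, hidden))
    return hidden, output


# Hypothetical weights and inputs chosen so the arithmetic is easy to follow.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [2.0, -1.0]
clean_x = [1.0, 0.0]
corrupt_x = [0.0, 1.0]

clean_hidden, clean_out = forward(clean_x, W1, W2)
_, corrupt_out = forward(corrupt_x, W1, W2)


def patch_effect(unit):
    """Fraction of the clean-vs-corrupt output gap restored by copying
    one hidden unit's clean activation into the corrupted run."""
    _, patched_out = forward(
        corrupt_x, W1, W2, patch={unit: clean_hidden[unit]}
    )
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


for unit in range(2):
    print(unit, round(patch_effect(unit), 3))
```

A patching result of this kind says how much one internal component matters for one behavioral difference. Whether that component should be read as carrying a belief, a desire, or neither is the further question the referee and authors are disputing.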
Circularity Check
No significant circularity
Full rationale
The manuscript is a purely conceptual philosophical proposal that adopts the radical interpretation tradition to supply criteria for AI belief/desire attribution and argues for holism in attributions. No equations, parameters, fitted values, or formal derivations appear in the provided text. The central lesson about non-piecemeal attribution follows directly from the adopted philosophical stance rather than from any technical derivation that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way within the excerpt. The work is therefore self-contained against external benchmarks and receives the default non-finding.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: AI systems can be fruitfully interpreted as having beliefs, desires, and other propositional attitudes similar to those of humans.
- Domain assumption: Mechanistic interpretability tools can measure the internal states relevant to attribution.