Radical AI Interpretability

Benjamin A. Levinstein; Daniel A. Herrmann

arxiv: 2606.26523 · v1 · pith:NPP3XSHInew · submitted 2026-06-25 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-26 05:24 UTC · model grok-4.3

keywords: radical interpretation · mechanistic interpretability · AI agents · belief attribution · holistic constraints · AI safety · deception detection

The pith

Interpreting AI systems requires solving holistically for beliefs, desires, and meanings from computational facts rather than attributing them piecemeal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies radical interpretation to AI systems, seeking to determine beliefs, desires, and meanings directly from computational facts about the model. It supplies criteria for both representationalist and interpretationist approaches, each linked to tests that existing mechanistic interpretability methods can perform. The central lesson is that attributions cannot be made piecemeal because beliefs, desires, and the propositional structure they presuppose are jointly constrained. This interdependence becomes especially relevant for AI that may employ different concepts from the interpreter, yet it also supplies mutual leverage for narrowing down possible attributions. The framework matters for safety because reliable detection of deception or understanding of goals depends on avoiding distortions that arise when one attitude is fixed while others are measured.

Core claim

Given computational facts about an AI system, radical interpretation solves for its beliefs, desires, and meanings by treating the system as a rational agent. Criteria are proposed for representationalist and interpretationist approaches and tied to tests current mechanistic interpretability tools can carry out. The key claim is that these attributions cannot be made piecemeal: beliefs, desires, and propositional structure are jointly constrained, so a method that fixes one while measuring the others inherits whatever distortions that introduces. The holism is pressing when the system does not share the interpreter's concepts, but it also provides leverage: the system's attitudes constrain its propositional structure, that structure constrains which attitudes can be attributed, and mechanistic interpretability can help measure both.

What carries the argument

Radical interpretation applied to AI, which solves for an agent's beliefs and desires from computational and behavioral facts under rationality assumptions while enforcing joint constraints across attitudes and propositional structure.

If this is right

Interpretability methods must be evaluated against criteria that check the entire system of attitudes together rather than isolating individual beliefs or desires.
Any approach that measures one attitude while holding others fixed will carry over distortions from the fixed components into the measured ones.
A system's attitudes and its propositional structure mutually constrain each other, allowing mechanistic measurements of either to narrow the possibilities for the other.
When an AI may not share the interpreter's concepts, the joint constraints still limit which overall assignments remain consistent with the observed computational facts.
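
The distortion described above can be made concrete with a toy model. The sketch below is not from the paper: it assumes a hypothetical agent that accepts or declines bets with softmax choice, so the log-odds of accepting equal the bet's expected utility, with a single credence `p` and a single utility scale `k`. All names (`Bet`, `choice_log_odds`, `piecemeal_belief`, `joint_belief_and_desire`) are illustrative. Fixing `k` and reading `p` off one observation gives a biased credence, while solving for both from two differently structured bets recovers the values used to generate the data.

```python
from dataclasses import dataclass


@dataclass
class Bet:
    win: float   # payoff if the proposition is true
    loss: float  # amount lost if the proposition is false


def choice_log_odds(p: float, k: float, bet: Bet) -> float:
    """Log-odds of accepting `bet` for a softmax chooser (assumed model)
    with credence p and utility scale k."""
    return k * (p * bet.win - (1.0 - p) * bet.loss)


def piecemeal_belief(z: float, bet: Bet, assumed_k: float = 1.0) -> float:
    """Infer the credence from one observation while holding the desire
    parameter fixed at `assumed_k`."""
    return (z / assumed_k + bet.loss) / (bet.win + bet.loss)


def joint_belief_and_desire(z1: float, bet1: Bet, z2: float, bet2: Bet):
    """Solve z_i = a*(win_i + loss_i) - b*loss_i for a = k*p and b = k,
    then return (p, k) with p = a / b."""
    s1 = bet1.win + bet1.loss
    s2 = bet2.win + bet2.loss
    det = bet1.loss * s2 - bet2.loss * s1
    if abs(det) < 1e-12:
        raise ValueError("bets have the same win/loss ratio; p and k are not identified")
    a = (bet1.loss * z2 - bet2.loss * z1) / det
    b = (s1 * z2 - s2 * z1) / det
    if abs(b) < 1e-12:
        raise ValueError("utility scale is zero; credence is not identified")
    return a / b, b


# Simulated "observations" from an agent whose true parameters we chose.
true_p, true_k = 0.6, 2.0
even_bet, long_bet = Bet(win=1.0, loss=1.0), Bet(win=3.0, loss=1.0)
z_even = choice_log_odds(true_p, true_k, even_bet)
z_long = choice_log_odds(true_p, true_k, long_bet)

# Piecemeal: the wrong fixed k yields different credences for the same proposition.
print(piecemeal_belief(z_even, even_bet), piecemeal_belief(z_long, long_bet))
# Joint: both parameters are solved for together.
print(joint_belief_and_desire(z_even, even_bet, z_long, long_bet))
```

In this toy setting the two piecemeal estimates disagree with each other, which is itself a signal that the fixed desire parameter is wrong. Real systems have no guarantee of such a clean parametric form, so this is only an illustration of the logic, not a proposed test.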

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The proposed criteria could be used to design new tests that check whether existing tools produce attributions consistent across multiple attitudes at once.
The framework suggests prioritizing development of interpretability techniques that measure consistency between attributed beliefs, desires, and structure rather than accuracy on isolated tasks.
Safety evaluations might incorporate checks for whether a system's behavior remains stable under small changes to one attributed attitude while holding the holistic constraints fixed.
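
One simple instance of a cross-attribution consistency test is a probabilistic coherence check on probe readouts. The sketch below is an editorial illustration, not a criterion taken from the paper: it assumes a hypothetical probe that returns a credence per sentence, and it flags cases where a sentence and its negation do not sum to roughly one, or where a conjunction is rated more probable than one of its conjuncts. The function name, the tolerance, and the example readouts are all made up for illustration.

```python
from typing import Dict, List, Tuple


def coherence_violations(
    credence: Dict[str, float],
    negations: List[Tuple[str, str]],
    conjunctions: List[Tuple[str, str, str]],
    tol: float = 0.05,
) -> List[str]:
    """Return descriptions of probabilistic-coherence violations.

    negations:    pairs (s, not_s) that should satisfy P(s) + P(not_s) = 1
    conjunctions: triples (a, b, a_and_b) that should satisfy
                  P(a_and_b) <= min(P(a), P(b))
    """
    problems = []
    for s, not_s in negations:
        total = credence[s] + credence[not_s]
        if abs(total - 1.0) > tol:
            problems.append(f"negation: P({s}) + P({not_s}) = {total:.2f}")
    for a, b, a_and_b in conjunctions:
        if credence[a_and_b] > min(credence[a], credence[b]) + tol:
            problems.append(f"conjunction: P({a_and_b}) exceeds a conjunct")
    return problems


# Hypothetical probe readouts, chosen to contain two violations.
probe = {"rain": 0.8, "not_rain": 0.5, "cold": 0.3, "rain_and_cold": 0.6}
report = coherence_violations(
    probe,
    negations=[("rain", "not_rain")],
    conjunctions=[("rain", "cold", "rain_and_cold")],
)
for line in report:
    print(line)
```

A check like this looks only at beliefs. Extending it to desires and propositional structure, as the framework would require, is left open here.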

Load-bearing premise

The philosophical tradition of radical interpretation can be applied directly to AI systems to determine their beliefs and desires from internal computational states, with mechanistic interpretability tools supplying the necessary measurements.

What would settle it

An experiment in which a piecemeal attribution produced by current interpretability tools accurately predicts an AI system's behavior in a novel scenario that a holistic joint-constraint check would predict differently.

Original abstract

We develop a framework for interpreting AI systems as agents, drawing on the philosophical tradition of radical interpretation and the tools of mechanistic interpretability. The core question is: given the computational facts about a system, how do we solve for its beliefs, desires, and meanings? This matters increasingly for safety. We want to be able to trust the systems we deploy, whether by understanding their goals or, more modestly, by reliably detecting deception. Interpretability researchers are building tools to read beliefs and desires off a model's internals, but there is no settled account of when such a tool has succeeded. This book supplies one. We propose criteria on both representationalist and interpretationist approaches, and tie each to tests current interpretability methods can carry out. A central lesson is that these attributions cannot be made piecemeal. Beliefs, desires, and the propositional structure they presuppose are jointly constrained, and a method that fixes one while measuring the others inherits whatever distortions that introduces. This holism becomes pressing for AI systems, which may not share the interpreter's concepts. However, it also provides leverage: a system's attitudes constrain its propositional structure, that structure constrains which attitudes can be attributed, and mechanistic interpretability can help us measure both.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A conceptual proposal applying radical interpretation to AI belief attribution that flags holism but stays at the level of framework without tests or examples.

The paper's main contribution is importing the radical interpretation approach from philosophy to supply criteria for attributing beliefs and desires to AI systems, then linking those criteria to existing mechanistic interpretability methods. It stresses that these attributions are holistic: beliefs, desires, and propositional structure have to be solved for together rather than piecemeal.

It does a clear job of spelling out why this holism matters for safety applications like deception detection, especially when the model's concepts may not line up with the interpreter's. The reminder that fixing one element while leaving the others free can import distortions is a useful point to keep in view.

The soft spot is that the work remains almost entirely at the proposal stage. No specific criteria are stated, no tests are described in any detail, and there are no examples drawn from actual models or current interpretability papers. The holism claim follows directly from the adopted philosophical stance rather than receiving independent support from the AI side. Without those steps, it is hard to judge whether the framework would change what practitioners do or simply restate an existing philosophical position in new language.

This is for readers already working at the intersection of AI safety and philosophy of mind. Someone looking for concrete guidance on running or evaluating interpretability experiments will not find it here. It is coherent on its own terms and raises a timely question, so it deserves referee time even if the authors will need to add substantial development to make the criteria operational.

Referee Report

1 major / 1 minor

Summary. The manuscript develops a conceptual framework for attributing beliefs, desires, and meanings to AI systems by integrating the philosophical tradition of radical interpretation with tools from mechanistic interpretability. It proposes criteria applicable to both representationalist and interpretationist approaches, links these criteria to tests performable by existing interpretability methods, and argues that such attributions must be holistic: beliefs, desires, and the propositional structures they presuppose are jointly constrained rather than attributable piecemeal. The framework is motivated by AI safety concerns, particularly the need to detect deception or understand goals from computational facts alone.

Significance. If the proposed criteria prove operationalizable, the work could bridge philosophy and AI interpretability by supplying a principled, non-piecemeal account of when an interpretability tool has succeeded in reading off internal states. The holism insight directly challenges current practices that fix one attitude while measuring others independently. No machine-checked proofs, reproducible code, or quantitative predictions are present, but the explicit tie between philosophical criteria and existing mechanistic tools is a strength that could guide future empirical work in AI safety.

major comments (1)

[Abstract] The central claim that attributions 'cannot be made piecemeal' because beliefs, desires, and propositional structure are 'jointly constrained' risks circularity. The criteria are said to be tied to tests that current methods can carry out, yet the holism argument presupposes the very joint constraint it seeks to enforce; without a concrete demonstration of how a specific test (e.g., a circuit-level measurement) would be revised under the framework, it is unclear whether the proposal adds non-circular constraints or merely restates the philosophical stance.

minor comments (1)

[Abstract] The abstract refers to 'this book' but the manuscript appears to be a paper; clarify the intended format and scope.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for highlighting the risk of circularity in the holism argument. We address this point directly below.

Referee: [Abstract] The central claim that attributions 'cannot be made piecemeal' because beliefs, desires, and propositional structure are 'jointly constrained' risks circularity. The criteria are said to be tied to tests that current methods can carry out, yet the holism argument presupposes the very joint constraint it seeks to enforce; without a concrete demonstration of how a specific test (e.g., a circuit-level measurement) would be revised under the framework, it is unclear whether the proposal adds non-circular constraints or merely restates the philosophical stance.

Authors: The holism follows from the radical-interpretation setup itself: given only computational facts, the interpreter must solve simultaneously for beliefs, desires, and the propositional structure that makes those attitudes intelligible, rather than assuming any one in advance. The proposed success criteria are not derived from this holism but are instead anchored in the independent requirements of existing interpretability methods (e.g., consistency under activation patching or circuit interventions). We agree that the abstract would benefit from an explicit illustration of how a concrete test would change under the framework; we will therefore revise the abstract for clarity and insert a short worked example in the main text showing how a circuit-level measurement is re-evaluated when all three elements are constrained jointly.

revision: partial

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a purely conceptual philosophical proposal that adopts the radical interpretation tradition to supply criteria for AI belief/desire attribution and argues for holism in attributions. No equations, parameters, fitted values, or formal derivations appear in the provided text. The central lesson about non-piecemeal attribution follows directly from the adopted philosophical stance rather than from any technical derivation that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked in a load-bearing way within the excerpt. The work is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on philosophical assumptions about agency and interpretation without new empirical or formal support in the abstract.

axioms (2)

domain assumption: AI systems can be fruitfully interpreted as having beliefs, desires, and propositional attitudes similar to those of humans.
The core question assumes this is possible from computational facts.
domain assumption: Mechanistic interpretability tools can measure the relevant internal states for attribution.
The framework relies on this to tie criteria to tests.

pith-pipeline@v0.9.1-grok · 5737 in / 1079 out tokens · 28998 ms · 2026-06-26T05:24:48.426507+00:00 · methodology

[1] [1]

Transportation Research Record: Journal of the Transportation Research Board , number=

Theoretical maximum capacity as benchmark for empty vehicle redistribution in personal rapid transit , author=. Transportation Research Record: Journal of the Transportation Research Board , number=. 2010 , publisher=

2010

[2] [2]

2026 , eprint=

Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers , author=. 2026 , eprint=

2026

[3] [3]

2026 , eprint=

Mechanistic Indicators of Understanding in Large Language Models , author=. 2026 , eprint=

2026

[4] [4]

Journal of AI, Robotics & Workplace Automation , volume=

AI agents: A new Solow paradox? , author=. Journal of AI, Robotics & Workplace Automation , volume=. 2026 , publisher=

2026

[5] [5]

Science , volume=

Toward universal steering and monitoring of AI models , author=. Science , volume=. 2026 , publisher=

2026

[6] [6]

Cognition , volume=

Connectionism and the problem of systematicity: Why Smolensky's solution doesn't work , author=. Cognition , volume=. 1990 , publisher=

1990

[7] [7]

Intention, Plans, and Practical Reason , year =

Michael Bratman , editor =. Intention, Plans, and Practical Reason , year =

[8] [8]

Transportation Research Record: Journal of the Transportation Research Board , number=

A Different Title to Test Repeated Authors , author=. Transportation Research Record: Journal of the Transportation Research Board , number=. 2011 , publisher=

2011

[9] [9]

Journal of Field Robotics , volume=

Autonomous driving in urban environments: Boss and the urban challenge , author=. Journal of Field Robotics , volume=. 2008 , publisher=

2008

[10] [10]

CVPR , pages=

Are we ready for autonomous driving? The KITTI vision benchmark suite , author=. CVPR , pages=. 2012 , organization=

2012

[11] [11]

Synthese , pages=

Radical interpretation , author=. Synthese , pages=

[12] [12]

Minds and Machines , volume=

Statistical learning theory and Occam’s razor: the core argument , author=. Minds and Machines , volume=. 2024 , publisher=

2024

[13] [13]

2012 , publisher=

Reliable reasoning: Induction and statistical learning theory , author=. 2012 , publisher=

2012

[14] [14]

2014 , publisher=

Understanding machine learning: From theory to algorithms , author=. 2014 , publisher=

2014

[15] [15]

What Does It Take to Catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring

Stress testing deliberative alignment for anti-scheming training , author=. arXiv preprint arXiv:2509.15541 , year=

work page arXiv

[16] [16]

1989 , publisher=

The intentional stance , author=. 1989 , publisher=

1989

[17] [17]

1998 , publisher=

Statistical learning theory , author=. 1998 , publisher=

1998

[18] [18]

Philosophy Compass , volume=

Reliability in machine learning , author=. Philosophy Compass , volume=

[19] [19]

Nature Machine Intelligence , volume=

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead , author=. Nature Machine Intelligence , volume=

[20] [20]

arXiv preprint arXiv:2410.09087 , year=

Mechanistic? , author=. arXiv preprint arXiv:2410.09087 , year=

work page arXiv

[21] [21]

Philosophical Studies , year=

Still no lie detector for language models: probing empirical and conceptual roadblocks , author=. Philosophical Studies , year=

[22] [22]

Minds and Machines , volume=

Standards for belief representations in LLMs , author=. Minds and Machines , volume=

[23] [23]

arXiv preprint arXiv:2501.15740 , year=

Propositional interpretability in artificial intelligence , author=. arXiv preprint arXiv:2501.15740 , year=

work page arXiv

[24] [24]

The Internal State of an LLM Knows When It's Lying

The internal state of an LLM knows when it's lying , author=. arXiv preprint arXiv:2304.13734 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets , author=. arXiv preprint arXiv:2310.06824 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

2021 , publisher=

Making AI intelligible: Philosophical foundations , author=. 2021 , publisher=

2021

[27] [27]

Computational Linguistics , volume=

Do language models’ words refer? , author=. Computational Linguistics , volume=

[28] [28]

Understanding intermediate layers using linear classifier probes

Understanding intermediate layers using linear classifier probes , author=. arXiv preprint arXiv:1610.01644 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Discovering Latent Knowledge in Language Models Without Supervision

Discovering latent knowledge in language models without supervision , author=. arXiv preprint arXiv:2212.03827 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

2019 , publisher=

The conceptual foundations of quantum mechanics , author=. 2019 , publisher=

2019

[31] [31]

2018 , publisher=

Mathematical foundations of quantum mechanics , author=. 2018 , publisher=

2018

[32] [32]

Toy Models of Superposition

Toy models of superposition , author=. arXiv preprint arXiv:2209.10652 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

Transformer Circuits Thread , year=

Towards monosemanticity: Decomposing language models with dictionary learning , author=. Transformer Circuits Thread , year=

[34] [34]

The Journal of Philosophy , volume=

Imaging and conditionalization , author=. The Journal of Philosophy , volume=

[35] [35]

Dialectica , volume=

Radical interpretation , author=. Dialectica , volume=

[36] Stalnaker, Robert C. 1984.

[37] Attention is all you need. NeurIPS.

[38] Metaphors we live by. 1980.

[39] Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. NeurIPS.

[40] What does ChatGPT want? An interpretationist guide.

[41] Marks, Sam and Lindsey, Jack and Olah, Christopher. 2026.

[42] Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382.

[43] In-context learning and induction heads. Transformer Circuits Thread.

[44] Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

[45] How to catch an AI liar. arXiv preprint arXiv:2309.15840.

[46] Locating and editing factual associations in GPT. Advances in neural information processing systems.

[47] A new basis for decision theory. Theory and Decision, 1985.

[48] Truth and Probability. History of Economic Thought Chapters, 1926.

[49] The foundations of statistics. 1972.

[50] A nonpragmatic vindication of probabilism. Philosophy of Science, 1998.

[51] De Finetti, Bruno. La pr

[52] Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching. 2023.

[53] Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. 2023.

[54] Accuracy and the Laws of Credence. 2016.

[55] Reasons without persons: Rationality, identity, and time. 2015.

[56] Clever bookies and coherent beliefs. The Philosophical Review, 1991.

[57] Truth Evaluability in Radical Interpretation Theory.

[58] Donald Davidson. Inquiries into Truth and Interpretation. 1984.

[59] The logic of decision. 1990.

[60] A simultaneous axiomatization of utility and subjective probability. Philosophy of Science, 1967.

[61] John von Neumann and Oskar Morgenstern. 1944.

[62] Donald Davidson: Meaning, truth, language, and reality. 2005.

[63] Measuring Faithfulness in Chain-of-Thought Reasoning. arXiv preprint arXiv:2307.13702.

[64] Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv preprint arXiv:2507.11473.

[65] The linear representation hypothesis and the geometry of large language models. Proceedings of the 41st International Conference on Machine Learning.

[66] Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.

[67] Emergent linear representations in world models of self-supervised sequence models. Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP.

[68] Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. ICLR.

[69] Provable limitations of acquiring meaning from ungrounded form: What will future language models understand? Transactions of the Association for Computational Linguistics, 2021.

[70] On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2021.

[71] Can language models encode perceptual structure without grounding? A case study in color. Proceedings of the 25th Conference on Computational Natural Language Learning.

[72] Mapping language models to grounded conceptual spaces. International Conference on Learning Representations.

[73] Stuart Russell. 2019.

[74] The basic AI drives. Artificial intelligence safety and security, 2018.

[75] Nick Bostrom. 2014.

[76] Does ChatGPT have a mind? arXiv preprint arXiv:2407.11015.

[77] Mental events. Contemporary Materialism, 2002.

[78] Connectionism and cognitive architecture: A critical analysis. Cognition, 1988.

[79] Why Philosophers Should Care about Computational Complexity. Computability: Turing, G. 2013.

[80] Nelson Goodman. 1955.