Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?
5 Pith papers cite this work.
citing papers by year: 2026 (5)
representative citing papers
Different scoring mechanisms cause encoder-based authorship attribution models to consolidate authorship signals at different layers, as shown by causal interventions and gradient analysis.
citing papers explorer
- Observable Patterns Are Not Explanations: A Causal-Geometric Analysis of Latent Reasoning Models
  Evaluation of two latent reasoning models against controls shows observable latent patterns appear without the proposed mechanisms, have graded causal effects on behavior, and concentrate in structured low-rank directions, arguing that patterns are insufficient evidence for reasoning.
- Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification
  Chart information is encoded but not routed to predictions in VLMs for claim verification, unlike tables, revealed by layer-wise probing and attention analysis on three models.
- Validating Causal Abstraction Metrics on Simulated Complex Systems
  Authors create a benchmark across discrete/continuous and static/dynamical systems and introduce the Causal Abstraction Error (CAE) metric that reliably distinguishes valid from invalid causal abstractions when it includes faithfulness testing.
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.