arXiv preprint arXiv:2511.13653

Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V Govande, Bowen Baker, Dan Mossing · 2025 · arXiv 2511.13653

9 Pith papers cite this work.

representative citing papers

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

cs.LG · 2026-05-21 · unverdicted · novelty 8.0

Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

The paper introduces compositional interpretability as a category-theoretic framework that casts mechanistic explanations as commuting syntactic-semantic mappings optimized under faithfulness and complexity constraints derived from minimum description length.

Interpretability Can Be Actionable

cs.LG · 2026-05-11 · conditional · novelty 6.0

Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.

Bilinear autoencoders find interpretable manifolds

cs.LG · 2026-05-09 · unverdicted · novelty 6.0

Bilinear autoencoders decompose neural activations into low-rank quadratic forms to discover interpretable multi-dimensional manifolds, improving reconstruction in language models and challenging linear representation assumptions.

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

cs.CL · 2026-04-02 · unverdicted · novelty 6.0

MoE expert neurons show lower polysemanticity than dense FFN neurons, with the gap widening as routing becomes sparser, and experts specialize in fine-grained tasks such as specific linguistic operations, supporting expert-level interpretability.

NeuroCogMap Reveals Cognitive Organization of Large Language Models

q-bio.NC · 2026-07-01 · unverdicted · novelty 5.0

NeuroCogMap maps LLM internal representations into stable functional parcels tied to cognitive functions, failure modes, and human cortical activity during language tasks.

Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models

cs.CL · 2026-01-20 · unverdicted · novelty 5.0

The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.

Internal Deployment in the AI Act

cs.CY · 2025-12-05 · unverdicted · novelty 4.0

Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.
