LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
years
2026 7representative citing papers
MINE uses mechanistic interpretability on language-aligned image representations to generate per-voxel feature descriptions, validated via image generation and counterfactual edits that causally shift brain activation.
Recurrent-depth transformers achieve systematic generalization and depth extrapolation on implicit reasoning tasks through iterative layer reuse, a three-stage grokking process, and inference-time scaling, while vanilla transformers fail.
Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.
Label noise hurts fine-tuning performance most while grammatical and typographical noise sometimes act as mild regularizers, with changes concentrated in task-specific layers.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
citing papers explorer
-
How Language Models Process Negation
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
-
Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex
MINE uses mechanistic interpretability on language-aligned image representations to generate per-voxel feature descriptions, validated via image generation and counterfactual edits that causally shift brain activation.
-
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
Recurrent-depth transformers achieve systematic generalization and depth extrapolation on implicit reasoning tasks through iterative layer reuse, a three-stage grokking process, and inference-time scaling, while vanilla transformers fail.
-
Tracing Relational Knowledge Recall in Large Language Models
Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.
-
Analyzing the Effect of Noise in LLM Fine-tuning
Label noise hurts fine-tuning performance most while grammatical and typographical noise sometimes act as mild regularizers, with changes concentrated in task-specific layers.
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
- Language-Switching Triggers Take a Latent Detour Through Language Models