AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.
hub
A practical review of mechanistic interpretability for transformer-based language models.arXiv preprint arXiv:2407.02646
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 4polarities
background 4representative citing papers
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-precision and perturbed networks.
Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.
Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding a 82M-parameter decoder-only LM that trains stably but trails a 103M dense baseline in perplexity.
MCircKE maps causal circuits for specific reasoning tasks in LLMs and surgically updates parameters within those circuits to enable effective multi-hop knowledge editing, as shown on the MQuAKE-3K benchmark.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Different types of syntactic agreement recruit overlapping units within LLMs, indicating that agreement forms a meaningful functional category across English, Russian, Chinese, and structurally similar languages.
Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
Whisper-style speech encoders show semantic cross-lingual alignment beyond phonetics in final layers, and early-exiting boosts ASR performance on low-resource languages.
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.
citing papers explorer
-
The physics of AI weather models
AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.
-
Data-driven Circuit Discovery for Interpretability of Language Models
Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple more faithful circuits per dataset.
-
Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training
Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.
-
Improving Sparse Autoencoder with Dynamic Attention
A cross-attention SAE with sparsemax attention achieves lower reconstruction loss and higher-quality concepts than fixed-sparsity baselines by making activation counts data-dependent.
-
Task complexity shapes internal representations and robustness in neural networks
Harder classification tasks produce neural representations whose accuracy collapses under binarization and shuffling while easier tasks remain robust, defining task complexity via the performance gap between full-precision and perturbed networks.
-
FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry
Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.
-
Enabling Performant and Flexible Model-Internal Observability for LLM Inference
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
-
Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer
Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.
-
Inside-Out: Measuring Generalization in Vision Transformers Through Inner Workings
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
-
Automated Attention Pattern Discovery at Scale in Large Language Models
AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.
-
How do LLMs Compute Verbal Confidence
Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.
-
Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.
-
Graph Memory Transformer (GMT)
Graph Memory Transformer (GMT) swaps dense FFN sublayers for a graph of 128 centroids and a learned 128x128 transition matrix per block, yielding a 82M-parameter decoder-only LM that trains stably but trails a 103M dense baseline in perplexity.
-
Mechanistic Circuit-Based Knowledge Editing in Large Language Models
MCircKE maps causal circuits for specific reasoning tasks in LLMs and surgically updates parameters within those circuits to enable effective multi-hop knowledge editing, as shown on the MQuAKE-3K benchmark.
-
Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
-
Different types of syntactic agreement recruit the same units within large language models
Different types of syntactic agreement recruit overlapping units within LLMs, indicating that agreement forms a meaningful functional category across English, Russian, Chinese, and structurally similar languages.
-
Do Activation Verbalization Methods Convey Privileged Information?
Activation verbalization methods for LLMs largely reflect the verbalizer model's parametric knowledge rather than privileged information from the target model's activations.
-
Languages in Whisper-Style Speech Encoders Align Both Phonetically and Semantically
Whisper-style speech encoders show semantic cross-lingual alignment beyond phonetics in final layers, and early-exiting boosts ASR performance on low-resource languages.
-
Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
Human visual interestingness is linearly decodable from final-layer embeddings in Qwen3-VL-8B and becomes progressively more structured across vision and language layers without explicit supervision.