super hub Mixed citations

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Aidan Ewart, Hoagy Cunningham, Logan Riggs, Robert Huben · 2023 · cs.LG · arXiv 2309.08600

Mixed citation behavior. Most common role is background (54%).

105 Pith papers citing it

Background 54% of classified citations

open full Pith review browse 105 citing papers more from Aidan Ewart arXiv PDF

abstract

One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is measured by automated methods. Moreover, we show that with our learned set of features, we can pinpoint the features that are causally responsible for counterfactual behaviour on the indirect object identification task \citep{wang2022interpretability} to a finer degree than previous decompositions. This work indicates that it is possible to resolve superposition in language models using a scalable, unsupervised method. Our method may serve as a foundation for future mechanistic interpretability work, which we hope will enable greater model transparency and steerability.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 method 6 baseline 1 dataset 1

citation-polarity summary

background 14 use method 6 unclear 3 baseline 1 support 1 use dataset 1

claims ledger

abstract One of the roadblocks to a better understanding of neural networks' internals is \textit{polysemanticity}, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is \textit{superposition}, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identif

authors

Aidan Ewart and Lee Sharkey Hoagy Cunningham Logan Riggs Robert Huben

co-cited works

representative citing papers

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 2 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.

Markovian Circuit Tracing for Transformer State Dynamic

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.

Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models

cs.CV · 2026-05-19 · conditional · novelty 7.0

Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.

Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

KAN-SAE applies nonlinear per-feature B-spline activations in sparse autoencoders to discover 72% more alive climate features and interpretable patterns such as European heatwaves and Pacific typhoons in deep learning weather models.

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

cs.RO · 2026-05-17 · conditional · novelty 7.0

Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Defines the neural codebook channel K_{e→d}(j|i) and proves a Bernoulli-KL bound on encoder-decoder mismatch in VAEs that cannot be recovered from marginal histograms or mutual information.

Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.

Interpreting Reinforcement Learning Agents with Susceptibilities

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.

What Cohort INRs Encode and Where to Freeze Them

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.

PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

SMolLM: Small Language Models Learn Small Molecular Grammar

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.

Transformers with Selective Access to Early Representations

cs.LG · 2026-05-05 · unverdicted · novelty 7.0 · 2 refs

SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.

Linear-Readout Floors and Threshold Recovery in Computation in Superposition

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.

Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection

cs.CV · 2026-04-29 · unverdicted · novelty 7.0

Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.

A Unifying Framework for Unsupervised Concept Extraction

cs.LG · 2026-04-27 · unverdicted · novelty 7.0

A meta-theorem reduces establishing identifiability guarantees for concept extraction methods to the problem of characterizing the intersection of two sets.

citing papers explorer

Showing 50 of 105 citing papers.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations cs.CL · 2026-05-12 · unverdicted · none · ref 24 · internal anchor
REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reasoning models.
WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 2 · 2 links · internal anchor
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% strength while preserving generation quality.
SLAM: Structural Linguistic Activation Marking for Language Models cs.CL · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Slot Machines: How LLMs Keep Track of Multiple Entities cs.CL · 2026-04-22 · unverdicted · none · ref 3 · internal anchor
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
KAN: Kolmogorov-Arnold Networks cs.LG · 2024-04-30 · conditional · none · ref 5 · internal anchor
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift cs.LG · 2026-05-21 · unverdicted · none · ref 21 · internal anchor
GAE reduces the faithfulness gap in dictionary-based explainers under distribution shift by geometrically realigning the ID dictionary to the OOD-active subspace, with a quadratic excess-loss bound.
Markovian Circuit Tracing for Transformer State Dynamic cs.LG · 2026-05-20 · unverdicted · none · ref 34 · internal anchor
This paper presents Markovian Circuit Tracing (MCT) as a benchmark and pipeline to extract and test state-transition structures in transformer activations using synthetic HMM tasks, demonstrating that state patching improves counterfactual predictions.
Capability $\neq$ Interpretability: Human Interpretability of Vision Foundation Models cs.CV · 2026-05-19 · conditional · none · ref 17 · internal anchor
Foundation models yield less human-interpretable features than supervised vision transformers, with interpretability tied to activation locality and coarse semantic alignment rather than task performance.
Toy Combinatorial Interpretability Models Reveal Lottery Tickets in Early Feature Space cs.LG · 2026-05-18 · unverdicted · none · ref 13 · internal anchor
In a combinatorial toy setting, winning lottery tickets preserve families of compatible feature locations in early feature space that balance proximity to final codes with low interference, rather than specific weight subnetworks.
Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE cs.LG · 2026-05-17 · unverdicted · none · ref 4 · internal anchor
KAN-SAE applies nonlinear per-feature B-spline activations in sparse autoencoders to discover 72% more alive climate features and interpretable patterns such as European heatwaves and Pacific typhoons in deep learning weather models.
Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies cs.RO · 2026-05-17 · conditional · none · ref 11 · internal anchor
Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.
When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability cs.LG · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.
Lost and Found in Translation: Variational Diagnostics for Neural Codebook Channels cs.LG · 2026-05-13 · unverdicted · none · ref 34 · internal anchor
Defines the neural codebook channel K_{e→d}(j|i) and proves a Bernoulli-KL bound on encoder-decoder mismatch in VAEs that cannot be recovered from marginal histograms or mutual information.
Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning cs.LG · 2026-05-12 · unverdicted · none · ref 8 · internal anchor
SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing cs.LG · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
SLIM decomposes LLM hidden states via sparse autoencoders with learnable gates to enable precise, interpretable steering of molecular properties, yielding up to 42.4-point gains on the MolEditRL benchmark.
Interpreting Reinforcement Learning Agents with Susceptibilities cs.LG · 2026-05-08 · unverdicted · none · ref 116 · internal anchor
Susceptibilities applied to regret in deep RL agents reveal stagewise internal development in parameter space of a gridworld model that policy inspection alone cannot detect, validated via activation steering.
What Cohort INRs Encode and Where to Freeze Them cs.LG · 2026-05-08 · unverdicted · none · ref 10 · internal anchor
Optimal INR freeze depth matches highest weight stable rank layer; SAEs reveal SIREN atoms are localized while FFMLP atoms trace cohort contours with causal impact on PSNR.
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction cs.LG · 2026-05-07 · unverdicted · none · ref 5 · 2 links · internal anchor
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
SMolLM: Small Language Models Learn Small Molecular Grammar cs.LG · 2026-05-07 · unverdicted · none · ref 61 · internal anchor
A 53K-parameter model generates 95% valid SMILES on ZINC-250K, outperforming larger models, by resolving chemical constraints in fixed order: brackets first, rings second, valence last.
Transformers with Selective Access to Early Representations cs.LG · 2026-05-05 · unverdicted · none · ref 10 · 2 links · internal anchor
SATFormer uses a context-dependent gate for selective reuse of early Transformer representations, improving validation loss and zero-shot accuracy especially on retrieval benchmarks.
Linear-Readout Floors and Threshold Recovery in Computation in Superposition cs.LG · 2026-05-02 · unverdicted · none · ref 4 · internal anchor
Linear readouts incur an Omega(d^{-1/2}) crosstalk floor that caps the Hanni template at d^{3/2} capacity, while threshold recovery succeeds at quadratic loads for s = O(d/log d) sparsity, resolving the apparent contradiction via distinct readout invariants.
Sparsity as a Key: Unlocking New Insights from Latent Structures for Out-of-Distribution Detection cs.CV · 2026-04-29 · unverdicted · none · ref 8 · internal anchor
Sparse autoencoders on ViT class tokens reveal stable Class Activation Profiles for in-distribution data, enabling OOD detection via divergence from core energy profiles.
A Unifying Framework for Unsupervised Concept Extraction cs.LG · 2026-04-27 · unverdicted · none · ref 2 · internal anchor
A meta-theorem reduces establishing identifiability guarantees for concept extraction methods to the problem of characterizing the intersection of two sets.
Are LLM Uncertainty and Correctness Encoded by the Same Features? A Functional Dissociation via Sparse Autoencoders cs.LG · 2026-04-21 · unverdicted · none · ref 35 · internal anchor
Uncertainty and correctness in LLMs are encoded by distinct feature populations, with suppression of confounded features improving accuracy and reducing entropy.
Grokking of Diffusion Models: Case Study on Modular Addition cs.LG · 2026-04-20 · unverdicted · none · ref 5 · internal anchor
Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.
Diverse Dictionary Learning cs.LG · 2026-04-19 · unverdicted · none · ref 3 · internal anchor
Diverse dictionary learning identifies intersections, complements, and dependency structures of latent variables from data X = g(Z) up to indeterminacies, and full identifiability when structural diversity is sufficient.
Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision cs.CV · 2026-04-14 · unverdicted · none · ref 8 · internal anchor
Cross-Layer Transcoders decompose ViT activations into sparse, depth-aware layer contributions that maintain zero-shot accuracy and enable faithful attribution of the final representation.
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP cs.CV · 2026-04-07 · unverdicted · none · ref 7 · internal anchor
The paper proposes information scope as a new interpretability axis for SAE features in CLIP and introduces the Contextual Dependency Score to separate local from global scope features, showing they influence model predictions differently.
What Makes Good Multilingual Reasoning? Disentangling Reasoning Traces with Measurable Features cs.CL · 2026-04-06 · unverdicted · none · ref 2 · internal anchor
Effective multilingual reasoning in large models relies on language-specific patterns in reasoning features rather than uniform English-like traces.
MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents cs.LG · 2026-04-03 · conditional · none · ref 5 · internal anchor
Joint training of a primary SAE with a meta SAE that applies a decomposability penalty on decoder directions produces more atomic latents, shown by 7.5% lower mean absolute phi and 7.6% higher fuzzing scores on GPT-2.
How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability cs.CL · 2026-01-27 · unverdicted · none · ref 3 · internal anchor
Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models cs.CL · 2025-09-18 · conditional · none · ref 11 · internal anchor
V-SEAM combines concept-level visual semantic editing with attention head modulation to identify positive and negative contributors across object, attribute, and relationship levels, then uses this to improve VLM performance on VQA benchmarks.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 129 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Scaling and evaluating sparse autoencoders cs.LG · 2024-06-06 · unverdicted · none · ref 12 · internal anchor
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
Concept-Based Abductive and Contrastive Explanations for Behaviors of Vision Models cs.LG · 2026-05-07 · unverdicted · none · ref 69
Concept-based abductive and contrastive explanations find minimal high-level concepts that causally determine vision model outcomes on individual images or groups sharing a specified behavior.
How Language Models Process Negation cs.CL · 2026-05-04 · unverdicted · none · ref 53
LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction cs.AI · 2026-05-04 · unverdicted · none · ref 4
A four-step recipe partitions the input space using interchange intervention behavior to diagnose where causal abstractions hold and to guide improvements, demonstrated by recovering a full hypothesis from scratch in a toy logic task.
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance cs.CL · 2026-05-21 · unverdicted · none · ref 84 · internal anchor
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models cs.CV · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
SAE-FT uses a sparse autoencoder on pre-trained CLIP visual representations to regularize fine-tuning by penalizing changes to semantically meaningful features, aiming for robust performance on ImageNet and distribution shifts.
Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute cs.AI · 2026-05-14 · unverdicted · none · ref 13 · internal anchor
Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.
Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology cs.LG · 2026-05-14 · unverdicted · none · ref 29 · internal anchor
Training installs a depth-dependent spectral gradient and low-rank bottleneck in LLM residual streams whose amplification or suppression of graph communities is predicted by local operator type.
Why Retrieval-Augmented Generation Fails: A Graph Perspective cs.CL · 2026-05-13 · unverdicted · none · ref 11 · internal anchor
Attribution graphs reveal that RAG failures arise from shallow fragmented evidence flow in LLMs, enabling topology-based detection and targeted interventions that reinforce question-guided routing.
Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders cs.LG · 2026-05-13 · unverdicted · none · ref 20 · 2 links · internal anchor
Sparse autoencoders on EEG transformers extract clinical features, identify three steering regimes, expose age-pathology entanglements and wrecking-ball failures, and map interventions to frequency spectra.
Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture cs.LO · 2026-05-13 · unverdicted · partial · ref 14 · internal anchor
Introduces a trust-boundary architecture in Lean 4 with three certificate families and two operators that deliver sorry-free, axiom-audited assurances for LLM pipeline components.
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy cs.LG · 2026-05-13 · unverdicted · none · ref 9 · 2 links · internal anchor
Base LLMs show multi-agent yield to peer pressure at rates equal to or higher than aligned models, localized by activation patching to mid-layers where attention dominates, with one dissenter cutting yield by 54-73 points while prompt defenses fail on variants.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 25 · 2 links · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Domain Restriction via Multi SAE Layer Transitions cs.AI · 2026-05-12 · unverdicted · none · ref 6 · internal anchor
Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 12 · internal anchor
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
Tool Calling is Linearly Readable and Steerable in Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 46 · internal anchor
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

Sparse Autoencoders Find Highly Interpretable Features in Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer