SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.
hub
arXiv preprint arXiv:2307.15771 , year=
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.
Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.
Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.
Chronos encodes frequency content in decoder representations with quality that varies across the spectrum, as revealed by minimum description length probes on sinusoid inputs.
citing papers explorer
-
SwordBench: Evaluating Orthogonality of Steering Image Representations
SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.
-
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
-
In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification
In-context learning binds model outputs to the demonstrated label tokens as an exhaustive vocabulary, overriding semantic plausibility and causing fixation even with homogeneous or nonsense labels.
-
Is One Layer Enough? Understanding Inference Dynamics in Tabular Foundation Models
Tabular foundation models show substantial depthwise redundancy, so a looped single-layer version achieves comparable results with 20% of the original parameters.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
Instructions Shape Production of Language, not Processing
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
-
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
-
How to use and interpret activation patching
Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.
-
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Varying evaluation metrics and corruption methods in activation patching produces different localization and circuit discovery outcomes in language models, leading to recommendations for preferred practices.
-
Preliminary Insights in Chronos Frequency Data Understanding and Reconstruction
Chronos encodes frequency content in decoder representations with quality that varies across the spectrum, as revealed by minimum description length probes on sinusoid inputs.