SwordBench: Evaluating Orthogonality of Steering Image Representations

Dawid Pludowski; Hubert Baniecki; Przemyslaw Biecek; Vladimir Zaigrajew

arxiv: 2605.16372 · v1 · pith:R7Y7X4NRnew · submitted 2026-05-10 · 💻 cs.CV · cs.AI · cs.LG

Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords steering · concept activation vectors · orthogonality · vision models · benchmark · collateral damage · cross-concept robustness · interpretability

The pith

SwordBench shows linear SVMs steer image concepts more orthogonally than alternatives but still produce collateral damage on unrelated tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SwordBench as a benchmark for testing how well steering methods can remove or alter specific concepts in vision model representations without side effects. It defines cross-concept robustness to check if detection of one concept remains stable after inputs are made orthogonal to other concepts, and collateral damage to measure unintended drops in model performance on tasks that do not involve the steered concept. These metrics matter because steering is used for interpretability and safety corrections at inference time, yet prior work focused only on language models with ambiguous tasks. Experiments across multiple backbones and removal tasks reveal that linear support vector machines achieve stronger separability and orthogonality than sparse autoencoders or optimization baselines, yet they still incur collateral damage and no approach reaches perfect steering even in simple settings.

Core claim

SwordBench supplies a unified suite for evaluating steering of image representations across vision backbones and concept removal tasks. It introduces cross-concept robustness, which tracks the stability of concept detection after orthogonalization against alternative concepts, and collateral damage, which checks whether steering harms downstream task accuracy on inputs that lack the target bias. Results indicate linear support vector machines deliver superior separability and orthogonality but fail to reach zero collateral damage and often underperform sparse autoencoders on that dimension, while both standard baselines and optimization-based methods fall short of perfect steering in simpler regimes.
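The core operation behind these metrics is orthogonalizing a representation against a concept activation vector. This page does not spell out the paper's exact steering operator, so the following is only a minimal sketch that assumes the common projection form h' = h - <h, c> c for a unit-norm direction c. The helper names (`dot`, `normalize`, `project_out`) and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit length; a zero vector has no direction to remove."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("concept vector must be non-zero")
    return [a / norm for a in v]


def project_out(h, cav):
    """Remove the component of h along cav (assumed steering operator)."""
    c = normalize(cav)
    coeff = dot(h, c)
    return [a - coeff * b for a, b in zip(h, c)]


# Hypothetical toy representation and two concept directions.
h = [2.0, 1.0, 0.5]
cav_target = [1.0, 0.0, 0.0]   # concept we want to remove
cav_other = [0.6, 0.8, 0.0]    # a different concept, not orthogonal to the target

h_steered = project_out(h, cav_target)

score_target_after = dot(h_steered, cav_target)
score_other_before = dot(h, cav_other)
score_other_after = dot(h_steered, cav_other)
```

In this toy case the target score drops to zero, while the score along the other concept also shrinks (from 2.0 to 0.8) because the two directions overlap. That overlap is the kind of second-order effect the benchmark is designed to expose.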

What carries the argument

Cross-concept robustness and collateral damage metrics that quantify second-order effects of orthogonalization among concept activation vectors during pragmatic steering.
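One plausible reading of cross-concept robustness can be sketched as follows: score a target-concept probe on the original inputs, then again on inputs orthogonalized against each alternative concept, and compare. The paper's exact formula and aggregation are not given on this page, so the linear probe, the accuracy measure, the function names, and the toy data below are all assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(h, direction):
    """Remove the component of h along direction (normalized internally)."""
    norm = math.sqrt(dot(direction, direction))
    unit = [a / norm for a in direction]
    coeff = dot(h, unit)
    return [a - coeff * b for a, b in zip(h, unit)]


def accuracy(inputs, labels, probe):
    """Accuracy of a bias-free linear probe: predict 1 when <h, probe> > 0."""
    hits = sum(1 for h, y in zip(inputs, labels) if (dot(h, probe) > 0) == bool(y))
    return hits / len(inputs)


def cross_concept_robustness(inputs, labels, target_cav, other_cavs):
    """Return baseline accuracy and accuracy after removing each other concept."""
    base = accuracy(inputs, labels, target_cav)
    after = {
        name: accuracy([project_out(h, cav) for h in inputs], labels, target_cav)
        for name, cav in other_cavs.items()
    }
    return base, after


# Hypothetical 2-d representations with target-concept labels.
inputs = [[1.0, 1.5], [0.8, -0.1], [-0.9, 0.3], [-1.1, -1.6]]
labels = [1, 1, 0, 0]
target = [1.0, 0.0]
others = {"orthogonal": [0.0, 1.0], "entangled": [1.0, 1.0]}

base, after = cross_concept_robustness(inputs, labels, target, others)
mean_after = sum(after.values()) / len(after)
```

Here detection stays perfect when the removed concept is orthogonal to the target probe and falls to 0.5 when the removed concept is entangled with it. Averaging the per-concept values is only one possible way to aggregate.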

If this is right

Linear support vector machines provide stronger separability and orthogonality than sparse autoencoders or optimization baselines across tested vision models.
Even high-performing orthogonal methods still produce measurable collateral damage on downstream tasks for inputs without the steered concept.
No evaluated method reaches perfect steering performance in simpler concept-removal regimes.
Evaluation must include stability across alternative orthogonalized concepts rather than isolated separability alone.
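The collateral damage consequence above can be illustrated with a small sketch: measure downstream accuracy on inputs that lack the steered concept, apply steering anyway, and report the drop. The definition used here (accuracy before minus accuracy after), the linear downstream classifier, and the toy data are assumptions for illustration, not the paper's protocol.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(h, direction):
    """Remove the component of h along direction (normalized internally)."""
    norm = math.sqrt(dot(direction, direction))
    unit = [a / norm for a in direction]
    coeff = dot(h, unit)
    return [a - coeff * b for a, b in zip(h, unit)]


def accuracy(inputs, labels, weights):
    """Accuracy of a bias-free linear classifier: predict 1 when <h, w> > 0."""
    hits = sum(1 for h, y in zip(inputs, labels) if (dot(h, weights) > 0) == bool(y))
    return hits / len(inputs)


def collateral_damage(clean_inputs, task_labels, task_weights, bias_cav):
    """Accuracy drop on clean inputs caused by removing the bias direction."""
    before = accuracy(clean_inputs, task_labels, task_weights)
    steered = [project_out(h, bias_cav) for h in clean_inputs]
    after = accuracy(steered, task_labels, task_weights)
    return before - after


# Hypothetical clean inputs; the task classifier partly relies on the
# same direction that the bias concept occupies.
clean_inputs = [[0.3, 0.1], [-0.2, 0.5], [-0.4, -0.1], [0.2, -0.6]]
task_labels = [1, 1, 0, 0]
task_weights = [1.0, 1.0]
bias_cav = [0.0, 1.0]

damage = collateral_damage(clean_inputs, task_labels, task_weights, bias_cav)
```

In this toy setup the damage is 0.5: removing the bias direction also deletes signal the downstream task depends on, even though none of the inputs were selected for carrying the bias.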

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers choosing steering techniques for safety applications should weight collateral damage more heavily than pure orthogonality scores.
The benchmark could be extended to test whether non-linear projections or hybrid methods reduce the observed collateral effects.
Similar second-order metrics might apply to steering in other modalities where concept vectors interact during inference.

Load-bearing premise

The proposed notions of cross-concept robustness and collateral damage correctly quantify the second-order effects of orthogonalization among concept activation vectors in realistic steering scenarios for vision models.

What would settle it

A steering method that achieves perfect orthogonality, zero collateral damage, and unchanged downstream accuracy on all SwordBench tasks would falsify the reported finding that even top methods leave residual damage.

Figures

Figures reproduced from arXiv: 2605.16372 by Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek, Vladimir Zaigrajew.

**Figure 1.** SWORDBENCH multi-dataset concept activation vectors evaluation. On the left, we report vector metrics for SigLIP image representations CAVs on CelebA (averaged over 39 concepts). Improved disentanglement (lower MS) correlates with higher robustness (CCR), with methods forming three clusters (•, ■, ▲) along the orthogonality and detection axes. On the right, we compute a PCA biplot on ISIC using all evaluat…

**Figure 2.** Sample efficiency analysis on CelebA (SigLIP).

**Figure 3.** Examples of images from the MetaShift dataset.

**Figure 4.** Example of Counteranimal dataset. An animal is presented in its common environment (left) or with an atypical background (right). Sampling strategy. We use the common split of Counteranimal dataset as a train and validation set in our experiments, with a random split between those in a ratio of 9:1. As a test set, we use all images from uncommon split. This approach leverages the strengthening of the backg…

**Figure 5.** Example of watermarking in ImageNet-W. Left: A clean image of a Tractor. Right: The same image infused with the watermark text church. The watermark overlays transparent text, which is clearly visible but does not obscure the underlying structural details of the object.

**Figure 6.** Example of corruptions in ImageNet-C. We visualize three sample corruptions: Fog, Frost, and Gaussian Noise. Each row displays the corruption across increasing severity levels.

**Figure 7.** Heatmap of corruptions × severity impact on accuracy drop for CLIP. Gaussian Noise visibly degrades performance from severity 3, whereas Shot Noise degrades it from severity 4. The largest drop in accuracy is observed at severity level 5, but, as shown in…

**Figure 8.** Critical difference analysis of collateral damage.

**Figure 9.** Sample efficiency on ImageNet-W (husky/cat Watermark).

**Figure 10.** Extended sample efficiency ablations. Complementing…

**Figure 11.** Sample efficiency ablations for ISIC. We report metrics as a function of training sample size for: ISIC with CLIP (top); ISIC with DINOv2 (middle), and ISIC with SigLIP (bottom).

**Figure 12.** Win-rate comparison on CelebA: linear CAVs vs. nonlinear probes.

**Figure 13.** Win-rate comparison on ISIC: linear CAVs vs. nonlinear probes.

**Figure 14.** PCA biplots of metric correlations. We project all evaluation metrics onto the first two principal components across all backbones for Waterbirds (left), CelebA (middle), and ImageNet-W (right). The arrows indicate the direction of each metric. Predictive metrics (AUC, F1, MAD) consistently overlap, while MS and CCR form distinct correlated directions.

**Figure 15.** Correlation between AUC and F1 on CelebA.

**Figure 16.** Correlation between AUC and F1 on ISIC. Across all backbones, we observe a strong correlation (p ≈ 0, R² > 0.6) between the F1 and AUC metrics. MAD shows a clear and strong correlation with both AUC and F1, which is why we exclude it from our results.

**Figure 17.** Heatmap comparison of MS vs CCR on CelebA (using linear SVM).

**Figure 18.** Critical difference analysis of backbone performance on ISIC.

**Figure 19.** Critical difference analysis of backbone performance on CelebA.

Original abstract

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SwordBench introduces a vision-specific benchmark and two new metrics for steering orthogonality, but the key empirical claims need more checks on metric robustness.

The one or two things your colleague should know about this paper are that it introduces SwordBench, a benchmark for steering image representations in vision models, and it proposes new evaluation notions called cross-concept robustness and collateral damage to assess second-order effects of orthogonalization. What is actually new is the shift to image-based tasks and these tailored metrics that build on but go past the language modeling setups cited in the abstract.

The paper does well in providing a unified suite across backbones and concept removal tasks. The specific finding that a linear support vector machine shows better separability and orthogonality but trails sparse autoencoders on collateral damage is presented clearly, along with the observation that standard methods fail to achieve perfect steering in simpler regimes. This work helps fill the gap in evaluation protocols for vision models, which is relevant for making steering interventions more reliable in practice. The promise of releasing source code on GitHub is a positive step for allowing others to reproduce and extend the results.

On the soft spots, the support for the claims is limited by the lack of information on datasets, statistical tests, error bars, and exact protocols. The collateral damage metric in particular may not robustly quantify the effects without additional task ablations or sensitivity analysis, as the ranking between methods could change depending on how inputs lacking the bias are sampled or which downstream task is used. This is a moderate concern rather than a fatal one, but it does mean the central empirical comparison needs more backing.

This paper is for researchers in AI interpretability and safety who are interested in standardized benchmarks for vision model interventions. Readers who want to evaluate or improve steering methods will get value from the new framework and the reported comparisons.

The paper deserves a serious referee because the benchmark and metrics address a genuine need in the field, even if revisions are needed for stronger validation. I would recommend engaging with this work through peer review, with a focus on clarifying the experimental setup and testing the robustness of the new metrics.

Referee Report

2 major / 1 minor

Summary. The paper introduces SwordBench, a benchmark for evaluating the orthogonality of steering image representations in vision models across multiple backbones and concept removal tasks. It proposes two new evaluation notions—cross-concept robustness, which measures stability of concept detection after orthogonalization against alternative concepts, and collateral damage, which quantifies unintended effects on downstream task performance for inputs lacking the target bias—to capture second-order effects of orthogonalization among concept activation vectors. Empirical comparisons show that linear SVMs achieve superior separability and orthogonality relative to sparse autoencoders yet incur higher collateral damage and fail to reach zero, while both standard baselines and optimization-based methods fail to achieve perfect steering in simpler regimes.

Significance. If the proposed metrics are shown to be robust, SwordBench would address a clear gap in standardized evaluation for representation steering in computer vision, where protocols have lagged behind language modeling. The empirical demonstration of trade-offs between orthogonality and collateral damage provides actionable guidance for interpretability and safety work. Releasing source code would further strengthen the contribution by enabling direct reproduction of the reported rankings.

major comments (2)

The central empirical claim that linear SVMs trail SAEs on collateral damage while outperforming on separability rests on the collateral damage metric correctly quantifying unintended downstream effects. However, the manuscript provides no sensitivity analysis, cross-task ablations, or details on how 'inputs lacking the bias' are sampled, leaving open the possibility that the observed ranking is an artifact of the specific downstream task or sampling procedure chosen.
No datasets, statistical tests, error bars, or exact experimental protocols are described for the reported comparisons (e.g., SVM vs. SAE collateral damage). This absence makes it impossible to assess the reliability or statistical significance of the finding that SVMs 'often trail' SAEs, which is load-bearing for the paper's main conclusion.

minor comments (1)

The abstract states that source code 'will be made available soon on GitHub.' A concrete repository link or commit hash should be provided to support review and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and robustness of our empirical results. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

Referee: The central empirical claim that linear SVMs trail SAEs on collateral damage while outperforming on separability rests on the collateral damage metric correctly quantifying unintended downstream effects. However, the manuscript provides no sensitivity analysis, cross-task ablations, or details on how 'inputs lacking the bias' are sampled, leaving open the possibility that the observed ranking is an artifact of the specific downstream task or sampling procedure chosen.

Authors: We agree that the current manuscript lacks sufficient detail on the sampling of inputs lacking the target bias and does not include sensitivity or cross-task analyses. In the revision we will add an explicit description of the sampling procedure (including selection criteria and dataset splits), perform sensitivity analyses by varying the proportion and selection method of unbiased inputs, and include cross-task ablations on at least two additional downstream tasks to verify that the SVM–SAE ranking on collateral damage is stable.

revision: yes

Referee: No datasets, statistical tests, error bars, or exact experimental protocols are described for the reported comparisons (e.g., SVM vs. SAE collateral damage). This absence makes it impossible to assess the reliability or statistical significance of the finding that SVMs 'often trail' SAEs, which is load-bearing for the paper's main conclusion.

Authors: We acknowledge that the experimental section is currently underspecified. The revised manuscript will contain a dedicated experimental protocol subsection that lists the exact datasets and splits for every task, the precise hyper-parameters and training procedures for SVM and SAE methods, error bars computed over at least five random seeds, and statistical significance tests (paired t-tests with Bonferroni correction) comparing collateral-damage scores. We will also make the full source code and evaluation scripts publicly available upon acceptance to enable direct reproduction.

revision: yes
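The statistical procedure named in this response (paired t-tests over seeds with a Bonferroni correction) can be sketched with the Python standard library. The per-seed scores and p-values below are hypothetical placeholders, not results from the paper. The sketch computes only the paired t statistic and the Bonferroni decision; turning the statistic into a p-value requires a t-distribution CDF from a statistics package, which is omitted here.

```python
import math
import statistics


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two matched samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    if sd == 0.0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is below alpha / number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# Hypothetical collateral-damage scores over five seeds (lower is better).
svm_damage = [0.10, 0.12, 0.11, 0.13, 0.14]
sae_damage = [0.08, 0.09, 0.10, 0.10, 0.11]

t_stat, dof = paired_t_statistic(svm_damage, sae_damage)

# Hypothetical p-values from three method comparisons.
decisions = bonferroni_reject([0.004, 0.03, 0.2], alpha=0.05)
```

With three comparisons the per-test threshold becomes 0.05 / 3, so only the first hypothetical p-value survives correction. Pairing by seed matters because both methods are evaluated on the same splits, which removes seed-level variance from the comparison.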

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with author-defined metrics

The paper introduces SwordBench as a new evaluation suite and defines cross-concept robustness and collateral damage as novel notions for measuring second-order effects of orthogonalization. These are presented as proposals rather than derived quantities, and the central findings (SVM superiority on separability/orthogonality but not on collateral damage) are direct empirical measurements on the benchmark. No equations, fitted parameters, or self-citations are shown to reduce the reported results to inputs by construction. The work is self-contained as a benchmark paper; results follow from applying the stated definitions to the chosen models and tasks without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard machine-learning assumptions about vector representations and linear separability.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors... cross-concept robustness... collateral damage

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: steering image representations of vision models across multiple backbones and concept removal tasks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page
[80]

ICCV , year=

Sigmoid loss for language image pre-training , author=. ICCV , year=

work page
