
arxiv: 2605.16372 · v1 · pith:R7Y7X4NRnew · submitted 2026-05-10 · 💻 cs.CV · cs.AI · cs.LG

SwordBench: Evaluating Orthogonality of Steering Image Representations

Pith reviewed 2026-05-20 22:34 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords steering · concept activation vectors · orthogonality · vision models · benchmark · collateral damage · cross-concept robustness · interpretability

The pith

SwordBench shows linear SVMs steer image concepts more orthogonally than alternatives but still produce collateral damage on unrelated tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SwordBench as a benchmark for testing how well steering methods can remove or alter specific concepts in vision model representations without side effects. It defines cross-concept robustness to check whether detection of one concept remains stable after inputs are made orthogonal to other concepts, and collateral damage to measure unintended drops in model performance on a downstream task for inputs that lack the steered bias. These metrics matter because steering is used for interpretability and safety corrections at inference time, yet existing evaluation protocols are limited to ambiguous language modeling tasks. Experiments across multiple backbones and removal tasks reveal that linear support vector machines achieve stronger separability and orthogonality than sparse autoencoders or optimization baselines, yet they still incur collateral damage, and no approach reaches perfect steering even in simple settings.
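As a rough illustration of what making inputs orthogonal to a concept can mean, the sketch below removes the component of each embedding along a single concept direction by linear projection. It is a generic projection under stated assumptions, not SwordBench's implementation; the function name `remove_concept` and the random toy data are hypothetical.

```python
import numpy as np

def remove_concept(embeddings: np.ndarray, cav: np.ndarray) -> np.ndarray:
    """Project each row of `embeddings` onto the hyperplane orthogonal to `cav`."""
    unit = cav / np.linalg.norm(cav)      # normalize the concept direction
    coeffs = embeddings @ unit            # per-row component along the concept
    return embeddings - np.outer(coeffs, unit)

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))      # toy stand-in for image embeddings (assumption)
cav = rng.normal(size=8)                  # toy stand-in for a concept activation vector

steered = remove_concept(embeddings, cav)
residual = steered @ (cav / np.linalg.norm(cav))
print(np.allclose(residual, 0.0))         # the steered rows have no component left along `cav`
```

After the projection, a linear detector aligned with `cav` can no longer separate the inputs, which is the first-order effect the benchmark's second-order metrics build on.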

Core claim

SwordBench supplies a unified suite for evaluating steering of image representations across vision backbones and concept removal tasks. It introduces cross-concept robustness, which tracks the stability of concept detection after orthogonalization against alternative concepts, and collateral damage, which checks whether steering harms downstream task accuracy on inputs that lack the target bias. Results indicate linear support vector machines deliver superior separability and orthogonality but fail to reach zero collateral damage and often underperform sparse autoencoders on that dimension, while both standard baselines and optimization-based methods fall short of perfect steering in simpler regimes.

What carries the argument

Cross-concept robustness and collateral damage metrics that quantify second-order effects of orthogonalization among concept activation vectors during pragmatic steering.
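One way to picture the cross-concept robustness idea is sketched below: orthogonalize the inputs against each alternative concept direction, then re-measure how well the target concept is still detected. The exact definition used in the paper is not given on this page, so the AUC-based score, the helper names, and the toy data are all assumptions.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def project_out(x, direction):
    """Remove the component of each row of `x` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return x - np.outer(x @ unit, unit)

def detection_after_orthogonalization(x, labels, target_cav, other_cavs):
    """Target-concept AUC on inputs orthogonalized against each alternative CAV."""
    return [auc(project_out(x, other) @ target_cav, labels) for other in other_cavs]

# Toy data (assumption): the target concept lives on axis 0.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
x = rng.normal(size=(200, 6))
x[:, 0] += 3.0 * labels
target_cav = np.eye(6)[0]
# One alternative CAV is orthogonal to the target, the other overlaps it heavily.
other_cavs = [np.eye(6)[1], np.eye(6)[0] + 0.2 * np.eye(6)[1]]

baseline = auc(x @ target_cav, labels)
after = detection_after_orthogonalization(x, labels, target_cav, other_cavs)
print(f"baseline AUC={baseline:.3f}, after={[round(a, 3) for a in after]}")
```

In this toy setup, removing the orthogonal alternative leaves detection untouched, while removing the overlapping one degrades it; a robust set of concept vectors would show little spread across alternatives.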

If this is right

  • Linear support vector machines provide stronger separability and orthogonality than sparse autoencoders or optimization baselines across tested vision models.
  • Even high-performing orthogonal methods still produce measurable collateral damage on downstream tasks for inputs without the steered concept.
  • No evaluated method reaches perfect steering performance in simpler concept-removal regimes.
  • Evaluation must include stability across alternative orthogonalized concepts rather than isolated separability alone.
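The orthogonality of a set of concept vectors can be inspected with pairwise cosine similarity, as in the sketch below. The summary statistic here (mean absolute off-diagonal similarity) is a hypothetical stand-in and should not be read as the paper's MS metric; the vectors are toy examples.

```python
import numpy as np

def pairwise_cosine(cavs: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of rows in `cavs`."""
    unit = cavs / np.linalg.norm(cavs, axis=1, keepdims=True)
    return unit @ unit.T

def mean_abs_off_diagonal(sim: np.ndarray) -> float:
    """Average absolute similarity over distinct pairs; 0 means mutually orthogonal."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return float(np.abs(sim[mask]).mean())

orthogonal_cavs = np.eye(3, 5)            # three mutually orthogonal directions in 5-D
entangled_cavs = orthogonal_cavs + 0.5    # a shared offset makes the directions overlap

orthogonal_score = mean_abs_off_diagonal(pairwise_cosine(orthogonal_cavs))
entangled_score = mean_abs_off_diagonal(pairwise_cosine(entangled_cavs))
print(orthogonal_score, entangled_score)
```

A low score only says the directions are geometrically separated; per the bullets above, it does not by itself guarantee stable detection or low collateral damage.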

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers choosing steering techniques for safety applications should weight collateral damage more heavily than pure orthogonality scores.
  • The benchmark could be extended to test whether non-linear projections or hybrid methods reduce the observed collateral effects.
  • Similar second-order metrics might apply to steering in other modalities where concept vectors interact during inference.

Load-bearing premise

The proposed notions of cross-concept robustness and collateral damage correctly quantify the second-order effects of orthogonalization among concept activation vectors in realistic steering scenarios for vision models.

What would settle it

A steering method that achieves perfect orthogonality, zero collateral damage, and unchanged downstream accuracy on all SwordBench tasks would falsify the reported finding that even top methods leave residual damage.
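A check of this kind could be sketched as follows: measure downstream accuracy on inputs that lack the bias, before and after steering, and report the difference. The linear head, the accuracy-drop definition, and the toy data are assumptions made for illustration, not the benchmark's protocol.

```python
import numpy as np

def project_out(x, direction):
    """Remove the component of each row of `x` along `direction`."""
    unit = direction / np.linalg.norm(direction)
    return x - np.outer(x @ unit, unit)

def accuracy(x, y, head):
    """Accuracy of a linear binary head (hypothetical stand-in for the downstream task)."""
    return float(np.mean((x @ head > 0).astype(int) == y))

def collateral_damage(x_unbiased, y, head, cav):
    """Accuracy lost on inputs lacking the bias once the concept direction is removed."""
    return accuracy(x_unbiased, y, head) - accuracy(project_out(x_unbiased, cav), y, head)

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
x = rng.normal(size=(400, 4))
x[:, 0] += 4.0 * (y - 0.5)                        # task signal lives on axis 0
head = np.array([1.0, 0.0, 0.0, 0.0])

clean_cav = np.array([0.0, 1.0, 0.0, 0.0])        # orthogonal to the task direction
entangled_cav = np.array([1.0, 0.5, 0.0, 0.0])    # overlaps the task direction

damage_clean = collateral_damage(x, y, head, clean_cav)
damage_entangled = collateral_damage(x, y, head, entangled_cav)
print(damage_clean, damage_entangled)
```

A method meeting the falsification criterion above would need the second kind of value to be zero across all tasks, not just for concept directions that happen to be orthogonal to the task.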

Figures

Figures reproduced from arXiv: 2605.16372 by Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek, Vladimir Zaigrajew.

Figure 1: SWORDBENCH multi-dataset concept activation vectors evaluation. On the left, we report vector metrics for SigLIP image representations CAVs on CelebA (averaged over 39 concepts). Improved disentanglement (lower MS) correlates with higher robustness (CCR), with methods forming three clusters (•, ■, ▲) along the orthogonality and detection axes. On the right, we compute a PCA biplot on ISIC using all evaluat…
Figure 2: Sample efficiency analysis on CelebA (SigLIP).
Figure 3: Examples of images from the MetaShift dataset.
Figure 4: Example of Counteranimal dataset. An animal is presented in its common environment (left) or with an atypical background (right).
Sampling strategy. We use the common split of Counteranimal dataset as a train and validation set in our experiments, with a random split between those in a ratio of 9:1. As a test set, we use all images from uncommon split. This approach leverages the strengthening of the backg…
Figure 5: Example of watermarking in ImageNet-W. Left: A clean image of a Tractor. Right: The same image infused with the watermark text church. The watermark overlays transparent text, which is clearly visible but does not obscure the underlying structural details of the object.
Figure 6: Example of corruptions in ImageNet-C. We visualize three sample corruptions: Fog, Frost, and Gaussian Noise. Each row displays the corruption across increasing severity levels.
[Heatmap residue: Severity Level (1 to 5) × Corruption Type (Brightness, Contrast, Defocus Blur, Elastic Transform, Fog, Frost, Gaussian Noise, Glass Blur, Impulse Noise, Jpeg Compression, Motion Blur, Pixelate, Shot Noise, Snow, Zoom Blur); cell values are truncated in the source.]
Figure 7: Heatmap of corruptions × severity impact on accuracy drop for CLIP. Gaussian Noise visibly degrades performance from severity 3, whereas Shot Noise degrades it from severity 4. The largest drop in accuracy is observed at severity level 5, but, as shown in …
Figure 8: Critical difference analysis of collateral damage.
Figure 9: Sample efficiency on ImageNet-W (husky/cat Watermark).
Figure 10: Extended sample efficiency ablations. Complementing …
Figure 11: Sample efficiency ablations for ISIC. We report metrics as a function of training sample size for ISIC with CLIP (top), ISIC with DINOv2 (middle), and ISIC with SigLIP (bottom).
Figure 12: Win-rate comparison on CelebA: linear CAVs vs. nonlinear probes.
Figure 13: Win-rate comparison on ISIC: linear CAVs vs. nonlinear probes.
Figure 14: PCA biplots of metric correlations. We project all evaluation metrics onto the first two principal components across all backbones for Waterbirds (left), CelebA (middle), and ImageNet-W (right). The arrows indicate the direction of each metric. Predictive metrics (AUC, F1, MAD) consistently overlap, while MS and CCR form distinct correlated directions.
Figure 15: Correlation between AUC and F1 on CelebA.
Figure 16: Correlation between AUC and F1 on ISIC. Across all backbones, we observe a strong correlation (p ≈ 0, R² > 0.6) between the F1 and AUC metrics. MAD shows a clear and strong correlation with both AUC and F1, which is why we exclude it from our results.
[Plot residue: heatmap panels titled "Pairwise Cosine Similarity" and "Cross-Concept Robustness (unaggreg…", with axis ticks running from 5 to 40.]
Figure 17: Heatmap comparison of MS vs CCR on CelebA (using linear SVM).
Figure 18: Critical difference analysis of backbone performance on ISIC.
Figure 19: Critical difference analysis of backbone performance on CelebA.
Original abstract

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SwordBench, a benchmark for evaluating the orthogonality of steering image representations in vision models across multiple backbones and concept removal tasks. It proposes two new evaluation notions to capture second-order effects of orthogonalization among concept activation vectors: cross-concept robustness, which measures stability of concept detection after orthogonalization against alternative concepts, and collateral damage, which quantifies unintended effects on downstream task performance for inputs lacking the target bias. Empirical comparisons show that linear SVMs achieve superior separability and orthogonality, yet they never reach zero collateral damage and often trail sparse autoencoders on that metric, while both standard baselines and optimization-based methods fail to achieve perfect steering in simpler regimes.

Significance. If the proposed metrics are shown to be robust, SwordBench would address a clear gap in standardized evaluation for representation steering in computer vision, where protocols have lagged behind language modeling. The empirical demonstration of trade-offs between orthogonality and collateral damage provides actionable guidance for interpretability and safety work. Releasing source code would further strengthen the contribution by enabling direct reproduction of the reported rankings.

major comments (2)
  1. The central empirical claim that linear SVMs trail SAEs on collateral damage while outperforming on separability rests on the collateral damage metric correctly quantifying unintended downstream effects. However, the manuscript provides no sensitivity analysis, cross-task ablations, or details on how 'inputs lacking the bias' are sampled, leaving open the possibility that the observed ranking is an artifact of the specific downstream task or sampling procedure chosen.
  2. No datasets, statistical tests, error bars, or exact experimental protocols are described for the reported comparisons (e.g., SVM vs. SAE collateral damage). This absence makes it impossible to assess the reliability or statistical significance of the finding that SVMs 'often trail' SAEs, which is load-bearing for the paper's main conclusion.
minor comments (1)
  1. The abstract states that source code 'will be made available soon on GitHub.' A concrete repository link or commit hash should be provided to support review and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for improving the clarity and robustness of our empirical results. We address each major comment below and will incorporate the suggested additions in the revised manuscript.

Point-by-point responses
  1. Referee: The central empirical claim that linear SVMs trail SAEs on collateral damage while outperforming on separability rests on the collateral damage metric correctly quantifying unintended downstream effects. However, the manuscript provides no sensitivity analysis, cross-task ablations, or details on how 'inputs lacking the bias' are sampled, leaving open the possibility that the observed ranking is an artifact of the specific downstream task or sampling procedure chosen.

    Authors: We agree that the current manuscript lacks sufficient detail on the sampling of inputs lacking the target bias and does not include sensitivity or cross-task analyses. In the revision we will add an explicit description of the sampling procedure (including selection criteria and dataset splits), perform sensitivity analyses by varying the proportion and selection method of unbiased inputs, and include cross-task ablations on at least two additional downstream tasks to verify that the SVM–SAE ranking on collateral damage is stable. revision: yes

  2. Referee: No datasets, statistical tests, error bars, or exact experimental protocols are described for the reported comparisons (e.g., SVM vs. SAE collateral damage). This absence makes it impossible to assess the reliability or statistical significance of the finding that SVMs 'often trail' SAEs, which is load-bearing for the paper's main conclusion.

    Authors: We acknowledge that the experimental section is currently underspecified. The revised manuscript will contain a dedicated experimental protocol subsection that lists the exact datasets and splits for every task, the precise hyper-parameters and training procedures for SVM and SAE methods, error bars computed over at least five random seeds, and statistical significance tests (paired t-tests with Bonferroni correction) comparing collateral-damage scores. We will also make the full source code and evaluation scripts publicly available upon acceptance to enable direct reproduction. revision: yes
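The statistical reporting promised in this response (error bars over seeds, paired t-tests with Bonferroni correction) might look roughly like the sketch below. The per-seed numbers are made up purely for illustration and are not results from the paper; the three-task layout is also an assumption. The only library call relied on beyond NumPy is `scipy.stats.ttest_rel`.

```python
import numpy as np
from scipy import stats

# Made-up collateral-damage scores: rows are tasks, columns are five random seeds.
svm = np.array([[0.051, 0.048, 0.055, 0.050, 0.053],
                [0.032, 0.030, 0.035, 0.031, 0.029],
                [0.070, 0.066, 0.072, 0.069, 0.071]])
sae = np.array([[0.040, 0.041, 0.043, 0.038, 0.042],
                [0.031, 0.029, 0.036, 0.030, 0.030],
                [0.052, 0.050, 0.057, 0.051, 0.055]])

# Error bars: mean and sample standard deviation over seeds, per task.
svm_mean, svm_std = svm.mean(axis=1), svm.std(axis=1, ddof=1)
sae_mean, sae_std = sae.mean(axis=1), sae.std(axis=1, ddof=1)

# Paired t-test per task (both methods share seeds), then Bonferroni over the tasks.
p_values = np.array([stats.ttest_rel(a, b).pvalue for a, b in zip(svm, sae)])
p_adjusted = np.minimum(p_values * len(p_values), 1.0)
significant = p_adjusted < 0.05

for task, (m, s, p, sig) in enumerate(zip(svm_mean - sae_mean, svm_std, p_adjusted, significant)):
    print(f"task {task}: mean gap={m:.4f}, SVM std={s:.4f}, adjusted p={p:.4f}, significant={sig}")
```

Bonferroni is conservative when many tasks are compared, so with only five seeds a real but small gap can fail to reach significance; that limitation would be worth stating alongside any reported ranking.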

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with author-defined metrics

full rationale

The paper introduces SwordBench as a new evaluation suite and defines cross-concept robustness and collateral damage as novel notions for measuring second-order effects of orthogonalization. These are presented as proposals rather than derived quantities, and the central findings (SVM superiority on separability/orthogonality but not on collateral damage) are direct empirical measurements on the benchmark. No equations, fitted parameters, or self-citations are shown to reduce the reported results to inputs by construction. The work is self-contained as a benchmark paper; results follow from applying the stated definitions to the chosen models and tasks without tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the work rests on standard machine-learning assumptions about vector representations and linear separability.

pith-pipeline@v0.9.0 · 5710 in / 1193 out tokens · 34254 ms · 2026-05-20T22:34:52.635899+00:00 · methodology


