From Attribution to Action: A Human-Centered Application of Activation Steering

Katharina Weitz; Maximilian Dreyer; Sebastian Lapuschkin; Tobias Labarta; Wojciech Samek

arXiv: 2604.11467 · v2 · pith:YJ4PD72Pnew · submitted 2026-04-13 · 💻 cs.AI · cs.HC · cs.LG

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

Pith reviewed 2026-05-10 14:55 UTC · model grok-4.3

classification 💻 cs.AI · cs.HC · cs.LG

keywords activation steering · explainable AI · model debugging · human-AI interaction · CLIP · vision models · sparse autoencoders · attribution methods

The pith

Activation steering paired with attribution lets practitioners test hypotheses about model behavior through direct interventions instead of passive inspection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an interactive workflow that combines sparse autoencoder attribution with activation steering for instance-level analysis in vision models such as CLIP. Expert interviews with eight practitioners performing debugging tasks show that steering shifts users from inspecting explanations to actively intervening and observing resulting model changes. Most participants built trust from those observed responses rather than from the initial attributions alone, while favoring systematic suppression of components and noting risks like unintended ripple effects across predictions.

Core claim

The paper claims that activation steering renders interpretability actionable by enabling intervention-based hypothesis testing, as demonstrated when eight experts used the workflow on CLIP to debug concept usage, grounded their trust primarily in model output changes, adopted suppression-dominated strategies, and identified practical limits including ripple effects and poor generalization of instance-level fixes.

What carries the argument

The interactive web-based workflow that combines SAE-based attribution with activation steering to support instance-level concept analysis and targeted interventions in vision models.
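To make the attribution-then-steering loop concrete, here is a minimal sketch on a single toy activation vector. It is not the authors' implementation or the tool's API: the weights, the linear readout standing in for a model logit, and the function names (`encode`, `attributions`, `steer`) are all assumptions chosen for illustration. Attribution is approximated as activation times gradient, and suppression scales one component's contribution to zero.

```python
# Toy sketch: SAE-style attribution and suppression steering on one activation.
# All weights and names are hypothetical, not taken from the paper or its tool.

def relu(v):
    return [max(0.0, a) for a in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def encode(x, enc_rows):
    """Sparse code z = relu(W_enc x); one row of W_enc per component."""
    return relu([dot(row, x) for row in enc_rows])

def attributions(z, dec_dirs, readout):
    """Activation-times-gradient score of each component for a linear readout."""
    return [z_i * dot(readout, d_i) for z_i, d_i in zip(z, dec_dirs)]

def steer(x, z, dec_dirs, index, alpha):
    """Rescale one component's contribution to x; alpha=0 suppresses it.

    Editing x directly (instead of swapping in the SAE reconstruction)
    leaves whatever the SAE fails to reconstruct untouched.
    """
    delta = (alpha - 1.0) * z[index]
    return [x_j + delta * d_j for x_j, d_j in zip(x, dec_dirs[index])]

# Hypothetical 3-d activation with two SAE components.
x = [2.0, 1.0, 0.5]
enc_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
dec_dirs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
readout = [0.5, -2.0, 2.0]  # stands in for the gradient of a downstream logit

z = encode(x, enc_rows)
scores = attributions(z, dec_dirs, readout)

# Pick the component with the largest absolute attribution and suppress it.
top = max(range(len(scores)), key=lambda i: abs(scores[i]))
logit_before = dot(readout, x)
x_steered = steer(x, z, dec_dirs, top, alpha=0.0)
logit_after = dot(readout, x_steered)

print(f"suppressed component {top}: logit {logit_before} -> {logit_after}")
```

With a purely linear readout, the logit change from suppressing a component equals the negative of its attribution score. In a real vision model the downstream computation is nonlinear, so the two can disagree, which is the kind of gap that intervention-based hypothesis testing is meant to expose.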

If this is right

Users can verify attributions by steering components and directly observing prediction changes.
Debugging workflows become dominated by targeted suppression of identified components.
Trust in the method rests more on empirical model responses than on the initial attribution quality.
Instance-level steering corrections may not transfer reliably to other inputs or models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The workflow could support iterative model refinement loops where users steer, observe, and repeat until behavior stabilizes.
Risks like ripple effects suggest the need for batch-level steering checks before deploying instance fixes.
Extending the approach beyond vision to language or multimodal models would require new attribution-steering interfaces.
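The batch-level check suggested above could look roughly like the following sketch. It assumes predictions are available as plain dictionaries mapping an input id to a label, and that every target id appears in the batch. The function name `ripple_report`, the ids, and the labels are made up for illustration and do not come from the paper.

```python
# Sketch of a batch-level ripple check; the data layout is an assumption.

def ripple_report(before, after, target_ids):
    """Compare predicted labels before and after a steering intervention.

    before / after: dicts mapping input id -> predicted label.
    target_ids: inputs the fix was meant to change (assumed to be in `before`).
    """
    targets = set(target_ids)
    flipped = {i for i in before if before[i] != after[i]}
    collateral = flipped - targets          # unintended prediction changes
    untouched = len(before) - len(targets)  # inputs that should stay stable
    rate = len(collateral) / untouched if untouched else 0.0
    return {
        "intended": sorted(flipped & targets),
        "collateral": sorted(collateral),
        "collateral_rate": rate,
    }

# Invented example: one targeted fix, one unintended flip.
before = {"img0": "ipod", "img1": "apple", "img2": "dog", "img3": "cat", "img4": "car"}
after = {"img0": "apple", "img1": "apple", "img2": "dog", "img3": "dog", "img4": "car"}

report = ripple_report(before, after, target_ids=["img0"])
print(report)
```

A nonzero collateral rate would flag an instance-level fix for closer inspection before it is applied more widely.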

Load-bearing premise

Insights drawn from eight experts performing specific debugging tasks on CLIP reflect how practitioners would generally reason about and apply activation steering in other settings.

What would settle it

A larger study with more diverse practitioners and models where most users continue to rely on explanation plausibility for trust rather than shifting to intervention-based testing.

Figures

Figures reproduced from arXiv: 2604.11467 by Katharina Weitz, Maximilian Dreyer, Sebastian Lapuschkin, Tobias Labarta, Wojciech Samek.

**Figure 1.** The four-step workflow from attribution to action: practitioners review component attributions, form causal hypotheses about …

**Figure 2.** SemanticLens workflow implementation. Users select inspection samples (1), review components ranked by attribution (2), …

**Figure 3.** A process diagram of the interview structure: Pre-questionnaire, two debugging tasks each with attribution-only phase (Phase 1) …

**Figure 4.** Two debugging tasks: Task 1 (typographic attack on CLIP ViT-B-32, where overlaid text causes misclassification) and Task 2 …

Original abstract

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper pairs SAE attribution with activation steering in a web tool and reports consistent patterns from interviews with eight experts using it for CLIP debugging tasks.

The paper takes attribution via sparse autoencoders and adds activation steering so users can intervene on specific components in vision models. They built a web tool for the full loop and ran semi-structured interviews with eight experts on debugging tasks with CLIP. All eight shifted to testing hypotheses by steering rather than just inspecting attributions, six grounded their trust in the actual model outputs after changes, and seven relied mainly on suppressing components. They also noted risks like ripple effects across the model and that instance-level fixes often fail to generalize.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces an interactive web-based workflow that integrates Sparse Autoencoder (SAE) attribution with activation steering for instance-level analysis of concept usage in vision models such as CLIP. The workflow is evaluated via semi-structured expert interviews (N=8) on debugging tasks, yielding the claims that steering shifts all participants (8/8) from inspection to intervention-based hypothesis testing, that most (6/8) ground trust in observed model outputs rather than explanation plausibility, that component suppression dominates strategies (7/8), and that users identify risks including ripple effects and limited generalization of instance-level corrections.

Significance. If the qualitative patterns are robust, the work provides timely human-centered evidence that activation steering can convert passive XAI attributions into actionable interventions. The explicit reporting of user strategies and risks (e.g., ripple effects) adds practical value often missing from purely technical steering papers. The implementation of a concrete tool and the focus on real debugging tasks strengthen the bridge from attribution to action, offering design implications for future interpretability systems.

major comments (2)

[Section 4] Section 4 (User Study / Methodology): The description of the semi-structured interview protocol, participant selection criteria, exact debugging tasks, interview guide, qualitative coding process, and any bias-mitigation steps (e.g., inter-rater reliability) is insufficiently detailed. Because the central claims consist of specific fractions (8/8, 6/8, 7/8) derived from these interviews, the absence of this information prevents verification that the reported patterns are not artifacts of task framing or analysis choices.
[Findings and Discussion] Findings and Discussion sections: The extrapolation from N=8 experts performing CLIP-specific debugging to general practitioner reasoning about activation steering is load-bearing for the paper's broader utility claim, yet the manuscript provides no additional validation (e.g., larger sample, different models, or quantitative measures) and only limited caveats regarding selection bias and task specificity. This weakens the link between the observed behaviors and the asserted shift toward actionable interpretability.

minor comments (2)

[Abstract and Section 3] Abstract and Section 3: SAE is introduced without an initial expansion or reference to its standard definition (Sparse Autoencoder), which may hinder readability for readers outside the immediate subfield.
The paper would benefit from a table summarizing participant demographics, task completion times, or strategy frequencies to make the N=8 results more transparent at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The feedback identifies two key areas for improvement: greater methodological transparency in the user study and stronger caveats around generalizability. We agree on both points and will revise the manuscript accordingly. Below we respond point-by-point to the major comments, indicating the changes we will make.

Referee: [Section 4] Section 4 (User Study / Methodology): The description of the semi-structured interview protocol, participant selection criteria, exact debugging tasks, interview guide, qualitative coding process, and any bias-mitigation steps (e.g., inter-rater reliability) is insufficiently detailed. Because the central claims consist of specific fractions (8/8, 6/8, 7/8) derived from these interviews, the absence of this information prevents verification that the reported patterns are not artifacts of task framing or analysis choices.

Authors: We agree that the current description of the methodology is insufficient for independent verification. In the revised manuscript we will substantially expand Section 4 to include: (1) the complete semi-structured interview protocol and the full interview guide with example questions; (2) explicit participant selection criteria, including recruitment channels, years of experience in ML interpretability, and prior exposure to vision-language models; (3) precise descriptions of the two debugging tasks, including the images, target concepts, and success criteria given to participants; (4) the qualitative analysis procedure, specifying the thematic coding approach, how the 8/8, 6/8, and 7/8 counts were derived, and any inter-rater reliability assessment (we will report Cohen’s kappa or describe the consensus process used); and (5) bias-mitigation steps such as pilot interviews, question ordering to avoid leading participants, and steps taken to reduce confirmation bias during coding. These additions will allow readers to evaluate whether the observed patterns could be artifacts of task framing or analysis choices.
revision: yes

Referee: [Findings and Discussion] Findings and Discussion sections: The extrapolation from N=8 experts performing CLIP-specific debugging to general practitioner reasoning about activation steering is load-bearing for the paper's broader utility claim, yet the manuscript provides no additional validation (e.g., larger sample, different models, or quantitative measures) and only limited caveats regarding selection bias and task specificity. This weakens the link between the observed behaviors and the asserted shift toward actionable interpretability.

Authors: We acknowledge that the manuscript’s language occasionally implies broader applicability than the N=8, CLIP-specific evidence strictly supports. While qualitative studies of this size are common in human-centered XAI for generating initial insights, we agree that stronger caveats and clearer scoping are needed. In the revision we will: (1) expand the Limitations subsection to explicitly address selection bias (participants were drawn from a convenience sample of interpretability researchers), task specificity (debugging tasks were limited to CLIP on natural-image classification), and the absence of quantitative or cross-model validation; (2) revise the Findings and Discussion sections to use more qualified language (e.g., “in this study, all participants…” rather than generalizing to “practitioners”); and (3) add a forward-looking paragraph outlining concrete next steps for larger-scale or quantitative follow-up studies. We maintain that the observed shift from inspection to intervention-based reasoning is a substantive finding for the studied setting and provides actionable design implications, but we will no longer present it as a general claim without further evidence.
revision: partial
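
The first response mentions Cohen's kappa as one possible inter-rater reliability measure. As a worked illustration only, the sketch below computes kappa for two coders labelling eight interview excerpts. The labels are invented for the example and are not data from the study; the function name `cohens_kappa` is likewise an assumption.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one nominal label per item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)

# Invented labels: does an excerpt ground trust in model "output"
# or in explanation "plausibility"?
coder_1 = ["output", "output", "output", "plausibility",
           "output", "output", "plausibility", "output"]
coder_2 = ["output", "output", "plausibility", "plausibility",
           "output", "output", "plausibility", "output"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

Here the coders agree on 7 of 8 items (0.875) while chance agreement is 36/64 (0.5625), giving kappa = 0.3125 / 0.4375 = 5/7. With only eight items, a single disagreement moves the statistic considerably, which is one reason a described consensus process can be more informative at this sample size.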

Circularity Check

0 steps flagged

No circularity: empirical interview study with no derivations or self-referential predictions

The paper reports qualitative findings from semi-structured interviews (N=8) on a specific SAE+steering workflow for CLIP debugging tasks. It contains no mathematical equations, fitted parameters, predictions, or derivation chains. All central claims (e.g., 8/8 participants shifting to intervention-based testing, 7/8 using component suppression) are directly grounded in new interview data rather than reducing to prior author work, self-definitions, or fitted inputs. Self-citations, if present, are not load-bearing for the reported observations. The study is self-contained against external benchmarks as a human-centered empirical investigation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical human-computer interaction study with no mathematical derivations. No free parameters, axioms, or invented entities are introduced; the workflow builds on existing SAE and activation steering techniques evaluated through new qualitative data.

pith-pipeline@v0.9.0 · 5491 in / 1288 out tokens · 77246 ms · 2026-05-10T14:55:24.712764+00:00 · methodology
