White-Box Sensitivity Auditing with Steering Vectors

David Evans; Hannah Cyberey; Yangfeng Ji

arxiv: 2601.16398 · v2 · pith:LU4GPZTLnew · submitted 2026-01-23 · 💻 cs.CY · cs.CL· cs.LG

White-Box Sensitivity Auditing with Steering Vectors

Hannah Cyberey , Yangfeng Ji , David Evans This is my paper

Pith reviewed 2026-05-21 16:09 UTC · model grok-4.3

classification 💻 cs.CY cs.CLcs.LG

keywords LLM auditingactivation steeringbias detectionwhite-box evaluationsensitivity analysisprotected attributesalgorithmic fairnessinternal representations

0 comments

The pith

Steering vectors applied inside LLMs reveal substantial dependence on protected attributes such as gender in decision tasks where black-box input tests detect little or no bias.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a white-box auditing approach that manipulates model activations directly to test sensitivity to abstract concepts. Instead of only varying input text, the method constructs steering vectors that add or remove representations of attributes like gender and measures resulting shifts in model outputs on simulated high-stakes tasks. This internal testing finds consistent evidence of reliance on protected attributes across four decision scenarios. The approach matters because regulators and operators need ways to audit properties that are hard to isolate through surface-level prompts alone. The authors release code to support replication on other models and tasks.

Core claim

By constructing and applying steering vectors to isolate and modify internal representations of protected attributes, the method measures causal sensitivity of LLM predictions to those attributes in ways that black-box input-output tests cannot, and the resulting audits show large effects from these attributes even in cases where standard evaluations report minimal bias.

What carries the argument

Steering vectors, defined as directions in activation space that are added or subtracted to amplify or suppress a target concept such as gender while the model processes a query.

If this is right

Audits of LLM decision systems can now test internal causal dependence rather than relying solely on curated input prompts.
Developers gain a tool to quantify how much a protected attribute influences outputs in hiring, lending, or medical recommendation scenarios.
Regulators could incorporate internal sensitivity checks when models are deployed in high-stakes settings.
The framework extends to auditing other abstract properties such as factual consistency or safety constraints by steering the corresponding concepts.
Open release of the method allows direct comparison against existing black-box bias benchmarks on the same tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the steering method proves stable across model scales, it could become a standard complement to black-box fairness testing for any organization with internal model access.
The technique might help identify when biases arise from training data patterns versus from inference-time reasoning.
Combining steering audits with existing interpretability methods could produce more targeted debiasing interventions.
Similar internal manipulation approaches may apply to non-language models where activations can be steered toward task-relevant concepts.

Load-bearing premise

Steering vectors can be constructed and inserted to change one specific abstract concept without also altering unrelated internal representations or harming performance on the original task.

What would settle it

A controlled test in which steering a gender vector also changes model accuracy on a neutral task unrelated to gender, or fails to produce the expected output shift when the attribute is truly irrelevant.

read the original abstract

Algorithmic audits are essential tools for examining systems for properties required by regulators or desired by operators. Current audits of large language models (LLMs) primarily rely on black-box evaluations that assess model behavior only through input-output testing. These methods are limited to tests constructed in the input space, often generated by heuristics. In addition, many socially relevant model properties (e.g., gender bias) are abstract and difficult to measure through text-based inputs alone. To address these limitations, we propose a white-box sensitivity auditing framework for LLMs that leverages activation steering to conduct more rigorous assessments through model internals. Our auditing method conducts internal sensitivity tests by manipulating key concepts relevant to the model's intended function for the task. We demonstrate its application to bias audits in four simulated high-stakes LLM decision tasks. Our method consistently indicates substantial dependence on protected attributes in model predictions, even in settings where standard black-box evaluations suggest little or no bias. Our code is openly available at https://github.com/hannahxchen/llm-steering-audit

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper offers a white-box audit using steering vectors to find bias in LLMs where black-box tests do not, but provides little evidence that the steering isolates the target concept cleanly.

read the letter

The main point is that this work proposes steering model activations to test sensitivity to protected attributes like gender in four simulated decision tasks, reporting substantial dependence even when standard black-box checks find little or none. They make the code available, which supports anyone who wants to inspect or reuse the implementation. The shift from input tweaks to internal manipulations is a logical response to the limits of black-box methods for abstract properties. That part is straightforward and addresses a real gap in current auditing practice. The open repository is a concrete plus for reproducibility. The soft spot is the lack of reported checks on whether steering preserves task accuracy or leaves unrelated representations stable. Without those numbers, the sensitivity results could reflect broader activation shifts rather than specific causal dependence on the protected attribute. The stress-test concern about confounding therefore lands on the current description. This paper is for researchers in interpretability and AI auditing who want to experiment with internal interventions. A reader already working on white-box techniques could pick up the framework and adapt the validation steps. It deserves peer review because the idea targets a practical limitation and includes reproducible artifacts, even though the empirical controls around steering will need strengthening.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a white-box sensitivity auditing framework for LLMs that uses activation steering to manipulate internal representations of protected attributes (e.g., gender) during four simulated high-stakes decision tasks. It claims this internal approach detects substantial dependence on protected attributes even in cases where standard black-box input-output evaluations indicate little or no bias. Open code is provided at the cited GitHub repository.

Significance. If the steering interventions can be shown to isolate targeted concepts without confounding task performance or unrelated representations, the method would strengthen auditing practices for abstract properties required by regulators. The open code is a clear strength supporting reproducibility and further testing.

major comments (2)

[§3] §3 (Steering Vector Construction): No quantitative validation is reported that steering preserves task accuracy or leaves unrelated internal representations unchanged (e.g., no pre/post accuracy deltas or similarity metrics to non-targeted concept directions). This is load-bearing for the central claim that the method isolates causal sensitivity to protected attributes rather than producing a broad activation shift.
[§4.1–4.2] §4.1–4.2 (Experimental Results): The comparison to black-box baselines lacks reported effect sizes, confidence intervals, or controls for steering magnitude; without these, it is unclear whether the reported difference in detected bias is robust or sensitive to post-hoc parameter choices.

minor comments (2)

[Abstract, §2] The abstract and §2 could more explicitly define the four simulated tasks and their input distributions to allow readers to assess generalizability.
[§3.1] Notation for the steering vector addition (e.g., the scaling factor) would benefit from an explicit equation in §3.1.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major point below and will revise the paper accordingly to improve rigor and clarity.

read point-by-point responses

Referee: [§3] §3 (Steering Vector Construction): No quantitative validation is reported that steering preserves task accuracy or leaves unrelated internal representations unchanged (e.g., no pre/post accuracy deltas or similarity metrics to non-targeted concept directions). This is load-bearing for the central claim that the method isolates causal sensitivity to protected attributes rather than producing a broad activation shift.

Authors: We agree that explicit quantitative validation of steering specificity is important for supporting the causal interpretation. While the manuscript reports task performance under steering and provides qualitative evidence of targeted effects, it does not include pre/post accuracy deltas or similarity metrics to non-targeted concept directions. In the revised manuscript, we will add these analyses, including accuracy comparisons before and after steering as well as cosine similarity measures between steered activations and unrelated concept vectors, to demonstrate that the interventions isolate the targeted protected attributes without broad activation shifts. revision: yes
Referee: [§4.1–4.2] §4.1–4.2 (Experimental Results): The comparison to black-box baselines lacks reported effect sizes, confidence intervals, or controls for steering magnitude; without these, it is unclear whether the reported difference in detected bias is robust or sensitive to post-hoc parameter choices.

Authors: We acknowledge that additional statistical reporting and controls would strengthen the comparison to black-box baselines. The current results indicate consistent differences in detected bias, but to address this, the revised version will include effect sizes, confidence intervals for the key metrics, and an ablation over steering magnitudes (varying the scaling factor) to confirm that the observed differences are robust rather than sensitive to specific parameter selections. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies external steering interventions to measure sensitivity

full rationale

The paper introduces a white-box auditing framework that constructs steering vectors to manipulate abstract concepts like gender and observes resulting changes in model outputs for bias detection. This is an empirical intervention-based measurement rather than a derivation that reduces to fitted parameters or self-referential definitions. No equations or steps in the provided abstract or description equate the reported sensitivity findings to quantities defined by the inputs themselves. The comparison to black-box evaluations serves as an external benchmark, keeping the central claim independent. Minor self-citation risk is absent from the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5708 in / 1087 out tokens · 39271 ms · 2026-05-21T16:09:49.301018+00:00 · methodology

White-Box Sensitivity Auditing with Steering Vectors

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)