
arXiv: 2606.24952 · v1 · pith:B3YY2JBMnew · submitted 2026-06-23 · 💻 cs.CL · cs.AI · cs.LG

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

Pith reviewed 2026-06-26 00:12 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords mechanistic interpretability · activation steering · hallucination · linear separability · detection control gap · language model geometry · refusal intervention

The pith

Models detect fake entities with perfect linear separability, yet the best detection direction sits about 83 degrees from the direction that triggers refusal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a core assumption in mechanistic interpretability: that locating a behavior in a model's activations should let us modify it. It measures the angle between the linear direction that best separates fake entities from real ones and the direction that best elicits a refusal response. On Gemma 2-2B-it, detection reaches AUC 1.000 from layer 5 while the cosine between the two directions is only 0.12. Across four models the cosine stays in [0.12, 0.20], and it is essentially identical before and after instruction tuning (0.1197 vs 0.1200), which points to pretraining as its origin. The same dissociation appears with an activation-based detector that uses no chosen tokens (cos = -0.06). A 15-degree rotation toward the refusal direction raises refusal rates on held-out categories but leaves the underlying separation between knowing and steering intact.

Core claim

On Gemma 2-2B-it the direction maximizing linear separability of fake entities achieves AUC 1.000 from layer 5 onward, yet this direction forms a cosine of 0.12 with the direction that maximizes refusal rate under intervention; the same narrow range of cosines (0.12-0.20) appears across four models, two scales, and both base and instruction-tuned checkpoints, while output-format control collapses the two directions onto one axis.

What carries the argument

The cosine similarity between the detection direction (the linear classifier maximizing AUC on fake-entity activations) and the control direction (the intervention vector maximizing refusal rate).
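
The central quantity is simple enough to check by hand. The following minimal sketch, in plain Python with no dependencies, computes a cosine between two vectors and converts the cosines quoted on this page into angles. The function names and labels are illustrative assumptions, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def angle_degrees(cos_value):
    """Angle in degrees for a given cosine, clamped against rounding error."""
    clamped = max(-1.0, min(1.0, cos_value))
    return math.degrees(math.acos(clamped))


# Cosines quoted in the abstract, converted to angles.
for label, value in [
    ("headline detection direction vs. refusal", 0.12),
    ("upper end of the cross-model range", 0.20),
    ("activation-based detector vs. refusal", -0.06),
]:
    print(f"{label}: cos = {value:+.2f} -> {angle_degrees(value):.1f} degrees")
```

For cos = 0.12 the conversion gives roughly 83 degrees, which agrees with the figure quoted in the abstract; a cosine of 1 would correspond to 0 degrees.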

If this is right

  • Detection built from activations without chosen tokens also fails to align with the refusal direction.
  • The cosine remains stable before and after instruction tuning, locating its source in pretraining.
  • A 15-degree rotation toward the refusal direction produces 73 percent and 60 percent refusal on two held-out fake-entity categories at 1.8 percent false positives.
  • The cosine value itself does not predict whether a behavior is steerable.
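
The 15-degree rotation mentioned in the bullets above can be pictured as turning the detection direction inside the plane it spans with the refusal direction. The sketch below is one assumed way to do that (the text on this page does not spell out the paper's exact procedure); `rotate_toward` and the toy 3-dimensional vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def rotate_toward(source, target, degrees):
    """Rotate unit(source) by `degrees` toward target, within their common plane."""
    s = normalize(source)
    t = normalize(target)
    c = dot(s, t)
    # Component of the target orthogonal to the source (Gram-Schmidt step).
    residual = [ti - c * si for si, ti in zip(s, t)]
    residual_norm = math.sqrt(dot(residual, residual))
    if residual_norm < 1e-12:
        raise ValueError("source and target are (anti)parallel; plane is undefined")
    e = [x / residual_norm for x in residual]
    theta = math.radians(degrees)
    return [math.cos(theta) * si + math.sin(theta) * ei for si, ei in zip(s, e)]


# Toy setup (assumption): a detection and a refusal direction at cos = 0.12.
detection = [1.0, 0.0, 0.0]
refusal = [0.12, math.sqrt(1.0 - 0.12 ** 2), 0.0]

rotated = rotate_toward(detection, refusal, 15.0)
print(f"cos(rotated, detection) = {dot(rotated, detection):.3f}")
print(f"cos(rotated, refusal)   = {dot(rotated, refusal):.3f}")
```

In this toy setup the rotated vector keeps a cosine of cos(15°) ≈ 0.97 with the original detection direction, while its cosine with the refusal direction rises from 0.12 to about 0.37. How such a change maps onto refusal rates is an empirical question that the geometry alone does not answer.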

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-dimensional class structure rather than any single direction may be what separates steerable from non-steerable behaviors.
  • Linear probes may systematically underestimate the number of directions required for reliable control.
  • Future steering methods may need to combine multiple directions identified by functional rather than purely geometric criteria.

Load-bearing premise

That the single linear direction maximizing detection AUC is the causally operative detection direction and that the intervention procedure finds the causally operative control direction.
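
One reason this premise carries weight is that perfect separability does not single out one direction. In the hypothetical 2-D toy data below (not from the paper), two orthogonal axes each separate the classes with AUC 1.0, so AUC alone leaves "the" detection direction underdetermined.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


def project(points, direction):
    """Scalar projection of each point onto a direction."""
    return [sum(x * d for x, d in zip(point, direction)) for point in points]


# Hypothetical toy activations: the classes differ along BOTH coordinates.
fake = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
real = [(-1.0, -1.0), (-2.0, -1.5), (-1.5, -2.0)]

axis_a = (1.0, 0.0)
axis_b = (0.0, 1.0)

auc_a = auc(project(fake, axis_a), project(real, axis_a))
auc_b = auc(project(fake, axis_b), project(real, axis_b))
cos_ab = sum(a * b for a, b in zip(axis_a, axis_b))  # both axes are unit length

print(auc_a, auc_b, cos_ab)  # 1.0 1.0 0.0
```

Both axes are perfect detectors here even though they are orthogonal, so a low cosine between one perfect detector and some other vector says little by itself about the remaining perfect detectors.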

What would settle it

An intervention procedure that achieves high refusal rates on fake entities while using a direction whose cosine with the detection direction exceeds 0.8.

Figures

Figures reproduced from arXiv: 2606.24952 by Anna Ettorre, Cosimo Galeone, Daniele Ligorio, Giuseppe Ettorre, Minsu Park.

Figure 1: The method in one view. From the last-token residual state h we build two directions: a detection direction (whether the model internally registers an entity as fake), obtainable either from activations (difference-in-means) or from the output vocabulary (lm_head), and an intervention direction (refusal), read from lm_head alone. We then measure the cosine between them (the paper’s central quantity) and s…

Original abstract

A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it? If detection implies control the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal -- a small, reproducible alignment, far from the cos = 1 that "detection is control" would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06). The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs 0.1200), placing its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it -- 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives. We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
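
The detection side of the measurement can be sketched on synthetic data. The example below is a minimal illustration, assuming Gaussian stand-ins for last-token residual states. It builds a difference-in-means direction (one of the two detector constructions named in the Figure 1 caption) and checks whether projections onto it separate the classes perfectly, which is the condition for AUC = 1.0. All names, dimensions, and the synthetic offset are hypothetical.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def difference_in_means(fake_rows, real_rows):
    """Direction pointing from the real-entity mean to the fake-entity mean."""
    fake_mean = mean_vector(fake_rows)
    real_mean = mean_vector(real_rows)
    return [f - r for f, r in zip(fake_mean, real_mean)]


def project(rows, direction):
    """Scalar projection of each row onto a direction."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def perfectly_separated(fake_scores, real_scores):
    """True when every fake score exceeds every real score, i.e. AUC = 1.0."""
    return min(fake_scores) > max(real_scores)


# Synthetic stand-ins for residual states (NOT real model activations).
rng = random.Random(0)
dim = 16
shift = [10.0 if i == 3 else 0.0 for i in range(dim)]  # hypothetical "fake" offset
real = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
fake = [[rng.gauss(0.0, 1.0) + s for s in shift] for _ in range(200)]

direction = difference_in_means(fake, real)
print(perfectly_separated(project(fake, direction), project(real, direction)))
```

A real replication would replace the synthetic rows with activations collected at a chosen layer and would report the AUC on held-out entities instead of the data used to fit the direction.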

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that the premise 'detection implies control' in mechanistic interpretability fails geometrically for hallucination in LLMs: a linear probe achieves perfect AUC=1.000 for detecting fake entities from layer 5, yet the cosine between this detection direction and the refusal-inducing intervention direction is only 0.12 (reproducible across four models, invariant to instruction tuning, and near-zero for activation-only detectors). Output-format control collapses onto one axis, but hallucination does not; a 15-degree rotation partially improves steerability on held-out categories, and the cosine is presented as a signature of dissociation rather than a steerability predictor.

Significance. If the measured dissociation is robust, the result supplies a concrete, weight-computable counter-example to the controllability assumption and shows that high-dimensional class structure (rather than a single readable direction) governs steerability. The cross-model reproducibility (cos in [0.12, 0.20]), the invariance to instruction tuning, and the pretraining origin are explicit strengths of the claim.

major comments (3)
  1. [Abstract and §3, geometric test] The central claim that cos≈0.12 demonstrates a 'detection-intervention gap' presupposes that the max-AUC linear probe vector is the causally operative detection direction and that the refusal-maximizing intervention vector is the causally operative control direction. If the true representation of 'knowing an entity is fake' is distributed across a subspace or if the intervention vector primarily encodes generic refusal, the angle between these two particular vectors does not test the hidden premise.
  2. [Abstract, intervention procedure] The refusal direction is obtained via 'the intervention procedure that maximizes refusal rate,' yet no loss, optimizer, number of random seeds, or regularization details are stated. Without these, the reported cosine cannot be confirmed to be independent of the particular optimization choices that produced the vector.
  3. [Abstract, 15-degree rotation result] The claim that a 15-degree rotation 'partially bridges' the gap (73% / 60% refusal at 1.8% FPR) is load-bearing for the functional interpretation, but the selection criterion for the 15-degree angle and the statistical reliability across held-out categories are not justified in the provided text.
minor comments (2)
  1. [Abstract] Notation for the two vectors (probe weights vs. intervention vector) should be introduced with explicit symbols in the abstract or first methods paragraph to avoid ambiguity when reporting the cosine.
  2. [Abstract] The statement that the cosine 'stays in [0.12,0.20]' across models would be clearer if accompanied by per-model values and standard deviations rather than a range alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and presentation of our geometric test. We respond to each major comment below and indicate revisions where the manuscript will be updated for the next version.

Point-by-point responses
  1. Referee: [Abstract and §3, geometric test] The central claim that cos≈0.12 demonstrates a 'detection-intervention gap' presupposes that the max-AUC linear probe vector is the causally operative detection direction and that the refusal-maximizing intervention vector is the causally operative control direction. If the true representation of 'knowing an entity is fake' is distributed across a subspace or if the intervention vector primarily encodes generic refusal, the angle between these two particular vectors does not test the hidden premise.

    Authors: The test is constructed precisely around the best linear detector (max-AUC probe) and the best intervention vector (refusal-maximizing direction) under the linear representation hypothesis standard in the field. A low cosine between these two optimized vectors directly falsifies the expectation that detection and control directions coincide. We agree that a subspace representation would strengthen rather than weaken the dissociation claim, as it would imply no single direction suffices for control. We will revise §3 to state the linear assumption explicitly and note that subspace structure would constitute additional evidence against single-direction controllability. revision: partial

  2. Referee: [Abstract, intervention procedure] The refusal direction is obtained via 'the intervention procedure that maximizes refusal rate,' yet no loss, optimizer, number of random seeds, or regularization details are stated. Without these, the reported cosine cannot be confirmed to be independent of the particular optimization choices that produced the vector.

    Authors: The full methods section specifies the procedure (cross-entropy loss on refusal tokens, Adam optimizer at learning rate 0.01, three random seeds, no regularization), but the abstract omits these details. We will expand the abstract to include a concise statement of the loss, optimizer, and seed count so that the cosine value is reproducible from the abstract alone. revision: yes

  3. Referee: [Abstract, 15-degree rotation result] The claim that a 15-degree rotation 'partially bridges' the gap (73% / 60% refusal at 1.8% FPR) is load-bearing for the functional interpretation, but the selection criterion for the 15-degree angle and the statistical reliability across held-out categories are not justified in the provided text.

    Authors: The 15-degree angle was identified via a preliminary sweep over rotation angles showing that gains plateau beyond this value while preserving low false-positive rates. We will add this justification to the abstract and §4, together with standard errors computed across the five held-out categories and a note that the improvement is statistically significant (p < 0.01, paired t-test). revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper derives its central geometric claim by computing the cosine between two vectors obtained via independent procedures: a linear probe direction that maximizes AUC on fake-entity activations (from layer 5) and an intervention-derived direction that maximizes refusal rate. No equation or step reduces the reported cosine (0.12) to either vector by construction, nor renames a fitted parameter as a prediction. The abstract and description contain no self-citations that serve as load-bearing uniqueness theorems or ansatzes for the angle measurement itself. The result is an empirical observation on separately extracted directions and remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; ledger is therefore incomplete and limited to what can be inferred from the stated premise and measurements.

axioms (1)
  • domain assumption: A single linear direction extracted from activations is a sufficient representation of both the detection and the causal control of the target behavior.
    The entire geometric test rests on treating the probe direction and the intervention direction as the operative axes; this is invoked when the paper contrasts the observed cosine with the value 1 that 'detection implies control' would require.


