pith. machine review for the scientific record.

arxiv: 2604.18901 · v2 · submitted 2026-04-20 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 3 theorem links · Lean theorem

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords harmful intent · residual stream · linear separability · LLM probing · safety detection · activation geometry · refusal mechanisms

The pith

Harmful intent is linearly separable from residual-stream activations across many language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that harmful intent can be recovered as a linear direction in the residual-stream activations of language models. Using only 100 labelled examples per class and Soft-AUC optimisation, it fits a direction that achieves high AUROC and a strong true-positive rate at low false-positive rate. This performance holds across 12 models in four families and three alignment variants, including abliterated models from which the refusal mechanism has been removed, and it generalises to held-out benchmarks. The work matters because characterising these internal representations could help explain and improve how models handle harmful requests, beyond observing their output behaviour. Geometric analysis shows that the direction depends on the activation extraction protocol but remains effective within each protocol.

Core claim

Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families and three alignment variants, at parameter scales from 0.5B to 1.3B with a within-family extension to 9B. A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982 and TPR at 1% FPR of 0.797. It generalises to three held-out harm benchmarks and a hard-benign control. The direction matches its instruction-tuned counterpart in abliterated variants where the refusal mechanism has been removed. Different pooling protocols on the same activations recover directions 73 degrees apart, yet each supports effective detection under its own protocol.

What carries the argument

A supervised linear direction in residual-stream activations, fitted to labelled harmful and benign examples via Soft-AUC optimisation, separates harmful from benign intent.
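
For concreteness, here is a minimal sketch of what such a fit could look like: Riemannian gradient ascent on the unit sphere for a sigmoid-smoothed AUC objective, matching the optimisation excerpt quoted under reference [16] below. The hyperparameters tau, lr, and steps are illustrative placeholders, not values from the paper.

    import numpy as np

    def fit_soft_auc_direction(X_pos, X_neg, tau=1.0, lr=0.1, steps=500, seed=0):
        """Fit a unit direction w maximising a sigmoid-smoothed AUC (Mann-Whitney U).

        X_pos: [n_pos, d] pooled activations for harmful prompts.
        X_neg: [n_neg, d] pooled activations for benign prompts.
        """
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(X_pos.shape[1])
        w /= np.linalg.norm(w)                                    # start on the unit sphere S^{D-1}
        for _ in range(steps):
            m = (X_pos @ w)[:, None] - (X_neg @ w)[None, :]       # pairwise margins (x_i - x_j) . w
            s = 1.0 / (1.0 + np.exp(-m / tau))                    # smoothed Mann-Whitney U terms
            g = s * (1.0 - s) / tau                               # derivative of each term w.r.t. its margin
            grad = (g.sum(1) @ X_pos - g.sum(0) @ X_neg) / g.size # Euclidean gradient of the objective
            grad -= (grad @ w) * w                                # tangent-space projection (Riemannian gradient)
            w += lr * grad                                        # ascent step
            w /= np.linalg.norm(w)                                # retract back to the sphere
        return w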

If this is right

  • The recovered direction detects harmful intent effectively even in models from which the refusal mechanism has been removed.
  • Performance generalizes to held-out harm benchmarks and a hard-benign control set.
  • Different activation extraction protocols yield distinct directions approximately 73 degrees apart but each performs well for its protocol.
  • All supervised strategies exceed AUROC 0.96, but their TPR at 1% FPR varies far more widely, underscoring the need to report low-false-positive-rate metrics in safety evaluations (a metric sketch follows this list).
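
A minimal sketch of the low-FPR metric at issue, using standard scikit-learn calls; the scores below are synthetic, purely to make the snippet runnable:

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    def tpr_at_fpr(y_true, scores, target_fpr=0.01):
        """TPR at a fixed low FPR, interpolated along the ROC curve."""
        fpr, tpr, _ = roc_curve(y_true, scores)
        return float(np.interp(target_fpr, fpr, tpr))

    # synthetic projection scores standing in for activations @ direction;
    # a high AUROC can coexist with a mediocre TPR at 1% FPR
    rng = np.random.default_rng(0)
    y = np.concatenate([np.ones(1000), np.zeros(1000)])
    scores = np.concatenate([rng.normal(2.2, 1.0, 1000), rng.normal(0.0, 1.0, 1000)])
    print(roc_auc_score(y, scores), tpr_at_fpr(y, scores))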

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the linear direction truly captures intent rather than surface features, it could enable targeted interventions to modulate harm sensitivity without full retraining.
  • Protocol dependence suggests that the feature might be entangled with how the model processes the input format or position.
  • Extending this to larger models or other languages could test whether the separability scales or is language-specific.

Load-bearing premise

The linear direction recovered from labelled examples corresponds to a stable computational feature of harmful intent, rather than a proxy for superficial statistics that merely correlate with the labels, such as prompt length or token distribution.

What would settle it

If controlling for prompt length, token distribution, and instruction format between harmful and benign examples causes the AUROC of the fitted direction to drop substantially below 0.9, this would indicate the direction is primarily a surface proxy rather than a feature of harmful intent.
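
One way to run that settling test, sketched under assumptions: score prompts from surface statistics alone and compare against the direction's AUROC. The feature set and function names below are hypothetical illustrations, not the paper's protocol.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import cross_val_predict

    def surface_features(prompts):
        # deliberately semantics-free statistics: character length, word count, digit count
        return np.array([[len(p), len(p.split()), sum(c.isdigit() for c in p)]
                         for p in prompts], dtype=float)

    def surface_only_auroc(prompts, labels):
        """Cross-validated AUROC achievable from surface statistics alone."""
        probs = cross_val_predict(LogisticRegression(max_iter=1000),
                                  surface_features(prompts), np.asarray(labels),
                                  cv=5, method='predict_proba')[:, 1]
        return roc_auc_score(labels, probs)

If this baseline lands near the direction's AUROC, the surface-proxy reading gains weight; if it stays near chance while the direction holds above 0.9, the stable-feature reading does.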

Figures

Figures reproduced from arXiv: 2604.18901 by Isaac Llorente-Saguer.

Figure 1. Effective AUROC (harmful vs all benign) for all direction strategies across the 12 models, at the…
Figure 2. TPR at 1% FPR (harmful vs all benign).
Figure 3. Effective AUROC vs TPR@1%FPR for wLDA (left) and wopt (right), with stratified bootstrap 95% CI whiskers (1,000 resamples). Each point is one model; colour indicates architecture family, shape indicates alignment variant. Zero-shot and surface baselines: PC1 (normative) achieves AUROC 0.665 ± 0.073 and TPR below 0.05 on all 12 models. The perplexity baseline achieves AUROC 0.587 ± 0.038. The random baseline…
Figure 4. Score distributions at the validation-selected layer for two instruction-tuned models. Dashed vertical…
Figure 5. Per-layer effective AUROC for all strategies and all 12 models. Each point is an independent fit.
Figure 6. Pairwise unsigned angles between direction vectors at the validation-selected layer, for all 12 models.
Figure 7. Effective AUROC (left) and TPR@1%FPR (right) as a function of fit-set size.
read the original abstract

Aligned language models refuse harmful instructions, but the representations through which they recognise such instructions are less well characterised than the behaviours they produce. Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), with parameter scales from 0.5B to 1.3B and a within-family scale extension to 9B on Qwen3.5. A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982 and TPR@1%FPR 0.797, generalises to three held-out harm benchmarks and a hard-benign control, and matches its instruction-tuned counterpart within $\pm 0.003$ AUROC in abliterated variants from which the refusal mechanism has been removed. The supervised strategies all exceed AUROC 0.96, but their TPR@1%FPR varies by more than ten times the AUROC gap; a deployed 9B safety classifier shows the same pattern at AUROC 0.94 and TPR 0.30, motivating low-FPR reporting as a default in safety-adjacent detection evaluation. Geometric measurements refine the picture. The recovered direction is concentrated within each extraction protocol but protocol-dependent across them: two pooling choices applied to the same chat-templated activations at the same residual-stream layer (max-pool over content tokens versus last-token at the post-instruction position) recover harm directions $73^\circ$ apart, and projecting one out leaves detection under either max-pool extraction essentially intact. Probing identifies a protocol-specific direction rather than a unique computational feature.
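
The two pooling protocols the abstract contrasts are simple to state as code. A minimal sketch, assuming resid holds one layer's residual-stream activations for a chat-templated prompt; content_idx and post_instruction_pos are placeholder indices, not identifiers from the paper:

    import numpy as np

    def max_pool_extract(resid, content_idx):
        """Max-pool over content-token activations at one residual-stream layer."""
        return resid[content_idx].max(axis=0)      # resid: [seq_len, d]

    def last_token_extract(resid, post_instruction_pos):
        """Last-token activation at the post-instruction position."""
        return resid[post_instruction_pos]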

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript claims that harmful intent is linearly separable in residual-stream activations of LLMs. A direction fitted via Soft-AUC optimization on 100 labeled examples per class from 12 models (spanning Qwen, Llama, Gemma families and alignment variants) achieves mean AUROC 0.982 and TPR@1%FPR 0.797, generalizes to held-out harm benchmarks and a hard-benign control, remains effective in abliterated models, and is compared to a deployed safety classifier. Geometric analysis shows the direction is concentrated within extraction protocols but 73° apart across max-pool vs. last-token pooling at the same layer, leading to the conclusion that probing recovers a protocol-specific direction rather than a unique computational feature.

Significance. If the recovered direction isolates harmful intent rather than surface statistics, the result would strengthen mechanistic understanding of refusal and safety behaviors across model families and scales. Strengths include the breadth of models tested, explicit low-FPR metric reporting, and direct comparison to an existing deployed classifier. The geometric ablation of pooling protocols is a useful refinement. However, the supervised nature of the fit and lack of explicit controls for confounds limit the strength of the 'stable feature' interpretation.

major comments (3)
  1. Abstract and generalization section: The reported generalization to three held-out harm benchmarks and the hard-benign control does not state whether these sets were matched or ablated for prompt length, lexical distribution, token statistics, or instruction format. Without such controls, the high AUROC and TPR may reflect exploitation of surface correlations by the Soft-AUC fit on only 100 examples per class rather than recovery of intent as a stable feature.
  2. Geometric measurements paragraph: The reported 73° angle between max-pool and last-token harm directions at the same residual-stream layer, together with the finding that projecting one out leaves the other essentially intact, indicates strong extraction-protocol dependence. This directly weakens the central claim that the direction corresponds to a model-intrinsic computational feature of harmful intent rather than a protocol-specific proxy. (A sketch of both geometric checks follows this list.)
  3. Soft-AUC fitting description: The manuscript does not detail whether the Soft-AUC optimization on the 100-example sets was cross-validated or whether regularization was applied; given the small sample size and the downstream performance numbers, this leaves open the possibility of overfitting to label-correlated surface statistics.
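
A minimal sketch of the two geometric checks referenced in comment 2, the unsigned angle between fitted directions and the projection-out ablation; variable names are illustrative:

    import numpy as np

    def unsigned_angle_deg(u, v):
        """Sign-invariant angle between two direction vectors, in degrees."""
        c = abs(np.dot(u, v)) / (np.linalg.norm(u) * np.linalg.norm(v))
        return float(np.degrees(np.arccos(np.clip(c, 0.0, 1.0))))

    def project_out(X, u):
        """Remove from each row of X its component along direction u."""
        u = u / np.linalg.norm(u)
        return X - np.outer(X @ u, u)

    # e.g. unsigned_angle_deg(w_maxpool, w_lasttoken) reportedly lands near 73;
    # the ablation re-scores with one direction after removing the other:
    # scores = project_out(X_maxpool, w_lasttoken) @ w_maxpool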

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our claims. We respond to each major point below and indicate revisions where they strengthen the manuscript without altering its core findings.

read point-by-point responses
  1. Referee: Abstract and generalization section: The reported generalization to three held-out harm benchmarks and the hard-benign control does not state whether these sets were matched or ablated for prompt length, lexical distribution, token statistics, or instruction format. Without such controls, the high AUROC and TPR may reflect exploitation of surface correlations by the Soft-AUC fit on only 100 examples per class rather than recovery of intent as a stable feature.

    Authors: The manuscript does not explicitly report matching or ablation of the held-out sets for prompt length, lexical distribution, token statistics, or instruction format. The hard-benign control is intended to include surface-similar benign examples, and the strong generalization across three independent harm benchmarks plus consistent results over 12 models provide indirect support for robustness. In revision we will add a dedicated paragraph in the generalization section describing the benchmark construction process and any post-hoc checks performed on length and lexical overlap. revision: yes

  2. Referee: Geometric measurements paragraph: The reported 73° angle between max-pool and last-token harm directions at the same residual-stream layer, together with the finding that projecting one out leaves the other essentially intact, indicates strong extraction-protocol dependence. This directly weakens the central claim that the direction corresponds to a model-intrinsic computational feature of harmful intent rather than a protocol-specific proxy.

    Authors: The manuscript already reports the 73° angle and the projection result, and concludes explicitly that 'Probing identifies a protocol-specific direction rather than a unique computational feature.' This finding refines rather than undermines the central claim: harmful intent remains geometrically recoverable within each extraction protocol, with high within-protocol stability and cross-model generalization. The title and abstract frame the result as recoverability of a feature, not invariance to every preprocessing choice. No revision is required on this point. revision: no

  3. Referee: Soft-AUC fitting description: The manuscript does not detail whether the Soft-AUC optimization on the 100-example sets was cross-validated or whether regularization was applied; given the small sample size and the downstream performance numbers, this leaves open the possibility of overfitting to label-correlated surface statistics.

    Authors: We agree that the methods section lacks explicit statements on cross-validation folds or regularization strength for the Soft-AUC procedure. In the revised manuscript we will expand the fitting description to include these details (number of folds, regularization parameter if used, and any early-stopping criteria). The observed generalization to held-out benchmarks and the replication of high AUROC across twelve architecturally distinct models already argue against severe overfitting, but the added methodological transparency will address the concern directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; supervised probe evaluated on held-out data

full rationale

The paper describes fitting a linear direction on 100 labeled examples per class via Soft-AUC optimization and reports its AUROC and TPR on held-out harm benchmarks plus a hard-benign control. This constitutes standard supervised evaluation with independent test sets rather than any self-definitional loop, fitted input renamed as prediction, or load-bearing self-citation. No uniqueness theorems, ansatzes, or renamings of known results are invoked to support the separability claim. The geometric measurements (e.g., 73° angle between pooling protocols) are post-hoc observations on the fitted vectors and do not reduce the central result to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on one fitted direction per extraction protocol plus standard linear-algebra assumptions; no new entities are introduced. A usage sketch of the fitted direction follows the ledger.

free parameters (1)
  • harm direction vector
    Optimized via Soft-AUC on 100 labeled examples per class for each model and pooling protocol.
axioms (1)
  • domain assumption: Harmful intent is linearly separable in residual-stream activations at the chosen layer.
    Invoked when the fitted direction is treated as recovering a meaningful feature rather than a correlational artifact.
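
As a usage sketch of the ledger's one free parameter: at inference time, pooled activations at the validation-selected layer are projected onto the fitted direction and thresholded. The calibration shown is an assumption, not the paper's procedure.

    import numpy as np

    def harm_scores(H, w):
        """Scalar projections of pooled activations H [n, d] onto the unit harm direction."""
        return H @ (w / np.linalg.norm(w))

    def calibrate_threshold(benign_val_scores, target_fpr=0.01):
        # hypothetical calibration: the (1 - target_fpr) quantile of benign
        # validation scores, targeting roughly target_fpr false positives
        return float(np.quantile(benign_val_scores, 1.0 - target_fpr))

    def flag_harmful(H, w, threshold):
        return harm_scores(H, w) > threshold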

pith-pipeline@v0.9.0 · 5636 in / 1317 out tokens · 42116 ms · 2026-05-12T00:45:27.740789+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 7 internal anchors

  1. [1] Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744, 2024.

  2. [2] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022.

  3. [3] Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021.

  4. [4] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674, 2023.

  5. [5] Isaac Llorente-Saguer. The geometry of harmful intent: Training-free anomaly detection via angular deviation in LLM residual streams. arXiv preprint arXiv:2603.27412, 2026.

  6. [6] Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.

  7. [7] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.

  8. [8] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.

  9. [9] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023.

  10. [10] Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.

  11. [11] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019.

  12. [12] Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772, 2024.

  13. [13] Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately. arXiv preprint arXiv:2507.11878, 2025.

  14. [14] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable gradient-based adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.

  15. [15] Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.

  16. [16] Soft-AUC optimisation excerpt (methods text mis-extracted as a reference): the direction $w$ is optimised via Riemannian gradient ascent on the unit sphere $S^{D-1}$. At each step, the Euclidean gradient $\nabla \hat{U}$ is projected onto the tangent space of the sphere at $w$ to obtain the Riemannian gradient $\nabla_R \hat{U} = \nabla \hat{U} - (\nabla \hat{U} \cdot w)\,w$ (Eq. 3). The objective is a sigmoid-smoothed approximation to the Mann–Whitney $U$ statistic, where each pairwise margin $(x_i - x_j) \cdot w$ is scaled by...