Recognition: 3 theorem links
Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3
The pith
Harmful intent is linearly separable from residual-stream activations across many language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families and three alignment variants, with parameter scales from 0.5B to 1.3B and a within-family extension to 9B. A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982 and TPR at 1% FPR of 0.797. It generalises to three held-out harm benchmarks and a hard-benign control. The direction matches its instruction-tuned counterpart in abliterated variants from which the refusal mechanism has been removed. Different pooling protocols on the same activations recover directions 73 degrees apart, yet each supports effective detection under its own protocol.
What carries the argument
A supervised linear direction in residual-stream activations recovered from labeled harmful and benign examples using Soft-AUC optimisation to separate harmful intent.
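The fitting procedure the paper describes (a sigmoid-smoothed Mann–Whitney U surrogate for AUC, optimised by Riemannian gradient ascent on the unit sphere) can be sketched in plain NumPy. The temperature, learning rate, and step count below are illustrative placeholders, not the paper's values:

```python
import numpy as np

def fit_soft_auc_direction(X_harm, X_benign, steps=300, lr=0.1, tau=1.0, seed=0):
    """Fit a unit direction w maximising a sigmoid-smoothed AUC
    (Mann-Whitney U surrogate) between harmful and benign activations.

    X_harm, X_benign: (n, D) residual-stream activation matrices.
    tau scales each pairwise margin (x_i - x_j) . w inside the sigmoid.
    Hyperparameters are illustrative, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X_harm.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(steps):
        # pairwise margins (x_i - x_j) . w for harmful i, benign j
        margins = (X_harm @ w)[:, None] - (X_benign @ w)[None, :]
        s = 1.0 / (1.0 + np.exp(-margins / tau))  # smoothed ranking indicator
        # d/dw of sigmoid(m/tau) is s(1-s)/tau * (x_i - x_j)
        coef = s * (1.0 - s) / tau
        grad = (coef.sum(axis=1) @ X_harm - coef.sum(axis=0) @ X_benign)
        grad /= margins.size
        # Riemannian gradient: project onto the tangent space of the sphere at w
        grad_r = grad - (grad @ w) * w
        w = w + lr * grad_r
        w /= np.linalg.norm(w)  # retract back onto the unit sphere
    return w
```

On synthetic activations with a planted harmful direction, the recovered `w` separates the two classes; the paper's version operates on real residual-stream activations at a chosen layer.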
If this is right
- The recovered direction detects harmful intent effectively even in models from which the refusal mechanism has been removed.
- Performance generalizes to held-out harm benchmarks and a hard-benign control set.
- Different activation extraction protocols yield distinct directions approximately 73 degrees apart but each performs well for its protocol.
- Supervised probing exceeds AUROC 0.96 with varying low-FPR performance, highlighting the need for low false positive rate metrics in safety evaluations.
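The gap between AUROC and TPR@1%FPR that the last bullet stresses is easy to reproduce from raw scores. A minimal NumPy sketch of both metrics (not the paper's evaluation code):

```python
import numpy as np

def auroc(harm_scores, benign_scores):
    """Empirical AUROC via the Mann-Whitney statistic (ties count half)."""
    diff = harm_scores[:, None] - benign_scores[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def tpr_at_fpr(harm_scores, benign_scores, fpr=0.01):
    """TPR at the threshold that admits at most `fpr` benign false positives."""
    thresh = np.quantile(benign_scores, 1.0 - fpr)
    return (harm_scores > thresh).mean()
```

Because AUROC averages over all thresholds while TPR@1%FPR sits in the extreme tail of the benign score distribution, two detectors with near-identical AUROC can differ sharply at the low-FPR operating point that matters for deployment.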
Where Pith is reading between the lines
- If the linear direction truly captures intent rather than surface features, it could enable targeted interventions to modulate harm sensitivity without full retraining.
- Protocol dependence suggests that the feature might be entangled with how the model processes the input format or position.
- Extending this to larger models or other languages could test whether the separability scales or is language-specific.
Load-bearing premise
The linear direction recovered from labeled examples corresponds to a stable computational feature of harmful intent rather than a proxy for superficial statistics like prompt length or token distribution that correlate with the labels.
What would settle it
If controlling for prompt length, token distribution, and instruction format between harmful and benign examples causes the AUROC of the fitted direction to drop substantially below 0.9, this would indicate the direction is primarily a surface proxy rather than a feature of harmful intent.
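One way to run that control is to restrict the AUROC to harmful/benign pairs drawn from the same prompt-length bin. A hypothetical sketch (the binning scheme and function names are ours, not the paper's protocol):

```python
import numpy as np

def length_matched_auroc(harm_scores, harm_lens, benign_scores, benign_lens, bins=10):
    """AUROC computed only over harmful/benign pairs whose prompts fall in
    the same length bin. A collapse toward 0.5 relative to the unmatched
    AUROC would suggest the probe tracks prompt length rather than intent."""
    all_lens = np.concatenate([harm_lens, benign_lens])
    edges = np.quantile(all_lens, np.linspace(0, 1, bins + 1))[1:-1]
    idx_h = np.digitize(harm_lens, edges)
    idx_b = np.digitize(benign_lens, edges)
    wins = ties = total = 0.0
    for k in range(bins):
        h = harm_scores[idx_h == k]
        b = benign_scores[idx_b == k]
        if h.size == 0 or b.size == 0:
            continue  # bin contains only one class: no matched pairs
        diff = h[:, None] - b[None, :]
        wins += (diff > 0).sum()
        ties += (diff == 0).sum()
        total += diff.size
    return (wins + 0.5 * ties) / total
```

If the direction's scores are a pure length proxy, the unmatched AUROC can be high while the length-matched AUROC sits near chance; a genuine intent feature should survive the matching.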
Original abstract
Aligned language models refuse harmful instructions, but the representations through which they recognise such instructions are less well characterised than the behaviours they produce. Harmful intent is linearly separable from residual-stream activations across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), with parameter scales from 0.5B to 1.3B and a within-family scale extension to 9B on Qwen3.5. A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982 and TPR@1%FPR 0.797, generalises to three held-out harm benchmarks and a hard-benign control, and matches its instruction-tuned counterpart within ±0.003 AUROC in abliterated variants from which the refusal mechanism has been removed. The supervised strategies all exceed AUROC 0.96, but their TPR@1%FPR varies by more than ten times the AUROC gap; a deployed 9B safety classifier shows the same pattern at AUROC 0.94 and TPR 0.30, motivating low-FPR reporting as a default in safety-adjacent detection evaluation. Geometric measurements refine the picture. The recovered direction is concentrated within each extraction protocol but protocol-dependent across them: two pooling choices applied to the same chat-templated activations at the same residual-stream layer (max-pool over content tokens versus last-token at the post-instruction position) recover harm directions 73° apart, and projecting one out leaves detection under either max-pool extraction essentially intact. Probing identifies a protocol-specific direction rather than a unique computational feature.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that harmful intent is linearly separable in residual-stream activations of LLMs. A direction fitted via Soft-AUC optimization on 100 labeled examples per class from 12 models (spanning Qwen, Llama, Gemma families and alignment variants) achieves mean AUROC 0.982 and TPR@1%FPR 0.797, generalizes to held-out harm benchmarks and a hard-benign control, remains effective in abliterated models, and is compared to a deployed safety classifier. Geometric analysis shows the direction is concentrated within extraction protocols but 73° apart across max-pool vs. last-token pooling at the same layer, leading to the conclusion that probing recovers a protocol-specific direction rather than a unique computational feature.
Significance. If the recovered direction isolates harmful intent rather than surface statistics, the result would strengthen mechanistic understanding of refusal and safety behaviors across model families and scales. Strengths include the breadth of models tested, explicit low-FPR metric reporting, and direct comparison to an existing deployed classifier. The geometric ablation of pooling protocols is a useful refinement. However, the supervised nature of the fit and lack of explicit controls for confounds limit the strength of the 'stable feature' interpretation.
Major comments (3)
- Abstract and generalization section: The reported generalization to three held-out harm benchmarks and the hard-benign control does not state whether these sets were matched or ablated for prompt length, lexical distribution, token statistics, or instruction format. Without such controls, the high AUROC and TPR may reflect exploitation of surface correlations by the Soft-AUC fit on only 100 examples per class rather than recovery of intent as a stable feature.
- Geometric measurements paragraph: The reported 73° angle between max-pool and last-token harm directions at the same residual-stream layer, together with the finding that projecting one out leaves the other essentially intact, indicates strong extraction-protocol dependence. This directly weakens the central claim that the direction corresponds to a model-intrinsic computational feature of harmful intent rather than a protocol-specific proxy.
- Soft-AUC fitting description: The manuscript does not detail whether the Soft-AUC optimization on the 100-example sets was cross-validated or whether regularization was applied; given the small sample size and the downstream performance numbers, this leaves open the possibility of overfitting to label-correlated surface statistics.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our claims. We respond to each major point below and indicate revisions where they strengthen the manuscript without altering its core findings.
Point-by-point responses
-
Referee: Abstract and generalization section: The reported generalization to three held-out harm benchmarks and the hard-benign control does not state whether these sets were matched or ablated for prompt length, lexical distribution, token statistics, or instruction format. Without such controls, the high AUROC and TPR may reflect exploitation of surface correlations by the Soft-AUC fit on only 100 examples per class rather than recovery of intent as a stable feature.
Authors: The manuscript does not explicitly report matching or ablation of the held-out sets for prompt length, lexical distribution, token statistics, or instruction format. The hard-benign control is intended to include surface-similar benign examples, and the strong generalization across three independent harm benchmarks plus consistent results over 12 models provide indirect support for robustness. In revision we will add a dedicated paragraph in the generalization section describing the benchmark construction process and any post-hoc checks performed on length and lexical overlap. revision: yes
-
Referee: Geometric measurements paragraph: The reported 73° angle between max-pool and last-token harm directions at the same residual-stream layer, together with the finding that projecting one out leaves the other essentially intact, indicates strong extraction-protocol dependence. This directly weakens the central claim that the direction corresponds to a model-intrinsic computational feature of harmful intent rather than a protocol-specific proxy.
Authors: The manuscript already reports the 73° angle and the projection result, and concludes explicitly that 'Probing identifies a protocol-specific direction rather than a unique computational feature.' This finding refines rather than undermines the central claim: harmful intent remains geometrically recoverable within each extraction protocol, with high within-protocol stability and cross-model generalization. The title and abstract frame the result as recoverability of a feature, not invariance to every preprocessing choice. No revision is required on this point. revision: no
-
Referee: Soft-AUC fitting description: The manuscript does not detail whether the Soft-AUC optimization on the 100-example sets was cross-validated or whether regularization was applied; given the small sample size and the downstream performance numbers, this leaves open the possibility of overfitting to label-correlated surface statistics.
Authors: We agree that the methods section lacks explicit statements on cross-validation folds or regularization strength for the Soft-AUC procedure. In the revised manuscript we will expand the fitting description to include these details (number of folds, regularization parameter if used, and any early-stopping criteria). The observed generalization to held-out benchmarks and the replication of high AUROC across twelve architecturally distinct models already argue against severe overfitting, but the added methodological transparency will address the concern directly. revision: yes
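The cross-validation the response promises can be sketched generically. The stand-in below fits the simple mean-difference (LDA-style) direction mentioned in the paper rather than the Soft-AUC objective, and the fold scheme is illustrative:

```python
import numpy as np

def cv_heldout_auroc(X_harm, X_benign, folds=5, seed=0):
    """K-fold overfitting check: fit the mean-difference direction
    w = (mu_H - mu_B) / ||mu_H - mu_B|| on k-1 folds, then score the
    held-out fold. A stand-in for the paper's Soft-AUC fit, whose
    cross-validation setup is unspecified in the manuscript."""
    rng = np.random.default_rng(seed)
    ih = rng.permutation(len(X_harm))
    ib = rng.permutation(len(X_benign))
    aucs = []
    for k in range(folds):
        te_h, te_b = ih[k::folds], ib[k::folds]
        tr_h = np.setdiff1d(ih, te_h)
        tr_b = np.setdiff1d(ib, te_b)
        w = X_harm[tr_h].mean(0) - X_benign[tr_b].mean(0)
        w /= np.linalg.norm(w)
        sh, sb = X_harm[te_h] @ w, X_benign[te_b] @ w
        # empirical AUROC on the held-out fold (ties count half)
        aucs.append((sh[:, None] > sb[None, :]).mean()
                    + 0.5 * (sh[:, None] == sb[None, :]).mean())
    return float(np.mean(aucs))
```

With 100 examples per class, a held-out AUROC that stays high across folds is exactly the kind of evidence against overfitting the referee asks for; on label-shuffled data it should fall back toward 0.5.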
Circularity Check
No significant circularity; supervised probe evaluated on held-out data
Full rationale
The paper describes fitting a linear direction on 100 labeled examples per class via Soft-AUC optimization and reports its AUROC and TPR on held-out harm benchmarks plus a hard-benign control. This constitutes standard supervised evaluation with independent test sets rather than any self-definitional loop, fitted input renamed as prediction, or load-bearing self-citation. No uniqueness theorems, ansatzes, or renamings of known results are invoked to support the separability claim. The geometric measurements (e.g., 73° angle between pooling protocols) are post-hoc observations on the fitted vectors and do not reduce the central result to its inputs by construction.
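The two geometric measurements referenced here, the angle between fitted directions and the projection-out ablation, reduce to a few lines of NumPy. A minimal sketch:

```python
import numpy as np

def angle_deg(u, v):
    """Angle in degrees between two direction vectors."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

def project_out(X, v):
    """Remove the component of each activation row along direction v."""
    v = v / np.linalg.norm(v)
    return X - np.outer(X @ v, v)
```

After `project_out(X, v)`, every row of the result is orthogonal to `v`; if detection under the other protocol survives this ablation, the two directions carry largely independent signal, which is the paper's 73° observation.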
Axiom & Free-Parameter Ledger
Free parameters (1)
- harm direction vector
Axioms (1)
- Domain assumption: harmful intent is linearly separable in residual-stream activations at the chosen layer.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
"A direction fitted from 100 labelled examples per class via Soft-AUC optimisation reaches mean effective AUROC 0.982... The recovered direction is concentrated within each extraction protocol but protocol-dependent across them: two pooling choices... recover harm directions 73° apart"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
"Harmful intent is linearly separable from residual-stream activations... w_LDA = (μ̂_H − μ̂_N) / ∥μ̂_H − μ̂_N∥... Soft-AUC surrogate... Riemannian gradient ascent on the unit sphere"
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
"Detection remains stable across alignment variants, including abliterated models... harmful intent and refusal are functionally dissociated"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Chao Chen, Kai Liu, Ze Chen, Yi Gu, Yue Wu, Mingyuan Tao, Zhihang Fu, and Jieping Ye. INSIDE: LLMs' internal states retain the power of hallucination detection. arXiv preprint arXiv:2402.03744.
-
[2]
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652.
-
[3]
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, 2021.
-
[4]
Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
-
[5]
Isaac Llorente-Saguer. The geometry of harmful intent: Training-free anomaly detection via angular deviation in LLM residual streams. arXiv preprint arXiv:2603.27412.
-
[6]
Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824.
-
[7]
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249.
-
[8]
Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
-
[9]
Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658.
-
[10]
Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024.
-
[11]
Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, 2019.
-
[12]
Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. ShieldGemma: Generative AI content moderation based on Gemma. arXiv preprint arXiv:2407.21772.
-
[13]
Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. LLMs encode harmfulness and refusal separately. arXiv preprint arXiv:2507.11878, 2025.
-
[14]
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable gradient-based adversarial attacks on large language models. arXiv preprint arXiv:2310.15140.
-
[15]
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023.
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal ...