arxiv: 2604.09839 · v2 · submitted 2026-04-10 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

Steered LLM Activations are Non-Surjective

Aayush Mishra , Daniel Khashabi , Anqi Liu

Pith reviewed 2026-05-11 00:55 UTC · model grok-4.3

keywords: activation steering · LLM interpretability · surjectivity · residual stream · white-box control · black-box prompting

The pith

Activation steering in LLMs moves residual stream states off the manifold reachable by any prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that under practical assumptions, adding a steering vector to an LLM's activations produces an internal state with no preimage under the model's normal forward pass from token sequences. This means the behavioral effect of steering cannot be matched by any textual prompt, even in principle. A reader would care because much current work in interpretability and safety treats steering success as a proxy for what prompts could achieve, yet the result shows these are formally distinct operations. The authors support the claim with both a mathematical argument and checks on three common models.

Core claim

Under practical assumptions, activation steering is non-surjective: the steered residual-stream activation lies outside the image of the forward pass from discrete prompts, so almost surely no prompt can reproduce the same internal state.
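
The claim can be written out as a short sketch. The notation is an assumption made here for illustration and is not taken from the paper: $V$ is the vocabulary, $K$ the maximum prompt length, $f$ the map from a prompt to its residual-stream activation at a fixed layer and position, $v$ a steering vector, and $\alpha$ its coefficient.

```latex
\begin{align*}
V^{\le K} &= \bigcup_{k=1}^{K} V^{k}, \qquad
f : V^{\le K} \to \mathbb{R}^{d}, \qquad
\mathcal{R} = f\bigl(V^{\le K}\bigr) \\
\tilde{h} &= f(p) + \alpha v, \qquad p \in V^{\le K},\ \alpha \neq 0 \\
\tilde{h} \notin \mathcal{R}
  &\iff f(q) \neq \tilde{h} \ \text{ for every } q \in V^{\le K} \\
|\mathcal{R}| &\le \sum_{k=1}^{K} |V|^{k} < \infty
  \;\Longrightarrow\; \lambda^{d}(\mathcal{R}) = 0
\end{align*}
```

Here $\lambda^{d}$ is Lebesgue measure on $\mathbb{R}^{d}$. A finite set of prompts can only reach a finite set of activations, so the reachable set $\mathcal{R}$ has measure zero. That is the sense in which the Figure 1 caption speaks of holes in the activation space.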

What carries the argument

The manifold of residual-stream activations reachable from sequences of discrete tokens; additive steering displaces the activation vector off this manifold.

Load-bearing premise

The practical assumptions about residual-stream geometry and additive steering that make the set of prompt-reachable states a lower-dimensional submanifold of the full activation space.

What would settle it

An explicit prompt whose residual-stream activation vector exactly equals a given steered vector, or a demonstration that the reachable set is the entire space under the model's actual forward pass.
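
A minimal sketch of what such a preimage search looks like, on a toy model small enough to enumerate every prompt. Everything in it is hypothetical: `toy_forward` is a stand-in for a real forward pass, and the vocabulary size, dimension, and steering vector are chosen only for illustration. It is not the paper's experimental method.

```python
import itertools
import numpy as np


def toy_forward(prompt, emb, pos):
    """Hypothetical stand-in for a forward pass: token-id tuple -> vector in R^d."""
    h = sum(emb[t] * pos[i] for i, t in enumerate(prompt))
    return np.tanh(h)


def reachable_set(vocab_size, max_len, emb, pos):
    """Enumerate every prompt of length 1..max_len and its activation."""
    prompts = [
        p
        for k in range(1, max_len + 1)
        for p in itertools.product(range(vocab_size), repeat=k)
    ]
    acts = np.stack([toy_forward(p, emb, pos) for p in prompts])
    return prompts, acts


def nearest_prompt(target, prompts, acts):
    """Return the prompt whose activation is closest to target in L2 distance."""
    dists = np.linalg.norm(acts - target, axis=1)
    i = int(np.argmin(dists))
    return prompts[i], float(dists[i])


rng = np.random.default_rng(0)
d, vocab_size, max_len = 8, 5, 3           # toy sizes, chosen for illustration
emb = rng.normal(size=(vocab_size, d))     # hypothetical token embeddings
pos = rng.normal(size=(max_len, d))        # hypothetical per-position scaling
prompts, acts = reachable_set(vocab_size, max_len, emb, pos)

base = acts[0]                             # natural activation of prompt (0,)
steer = rng.normal(size=d)                 # hypothetical steering vector
for alpha in (0.0, 0.5, 2.0):
    prompt, dist = nearest_prompt(base + alpha * steer, prompts, acts)
    print(f"alpha={alpha}: nearest prompt {prompt}, L2 distance {dist:.4f}")
```

A distance of exactly zero would be an explicit preimage. In a real model the prompt space is far too large to enumerate, so a search like this can only fail to find a preimage; it cannot rule one out.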

Figures

Figures reproduced from arXiv: 2604.09839 by Aayush Mishra, Anqi Liu, Daniel Khashabi.

**Figure 1.** LLMs admit a countable and practically finite number of prompts $V^{\le K}$. This property implies the existence of holes in their real activation space $\mathbb{R}^d$: regions that do not map back to any prompt. We show that activation steering, a popular white-box intervention method to change model behavior, almost surely steers activations into such holes resulting in almost-sure non-surjectivity, i.e., steered model b…

**Figure 2.** Due to their almost sure injectivity, natural LLM activations can uniquely recover prompts using the SipIt algorithm (§5.1). Injectivity at initialization; preserved under training. Nikolaou et al. (2025) use the real-analyticity of transformers to show that with random draws of initial parameters (from practical distributions like Gaussian, Xavier, etc.), internal representations of these models almost…

**Figure 3.** We test the surjectivity of steered activations using two methods.

**Figure 4.** We sort the L2 distances between activations …

**Figure 5.** We plot the L2 distance between steered activations and model’s natural activations …

**Figure 6.** SIPIT experiments on the Gemma model (first two plots) and the INT4-quantized Llama 3.2-1B-Instruct model (rightmost plot) show similar trends. Gemma activations have large absolute values, which scales the numbers. We did not perform a coefficient sweep for this model due to resource constraints.

**Figure 7.** SIPIT experiments on the Qwen model show similar results.

**Figure 8.** ICL experiments on the Qwen, Gemma and even the quantized Llama (INT4) …

Original abstract

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves that under practical assumptions steered activations lie off the prompt-reachable manifold, but the assumptions stay vague and the experiments only illustrate rather than confirm the almost-sure claim.

The main thing to know is that this paper argues activation steering moves the residual stream to states that, almost surely, no discrete prompt can produce. Under their practical assumptions the forward map from token sequences is not surjective, so steering achieves internal configurations prompting cannot reach. This is the formal separation they draw between white-box steering and black-box prompting.

The new piece is the surjectivity framing itself plus the claim that additive steering takes you off the reachable manifold. The three-LLM checks show statistical differences consistent with that picture. The work is useful because it gives a mathematical reason to stop treating steering results as direct evidence about what prompts can do in safety or interpretability settings.

The soft spots are the unspecified assumptions. Without seeing exactly what they assume about the residual-stream manifold, the continuity or analyticity of the forward pass, or the general position of embeddings, it is hard to know whether the measure-zero argument survives contact with real transformers. The empirical illustrations are only finite samples, so they cannot establish the almost-sure non-existence result; they just show differences.

This paper is for people already working with activation steering who want to understand its theoretical limits relative to prompting. A reader who cares about formal constraints on model internals will find it worth reading. It deserves a serious referee because the claim is strong enough to matter if the assumptions hold. I would send it out for review and ask the referees to focus on whether the practical assumptions are realistic and whether the proof generalizes beyond the idealized setting.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that activation steering in LLMs is non-surjective: under practical assumptions, additive steering displaces the residual stream off the manifold of activations reachable from any discrete token sequence. It proves that almost surely no prompt can reproduce the internal state induced by steering, using a mathematical analysis of manifolds and the forward pass, and illustrates the finding with empirical checks on three widely used LLMs. The authors conclude that this creates a formal separation between white-box steering and black-box prompting, cautioning against equating steering success with prompt-based interpretability or vulnerability.

Significance. If the central non-surjectivity result holds, it is significant for interpretability and safety research because it supplies a theoretical reason why steered behaviors may not be realizable by any natural-language prompt. The combination of a manifold-theoretic argument with multi-model empirical illustrations provides a concrete basis for decoupling white-box and black-box interventions, which could influence evaluation protocols in probing, safety, and mechanistic interpretability.

major comments (2)

[Abstract / proof] Abstract and proof section: the central claim rests on 'practical assumptions' about the residual-stream manifold and forward-pass properties, yet these assumptions are never stated explicitly (e.g., whether the reachable set has positive codimension, whether the forward map is analytic or merely continuous, or whether token embeddings are in general position). Without this list the measure-zero argument cannot be verified for real transformers.
[Empirical illustration] Empirical section: the experiments on three LLMs show statistical differences between steered and prompt-induced activations, but finite sampling cannot establish the 'almost surely' non-existence result; the paper acknowledges this limitation yet still presents the empirical work as supporting the almost-sure claim.

minor comments (1)

[Throughout] The phrase 'manifold of states reachable from discrete prompts' is used repeatedly without a formal definition or pointer to prior literature on activation geometry in transformers; a short definitional paragraph would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our assumptions and the scope of the empirical results. We address each point below and will incorporate revisions to improve verifiability and framing.

Referee: [Abstract / proof] Abstract and proof section: the central claim rests on 'practical assumptions' about the residual-stream manifold and forward-pass properties, yet these assumptions are never stated explicitly (e.g., whether the reachable set has positive codimension, whether the forward map is analytic or merely continuous, or whether token embeddings are in general position). Without this list the measure-zero argument cannot be verified for real transformers.

Authors: We agree that the assumptions underlying the measure-zero argument should be stated explicitly to enable verification. In the revised manuscript we will add a dedicated paragraph in the proof section that enumerates them: (1) the reachable activation set is a lower-dimensional submanifold of positive codimension in the residual stream; (2) the forward-pass map is continuous (and analytic on the interior of its domain); and (3) token embeddings are in general position so that their linear combinations do not fill the ambient space. These clarifications will make the application of the measure-zero result transparent for real transformers.
Revision: yes

Referee: [Empirical illustration] Empirical section: the experiments on three LLMs show statistical differences between steered and prompt-induced activations, but finite sampling cannot establish the 'almost surely' non-existence result; the paper acknowledges this limitation yet still presents the empirical work as supporting the almost-sure claim.

Authors: We concur that finite sampling supplies only illustrative evidence and cannot prove the almost-sure claim, which rests on the theoretical argument. The empirical checks are intended to demonstrate that the predicted statistical separation is observable on standard models. We will revise the text to frame the experiments explicitly as supportive illustrations, strengthen the limitations discussion, and avoid any phrasing that could be read as treating the empirical results as confirmatory of the measure-zero statement.
Revision: yes
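
Assumption (1) in the first response is the one that drives the almost-sure statement. A minimal sketch of that step follows, under two assumptions that are not stated in the source: the reachable set $\mathcal{M}$ is a $C^1$ embedded submanifold of $\mathbb{R}^{d}$, and the steering vector $v$ is drawn from a distribution with density $\rho$ on $\mathbb{R}^{d}$. The symbol $h$ denotes the unsteered activation and $\alpha$ the steering coefficient.

```latex
\begin{align*}
\dim \mathcal{M} = m < d
  \;&\Longrightarrow\; \lambda^{d}(\mathcal{M}) = 0 \\
\Pr\bigl[\, h + \alpha v \in \mathcal{M} \,\bigr]
  &= \Pr\bigl[\, v \in \alpha^{-1}(\mathcal{M} - h) \,\bigr]
   = \int_{\alpha^{-1}(\mathcal{M} - h)} \rho(x)\, dx
   = 0, \qquad \alpha \neq 0
\end{align*}
```

The set $\alpha^{-1}(\mathcal{M} - h)$ is an affine image of $\mathcal{M}$, so it is also a null set, and the integral of a density over a null set is zero. A steering vector computed deterministically from data does not literally have a density, so this version is an idealization of the argument.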

Circularity Check

0 steps flagged

No circularity: non-surjectivity follows from manifold analysis under stated assumptions

The central result is a mathematical proof that additive steering maps the residual stream outside the image of the discrete-prompt forward pass, under practical assumptions on the residual-stream manifold and forward-pass properties. This is not obtained by fitting parameters to data, renaming an empirical pattern, or reducing to a self-citation chain; the proof is self-contained once the assumptions are granted. The three-LLM empirical illustrations are presented separately as corroboration and do not enter the derivation. No load-bearing step equates the claimed non-surjectivity to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on unspecified practical assumptions about the model and steering operation that enable the manifold argument.

axioms (1)

domain assumption: Practical assumptions on LLM residual stream and steering operation
The proof is stated to hold under these assumptions, but they are not detailed in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
cs.AI 2026-05 unverdicted novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
