arxiv: 2605.17231 · v1 · pith:PCNGA3X3new · submitted 2026-05-17 · 💻 cs.LG · cs.CL

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: activation steering, Fisher information metric, pullback metric, transformers, language models, optimal direction, geometry of activations

The pith

The pullback Fisher metric provides a closed-form optimal direction for steering activations in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Activation steering methods for language models typically treat activation space as Euclidean. This paper argues that the relevant local geometry is instead the Fisher information metric of the softmax output, pulled back through the Jacobian of the later layers. On GPT-2 this metric deviates from the Euclidean one by more than 97 percent in relative spectral norm, and its effective dimensionality is only 2 to 17 percent of the ambient space. The authors derive a closed-form equation for the minimum-distortion steering direction under this metric. A sympathetic reader would care because the framework accounts for the varying success of prior techniques and, on GPT-2, delivers better control with fewer off-target effects.

Core claim

The paper claims that starting from the pullback Fisher metric, a closed-form steering equation can be derived that identifies the minimum-distortion direction for any target concept at each point, which can be applied iteratively without requiring manifold fitting or data-driven geometry estimation. This framework, called FishBack, also reveals that existing methods implicitly use different approximate metrics whose relative performance is predicted by a spectral diagnostic comparing their cost to the Fisher-optimal cost.
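
The iterative use of a closed-form direction can be sketched on a toy model. This is a minimal illustration, not the paper's implementation: it assumes the "rest of the network" is a single hypothetical linear map `W` from activation to logits, takes the concept constraint to be a unit first-order increase in the target token's log-probability (`c^T v = 1`), and uses the pseudo-inverse form discussed in the referee report. The names `fishback_step`, `steer`, `pinv_psd`, the step size, and the stopping threshold are all illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pinv_psd(G, tol=1e-10):
    """Pseudo-inverse of a symmetric PSD matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(G)
    keep = w > tol * w.max()
    return (U[:, keep] / w[keep]) @ U[:, keep].T

def fishback_step(W, a, target):
    """Closed-form direction: argmin v^T G v  subject to  c^T v = 1."""
    p = softmax(W @ a)
    G = W.T @ (np.diag(p) - np.outer(p, p)) @ W   # pullback Fisher metric at a
    c = W.T @ (np.eye(len(p))[target] - p)        # gradient of log p[target]
    g = pinv_psd(G) @ c
    return g / (c @ g)

def steer(W, a, target, step=0.05, goal=0.9, max_steps=1000):
    """Re-solve for the direction at every point until p[target] >= goal."""
    a = a.copy()
    for _ in range(max_steps):
        if softmax(W @ a)[target] >= goal:
            break
        a = a + step * fishback_step(W, a, target)
    return a

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))     # hypothetical linear map: logits = W @ a
a0 = rng.normal(size=6)
a1 = steer(W, a0, target=2)
p0, p1 = softmax(W @ a0), softmax(W @ a1)
```

In this linear toy with a full-row-rank `W`, each step raises only the target logit (up to a constant shift), so the relative probabilities of the other tokens stay fixed. A real transformer is nonlinear, so the Jacobian would need to be recomputed at each point and no such exact property should be expected.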

What carries the argument

The pullback Fisher metric: the softmax layer's Fisher information pulled back through the Jacobian of the subsequent layers. It defines the geometry in which minimum-distortion steering directions are found.
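
One way to see why this metric reflects output behavior: for a small perturbation `delta` of the activation, the KL divergence between the original and perturbed output distributions is approximately `0.5 * delta^T G delta`. The sketch below checks this numerically, assuming a hypothetical linear map `W` in place of the later layers (so the Jacobian is `W` itself). The helper names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pullback_fisher(J, p):
    """G = J^T (diag(p) - p p^T) J, with J = d(logits)/d(activation)."""
    return J.T @ (np.diag(p) - np.outer(p, p)) @ J

def kl(p, q):
    """KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))          # toy stand-in for all later layers
a = rng.normal(size=6)               # activation at the steering layer
delta = 1e-3 * rng.normal(size=6)    # small perturbation

p = softmax(W @ a)
G = pullback_fisher(W, p)            # Jacobian of a linear map is W

exact = kl(p, softmax(W @ (a + delta)))
quadratic = 0.5 * delta @ G @ delta  # second-order approximation of the KL
```

The two quantities agree up to third-order terms in the perturbation, which is the standard relationship between the Fisher information and the local behavior of the KL divergence.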

If this is right

  • Existing methods such as CAA, ActAdd, and ITI each implicitly adopt a particular approximate metric.
  • Their performance gaps are quantitatively predicted by the ratio of their implicit metric's cost to the Fisher-optimal cost.
  • Iterative pullback steering outperforms all Euclidean baselines across three verb-morphology concepts and four layers on GPT-2.
  • Off-target KL reductions reach 1.3x to 2.5x relative to Euclidean gradient ascent.
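
The cost-ratio idea in the bullets above can be illustrated as follows. This page does not give the diagnostic's exact definition, so the sketch assumes one plausible reading: take the direction that is optimal under a method's implicit metric `M`, measure its cost under the pullback Fisher metric `G`, and divide by the Fisher-optimal cost. The toy linear model, the constraint vector `c`, and the two stand-in metrics (identity and diagonal) are assumptions for illustration, not the implicit metrics the paper attributes to CAA, ActAdd, or ITI.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pinv_psd(G, tol=1e-10):
    """Pseudo-inverse of a symmetric PSD matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(G)
    keep = w > tol * w.max()
    return (U[:, keep] / w[keep]) @ U[:, keep].T

def direction_under(M_inv, c):
    """Minimiser of v^T M v subject to c^T v = 1, given M^{-1} (or M^+)."""
    g = M_inv @ c
    return g / (c @ g)

def cost_ratio(G, M_inv, c):
    """Fisher cost of the M-optimal direction over the Fisher-optimal cost."""
    v_m = direction_under(M_inv, c)
    v_star = direction_under(pinv_psd(G), c)
    return (v_m @ G @ v_m) / (v_star @ G @ v_star)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))                      # hypothetical linear later layers
a = rng.normal(size=6)
p = softmax(W @ a)
G = W.T @ (np.diag(p) - np.outer(p, p)) @ W      # pullback Fisher metric
c = W.T @ (np.eye(4)[2] - p)                     # gradient of log p[2]

ratio_euclidean = cost_ratio(G, np.eye(6), c)
ratio_diagonal = cost_ratio(G, np.diag(1.0 / (np.diag(G) + 1e-8)), c)
ratio_fisher = cost_ratio(G, pinv_psd(G), c)
```

Under this reading the ratio is at least 1 for any implicit metric and equals 1 when the implicit metric coincides with `G`, so larger values indicate more distortion per unit of concept change.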

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The layer-wise recursive decomposition suggests the metric can be computed efficiently even in deeper transformer stacks without full Jacobian materialization.
  • The low effective dimensionality implies that steering success depends on aligning directions with a small number of dominant modes in the pulled-back geometry.
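
A layer-wise recursion of the kind mentioned in the first bullet can be sketched on a toy network. The example assumes two hypothetical `tanh` layers followed by a linear unembedding, which is far simpler than a transformer block. It shows the backward recursion `G_l = J^T G_{l+1} J` and a matrix-free metric-vector product that never forms `G` or the full Jacobian.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d, V = 5, 4
A1, A2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # two toy tanh layers
W = rng.normal(size=(V, d))                                 # toy unembedding

a0 = rng.normal(size=d)
a1 = np.tanh(A1 @ a0)
a2 = np.tanh(A2 @ a1)
p = softmax(W @ a2)

# Jacobian of x -> tanh(A x) is diag(1 - tanh(A x)^2) @ A.
J1 = (1 - a1**2)[:, None] * A1
J2 = (1 - a2**2)[:, None] * A2

# Backward recursion, starting from the Fisher metric on the logits.
F = np.diag(p) - np.outer(p, p)
G2 = W.T @ F @ W        # metric on a2
G1 = J2.T @ G2 @ J2     # metric on a1
G0 = J1.T @ G1 @ J1     # metric on a0

def metric_vector_product(v):
    """G0 @ v without forming G0: push v forward, apply F, pull back."""
    u = W @ (J2 @ (J1 @ v))
    u = p * u - p * (p @ u)          # F @ u without forming F
    return J1.T @ (J2.T @ (W.T @ u))
```

In an autodiff framework the forward and backward passes here would correspond to Jacobian-vector and vector-Jacobian products, which is what makes the matrix-free form attractive for deeper stacks. Whether this matches the paper's decomposition is not stated on this page.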

Load-bearing premise

The local geometry relevant for activation steering is accurately captured by the Fisher information metric of the softmax layer pulled back through the Jacobian of subsequent layers.

What would settle it

Applying FishBack steering to a new concept and checking whether the resulting change in output distribution matches the predicted minimum-distortion path better than Euclidean methods on GPT-2 layers.

Figures

Figures reproduced from arXiv: 2605.17231 by Jiayi Zhao, Sihan Wang.

Figure 2: Pullback Fisher geometry of GPT-2 intermedi…
Figure 3: Median off-target KL ratio (baseline / Ours) at…
Figure 4: Steering paths at Layer 9 across three verb-morphology concepts.
Figure 5: Steering paths at Layer 3. Our method (dark blue)…
Figure 6: Steering paths at Layer 6. The pattern is consis…
Figure 7: Steering paths at Layer 11. Our method achieves…
Figure 9: Condition number κ(G) vs. off-target KL ratio at P_W(1) = 0.9. The Spearman correlation is not significant (r = −0.09, p = 0.38), likely due to insufficient dynamic range: all condition numbers fall within 10^{6.5}–10^{8}.
Original abstract

Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods -- CAA, ActAdd, ITI, and others -- each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric's cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of $1.3\times$--$2.5\times$ relative to Euclidean gradient ascent and $1.5\times$ relative to CAA at matched concept probability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FishBack, a framework for activation steering in transformers that replaces the Euclidean assumption on activation spaces with the pullback of the softmax-layer Fisher information metric through the Jacobian of subsequent layers. It reports that this geometry deviates from Euclidean by over 97% in relative spectral norm on GPT-2 with effective dimensionality 2-17% of ambient space. From the pullback metric the authors derive a closed-form minimum-distortion steering direction for any target concept that can be applied iteratively without manifold fitting. A layer-wise recursive decomposition is used to interpret existing methods (CAA, ActAdd, ITI) as implicit approximations to this metric, with their performance gaps predicted by a single spectral ratio of implicit to Fisher-optimal cost. Experiments on GPT-2 for three verb-morphology concepts across four layers show 1.3-2.5x off-target KL reduction relative to Euclidean gradient ascent and 1.5x relative to CAA at matched concept probability.

Significance. If the central derivations hold, the work supplies a principled information-geometric foundation for activation steering that directly incorporates the model's output distribution, potentially unifying and improving heuristic methods while reducing off-target effects. The closed-form character and recursive decomposition are notable strengths that could guide more reliable concept editing in large language models.

major comments (2)
  1. [Abstract] Abstract: the claim of a 'closed-form steering equation' yielding a unique 'optimal direction at each point' is undermined by the singularity of the pullback metric. The softmax Fisher matrix F = diag(p) - p p^T has a one-dimensional kernel spanned by the all-ones vector, so G(a) = J(a)^T F(p(a)) J(a) is positive semi-definite with a non-trivial null space. The minimum-distortion problem arg min_v v^T G v subject to a linear concept constraint therefore requires either the Moore-Penrose pseudo-inverse G^+ or explicit regularization; neither is mentioned in the abstract's description of the steering equation.
  2. [Section on spectral diagnostic] Section describing the spectral diagnostic: the ratio of an implicit metric's cost to the Fisher-optimal cost is defined directly from the same quantity used to declare optimality and to evaluate empirical superiority. This construction risks making the 'quantitative prediction' of performance gaps partly tautological rather than an independent diagnostic, weakening the cross-method comparison claim.
minor comments (2)
  1. The abstract reports a deviation of 'over 97% in relative spectral norm' and an 'effective dimensionality of only 2--17%' without defining the precise norm, the baseline Euclidean metric, or the method used to compute effective dimension; a short clarifying sentence or a reference to the relevant equation would improve readability.
  2. All acronyms (CAA, ActAdd, ITI) should be expanded on first use in the main text even if defined in the abstract.
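
The singularity raised in major comment 1 is easy to check numerically. The sketch below uses an arbitrary example distribution `p` and a hypothetical random Jacobian `J`. It confirms that the all-ones vector lies in the kernel of `F = diag(p) - p p^T` and that the pulled-back metric `G = J^T F J` therefore has rank at most `V - 1`.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])    # any strictly positive distribution
F = np.diag(p) - np.outer(p, p)          # softmax Fisher matrix w.r.t. logits

ones = np.ones_like(p)
kernel_residual = F @ ones               # zero vector: ones is in the kernel
eigvals = np.linalg.eigvalsh(F)          # ascending; smallest is numerically 0
rank_F = np.linalg.matrix_rank(F)        # V - 1 when every p_i > 0

# The pulled-back metric inherits the degeneracy: rank(G) <= V - 1.
rng = np.random.default_rng(0)
J = rng.normal(size=(4, 8))              # hypothetical Jacobian, V = 4, d = 8
G = J.T @ F @ J
rank_G = np.linalg.matrix_rank(G)
```

Because `G` is rank-deficient, an ordinary inverse does not exist, which is why the comment asks for a pseudo-inverse or explicit regularization.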

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our geometric framework. We address each major point below and indicate revisions where appropriate to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: the claim of a 'closed-form steering equation' yielding a unique 'optimal direction at each point' is undermined by the singularity of the pullback metric. The softmax Fisher matrix F = diag(p) - p p^T has a one-dimensional kernel spanned by the all-ones vector, so G(a) = J(a)^T F(p(a)) J(a) is positive semi-definite with a non-trivial null space. The minimum-distortion problem arg min_v v^T G v subject to a linear concept constraint therefore requires either the Moore-Penrose pseudo-inverse G^+ or explicit regularization; neither is mentioned in the abstract's description of the steering equation.

    Authors: We agree that the abstract should explicitly note how the pullback metric's semi-definiteness is handled. The full derivation in Section 3 uses the Moore-Penrose pseudo-inverse G^+ to obtain the minimum-distortion direction under the linear concept constraint. The solution is well-defined on the range of G and unique on the orthogonal complement of its kernel. We will revise the abstract to state that the closed-form steering equation employs the pseudo-inverse of the pullback Fisher metric.
    Revision: yes

  2. Referee: [Section on spectral diagnostic] Section describing the spectral diagnostic: the ratio of an implicit metric's cost to the Fisher-optimal cost is defined directly from the same quantity used to declare optimality and to evaluate empirical superiority. This construction risks making the 'quantitative prediction' of performance gaps partly tautological rather than an independent diagnostic, weakening the cross-method comparison claim.

    Authors: The spectral ratio is computed a priori from the eigenvalues of the implicit versus Fisher metrics alone, without reference to task-specific performance data. It quantifies the relative distortion cost of each method's implicit geometry and is used to predict the ordering of empirical gaps before any steering experiments are run. We will add explicit language in the relevant section clarifying this separation and noting that the subsequent correlation with observed KL reductions serves as empirical validation rather than part of the definition.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from standard pullback construction


The paper begins with the standard Fisher information matrix at the softmax layer and pulls it back via the Jacobian of subsequent layers to obtain the activation-space metric G(a) = J^T F J. This is a direct, first-principles definition from information geometry and is not defined in terms of any steering outcome or performance gap. The closed-form optimal direction is obtained by solving the quadratic minimization problem induced by this metric under a linear concept constraint, which is a standard Lagrange-multiplier or pseudo-inverse step and does not presuppose the final steering vector. The layer-wise decomposition and the spectral diagnostic (ratio of implicit-metric cost to Fisher-optimal cost) are post-hoc explanatory devices that rank existing Euclidean methods; the reported performance advantages are measured by independent quantities (concept probability and off-target KL divergence) rather than by the diagnostic itself. No load-bearing step reduces to a self-citation, fitted input renamed as prediction, or ansatz smuggled from prior work. The derivation chain therefore remains independent of its claimed outputs.
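
The Lagrange-multiplier step referred to above can be written out explicitly. This is a sketch under stated assumptions: the concept constraint is taken to be $c^\top v = 1$ with $c$ in the range of $G$, and the symbols $c$ and $\lambda$ are illustrative rather than the paper's notation.

```latex
\begin{aligned}
&\min_{v}\ \tfrac12\, v^\top G\, v \quad \text{s.t.}\quad c^\top v = 1,\\
&\mathcal{L}(v,\lambda) = \tfrac12\, v^\top G\, v - \lambda\,(c^\top v - 1)
  \ \Rightarrow\ G v = \lambda c,\\
&v = \lambda\, G^{+} c \ \ (\text{minimum-norm solution}),\qquad
  c^\top v = 1 \ \Rightarrow\ \lambda = \frac{1}{c^\top G^{+} c},\\
&v^\star = \frac{G^{+} c}{c^\top G^{+} c},\qquad
  v^{\star\top} G\, v^\star = \frac{1}{c^\top G^{+} c}.
\end{aligned}
```

The last identity uses $G^{+} G\, G^{+} = G^{+}$. If $c$ had a component in the kernel of $G$, the constraint could be met at zero cost by moving along that component, so the range assumption matters.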

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one domain assumption about the appropriateness of the Fisher metric; no explicit free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • Domain assumption: the local geometry induced by the model's output behavior is given by the Fisher information metric of the softmax layer pulled back through the Jacobian of subsequent layers.
    This premise is stated in the second sentence of the abstract as the correct replacement for the Euclidean assumption.
invented entities (1)
  • FishBack framework (no independent evidence)
    purpose: Name for the closed-form steering procedure based on the pullback metric.
    New label introduced for the derived method; no independent evidence supplied.

