FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

Jiayi Zhao; Sihan Wang

arXiv: 2605.17231 · v1 · pith:PCNGA3X3new · submitted 2026-05-17 · cs.LG · cs.CL

Pith reviewed 2026-05-20 14:49 UTC · model grok-4.3

keywords: activation steering · Fisher information metric · pullback metric · transformers · language models · optimal direction · geometry of activations

The pith

The pullback Fisher metric provides a closed-form optimal direction for steering activations in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Activation steering methods for language models typically treat activation space as Euclidean. This paper argues that the relevant local geometry is instead the Fisher information metric of the softmax output, pulled back through the Jacobian of the later layers. On GPT-2, this metric deviates from the Euclidean one by over 97% in relative spectral norm and has an effective dimensionality of only 2-17% of the ambient space. The authors derive a closed-form equation for the minimum-distortion steering direction under this metric. A sympathetic reader would care because the framework accounts for the uneven success of prior techniques and, on GPT-2, steers with fewer off-target effects at matched concept probability.

Core claim

The paper claims that starting from the pullback Fisher metric, a closed-form steering equation can be derived that identifies the minimum-distortion direction for any target concept at each point, which can be applied iteratively without requiring manifold fitting or data-driven geometry estimation. This framework, called FishBack, also reveals that existing methods implicitly use different approximate metrics whose relative performance is predicted by a spectral diagnostic comparing their cost to the Fisher-optimal cost.

What carries the argument

The pullback Fisher metric obtained by pulling the softmax layer's Fisher information back through the Jacobians of subsequent layers, which defines the geometry used to find minimum-distortion steering directions.
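
The construction described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes a toy linear map `W` standing in for all later layers (so the Jacobian is constant), a hypothetical concept direction `q` taken as the gradient of a logit difference, and a hypothetical helper `sym_pinv`. It uses the pseudo-inverse form $\delta h^* = \rho\, G^+ q / (q^\top G^+ q)$ quoted further down this page.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sym_pinv(G, rel_tol=1e-10):
    """Pseudo-inverse of a symmetric PSD matrix via its eigendecomposition."""
    w, U = np.linalg.eigh(G)
    keep = w > rel_tol * w.max()          # drop numerically zero eigenvalues
    return (U[:, keep] / w[keep]) @ U[:, keep].T

rng = np.random.default_rng(0)
vocab, d = 5, 8                           # toy sizes (assumption)
W = rng.normal(size=(vocab, d))           # toy stand-in for the later layers
h = 0.3 * rng.normal(size=d)              # activation being steered

p = softmax(W @ h)
F = np.diag(p) - np.outer(p, p)           # softmax Fisher matrix in logit space
J = W                                     # Jacobian of logits w.r.t. h
G = J.T @ F @ J                           # pullback metric on activation space

# Hypothetical concept: raise logit 0 relative to logit 1.
q = J.T @ (np.eye(vocab)[0] - np.eye(vocab)[1])
rho = 0.1                                 # target first-order concept change

G_pinv = sym_pinv(G)
delta_fisher = rho * G_pinv @ q / (q @ G_pinv @ q)
delta_euclid = rho * q / (q @ q)          # Euclidean direction, same constraint

cost = lambda v: 0.5 * v @ G @ v          # second-order KL distortion
print("Fisher cost:", cost(delta_fisher), "Euclidean cost:", cost(delta_euclid))
```

Both directions satisfy the same linear constraint $q^\top \delta = \rho$, so the comparison isolates the distortion cost under $G$. In a real transformer the Jacobian depends on the activation, which is why the paper applies the step iteratively.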

If this is right

Existing methods such as CAA, ActAdd, and ITI each implicitly adopt a particular approximate metric.
Their performance gaps are quantitatively predicted by the ratio of their implicit metric's cost to the Fisher-optimal cost.
Iterative pullback steering outperforms all Euclidean baselines across three verb-morphology concepts and four layers on GPT-2.
Off-target KL reductions reach 1.3x to 2.5x relative to Euclidean gradient ascent.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The layer-wise recursive decomposition suggests the metric can be computed efficiently even in deeper transformer stacks without full Jacobian materialization.
The low effective dimensionality implies that steering success depends on aligning directions with a small number of dominant modes in the pulled-back geometry.

Load-bearing premise

The local geometry relevant for activation steering is accurately captured by the Fisher information metric of the softmax layer pulled back through the Jacobian of subsequent layers.
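
One way to see why the softmax Fisher matrix is the natural local metric is that it is the second-order term of the KL divergence between nearby output distributions. The sketch below checks this numerically in logit space; the vector sizes, random seed, and perturbation scale are arbitrary choices made here for illustration. In a full model a logit perturbation is approximately $J\,\delta h$, which gives the pulled-back form $\tfrac12\,\delta h^\top J^\top F J\,\delta h$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
z = rng.normal(size=6)                    # toy logits
dz = 1e-3 * rng.normal(size=6)            # small logit perturbation

p, p_new = softmax(z), softmax(z + dz)
F = np.diag(p) - np.outer(p, p)           # softmax Fisher matrix

kl_exact = float(np.sum(p * np.log(p / p_new)))
kl_quadratic = float(0.5 * dz @ F @ dz)   # second-order approximation
rel_err = abs(kl_exact - kl_quadratic) / kl_quadratic
print(kl_exact, kl_quadratic, rel_err)
```

The relative error shrinks with the size of the perturbation, consistent with a cubic remainder term.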

What would settle it

Applying FishBack steering to a new concept and checking whether the resulting change in output distribution matches the predicted minimum-distortion path better than Euclidean methods on GPT-2 layers.

Figures

Figures reproduced from arXiv: 2605.17231 by Jiayi Zhao, Sihan Wang.

**Figure 3.** Median off-target KL ratio (baseline / Ours) at …

**Figure 4.** Steering paths at Layer 9 across three verb-morphology concepts.

**Figure 5.** Steering paths at Layer 3. Our method (dark blue) …

**Figure 6.** Steering paths at Layer 6. The pattern is consis…

**Figure 7.** Steering paths at Layer 11. Our method achieves …

**Figure 9.** Condition number $\kappa(G)$ vs. off-target KL ratio at $P_W(1) = 0.9$. The Spearman correlation is not significant ($r = -0.09$, $p = 0.38$), likely due to insufficient dynamic range: all condition numbers fall within $10^{6.5}$--$10^{8}$.

read the original abstract

Activation steering methods modify intermediate representations of language models to control output behavior, but universally assume the activation space is Euclidean. We show this assumption fails drastically: the local geometry induced by the model's own output behavior -- the Fisher information metric of the softmax layer, pulled back through the Jacobian of subsequent layers -- deviates from the Euclidean metric by over 97% in relative spectral norm on GPT-2, with an effective dimensionality of only 2--17% of the ambient space. From this pullback Fisher metric, we derive a closed-form steering equation that identifies the minimum-distortion direction for any target concept, yielding a closed-form optimal direction at each point that can be applied iteratively without manifold fitting or data-driven geometry estimation. We call the resulting framework FishBack. The metric admits a layer-wise recursive decomposition, which reveals that existing methods -- CAA, ActAdd, ITI, and others -- each implicitly adopt a particular approximate metric, and that their performance gaps are quantitatively predicted by a single spectral diagnostic: the ratio of their implicit metric's cost to the Fisher-optimal cost. On GPT-2, iterative pullback steering consistently outperforms all Euclidean baselines across three verb-morphology concepts and four layers, with off-target KL reductions of $1.3\times$--$2.5\times$ relative to Euclidean gradient ascent and $1.5\times$ relative to CAA at matched concept probability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FishBack derives a pullback Fisher metric for activation steering that unifies prior Euclidean methods, but the singular softmax Fisher matrix leaves the closed-form direction under-specified without explicit regularization.

The main takeaway is that this paper replaces the usual Euclidean assumption in activation steering with the pullback of the softmax Fisher information metric through the remaining layers. That produces a closed-form steering vector at each point and a layer-wise decomposition that recasts CAA, ActAdd, and ITI as different approximate metrics whose relative costs are measured by one spectral ratio. On GPT-2 the iterative version cuts off-target KL by 1.3-2.5x compared with gradient ascent and 1.5x compared with CAA at matched concept strength, which is the concrete empirical result worth checking first.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FishBack, a framework for activation steering in transformers that replaces the Euclidean assumption on activation spaces with the pullback of the softmax-layer Fisher information metric through the Jacobian of subsequent layers. It reports that this geometry deviates from Euclidean by over 97% in relative spectral norm on GPT-2 with effective dimensionality 2-17% of ambient space. From the pullback metric the authors derive a closed-form minimum-distortion steering direction for any target concept that can be applied iteratively without manifold fitting. A layer-wise recursive decomposition is used to interpret existing methods (CAA, ActAdd, ITI) as implicit approximations to this metric, with their performance gaps predicted by a single spectral ratio of implicit to Fisher-optimal cost. Experiments on GPT-2 for three verb-morphology concepts across four layers show 1.3-2.5x off-target KL reduction relative to Euclidean gradient ascent and 1.5x relative to CAA at matched concept probability.

Significance. If the central derivations hold, the work supplies a principled information-geometric foundation for activation steering that directly incorporates the model's output distribution, potentially unifying and improving heuristic methods while reducing off-target effects. The closed-form character and recursive decomposition are notable strengths that could guide more reliable concept editing in large language models.

major comments (2)

[Abstract] Abstract: the claim of a 'closed-form steering equation' yielding a unique 'optimal direction at each point' is undermined by the singularity of the pullback metric. The softmax Fisher matrix F = diag(p) - p p^T has a one-dimensional kernel spanned by the all-ones vector, so G(a) = J(a)^T F(p(a)) J(a) is positive semi-definite with a non-trivial null space. The minimum-distortion problem arg min_v v^T G v subject to a linear concept constraint therefore requires either the Moore-Penrose pseudo-inverse G^+ or explicit regularization; neither is mentioned in the abstract's description of the steering equation.
[Section on spectral diagnostic] Section describing the spectral diagnostic: the ratio of an implicit metric's cost to the Fisher-optimal cost is defined directly from the same quantity used to declare optimality and to evaluate empirical superiority. This construction risks making the 'quantitative prediction' of performance gaps partly tautological rather than an independent diagnostic, weakening the cross-method comparison claim.
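
The singularity raised in the first major comment is easy to reproduce. The sketch below is an illustration under assumptions, not the paper's procedure: `J` is a random matrix standing in for the Jacobian of the later layers, `q` is a hypothetical concept direction chosen to lie in the range of $G$, and the ridge strength `1e-8` is arbitrary. It checks that the all-ones vector lies in the kernel of $F$, and that the pseudo-inverse and a small ridge regularization give nearly the same steering direction in this setting.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, d = 6, 10                            # toy sizes (assumption)
logits = 0.5 * rng.normal(size=vocab)
p = np.exp(logits) / np.exp(logits).sum()
F = np.diag(p) - np.outer(p, p)             # softmax Fisher matrix
J = rng.normal(size=(vocab, d))             # hypothetical Jacobian of later layers
G = J.T @ F @ J                             # pullback metric, PSD and singular

kernel_residual = np.abs(F @ np.ones(vocab)).max()   # ~0: F annihilates the ones vector
rank_F = np.linalg.matrix_rank(F, tol=1e-10)
rank_G = np.linalg.matrix_rank(G, tol=1e-10)

q = J.T @ (np.eye(vocab)[0] - np.eye(vocab)[1])      # lies in range(G)
rho = 1.0

def steer(M_inv_q):
    """Scale a candidate direction so that q^T delta = rho."""
    return rho * M_inv_q / (q @ M_inv_q)

delta_pinv = steer(np.linalg.pinv(G, rcond=1e-10) @ q)
delta_ridge = steer(np.linalg.solve(G + 1e-8 * np.eye(d), q))
gap = np.linalg.norm(delta_pinv - delta_ridge) / np.linalg.norm(delta_pinv)
print(rank_F, rank_G, gap)
```

The two choices agree here because `q` has no component in the kernel of `G`; for a concept direction with a kernel component they would differ, which is the under-specification the referee points to.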

minor comments (2)

The abstract states a deviation of 'over 97% in relative spectral norm' and an 'effective dimensionality of only 2--17%' without defining the precise norm, the baseline Euclidean metric, or the method used to compute effective dimension; a short clarifying sentence or a reference to the relevant equation would improve readability.
All acronyms (CAA, ActAdd, ITI) should be expanded on first use in the main text even if defined in the abstract.
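
To make the first minor comment concrete, here is one possible pair of definitions. Both are assumptions introduced here for illustration and may not match the paper: effective dimension is taken as the participation ratio of the eigenvalues, and the relative spectral deviation compares a trace-normalized metric with the identity. The function names are hypothetical.

```python
import numpy as np

def participation_ratio(G):
    """Effective dimension as (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    w = np.clip(np.linalg.eigvalsh(G), 0.0, None)
    return w.sum() ** 2 / (w ** 2).sum()

def relative_spectral_deviation(G):
    """Spectral-norm distance from the identity after matching traces (assumed definition)."""
    d = G.shape[0]
    G_scaled = G * d / np.trace(G)          # same trace as the d x d identity
    diff = np.linalg.norm(G_scaled - np.eye(d), ord=2)
    return diff / np.linalg.norm(G_scaled, ord=2)

u = np.array([1.0, 2.0, 0.0, -1.0])
G_rank1 = np.outer(u, u)                    # extreme low-rank example
print(participation_ratio(np.eye(4)), participation_ratio(G_rank1))
print(relative_spectral_deviation(np.eye(4)), relative_spectral_deviation(G_rank1))
```

Under these definitions the identity has full effective dimension and zero deviation, while a rank-one metric has effective dimension 1 and a large deviation. Other reasonable definitions would give different numbers, which is why the referee asks for the paper's own to be stated.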

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of our geometric framework. We address each major point below and indicate revisions where appropriate to strengthen the manuscript.

Referee: [Abstract] Abstract: the claim of a 'closed-form steering equation' yielding a unique 'optimal direction at each point' is undermined by the singularity of the pullback metric. The softmax Fisher matrix F = diag(p) - p p^T has a one-dimensional kernel spanned by the all-ones vector, so G(a) = J(a)^T F(p(a)) J(a) is positive semi-definite with a non-trivial null space. The minimum-distortion problem arg min_v v^T G v subject to a linear concept constraint therefore requires either the Moore-Penrose pseudo-inverse G^+ or explicit regularization; neither is mentioned in the abstract's description of the steering equation.

Authors: We agree that the abstract should explicitly note the handling of the pullback metric's semi-definiteness. The full derivation in Section 3 uses the Moore-Penrose pseudo-inverse G^+ to obtain the minimum-distortion direction under the linear concept constraint, which is well-defined on the range of G and yields a unique solution in the quotient space orthogonal to the kernel. We will revise the abstract to state that the closed-form steering equation employs the pseudo-inverse of the pullback Fisher metric.

revision: yes

Referee: [Section on spectral diagnostic] Section describing the spectral diagnostic: the ratio of an implicit metric's cost to the Fisher-optimal cost is defined directly from the same quantity used to declare optimality and to evaluate empirical superiority. This construction risks making the 'quantitative prediction' of performance gaps partly tautological rather than an independent diagnostic, weakening the cross-method comparison claim.

Authors: The spectral ratio is computed a priori from the eigenvalues of the implicit versus Fisher metrics alone, without reference to task-specific performance data. It quantifies the relative distortion cost of each method's implicit geometry and is used to predict the ordering of empirical gaps before any steering experiments are run. We will add explicit language in the relevant section clarifying this separation and noting that the subsequent correlation with observed KL reductions serves as empirical validation rather than part of the definition.

revision: partial
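
A cost ratio of the kind discussed in this exchange can be sketched as follows. This is a hedged illustration rather than the paper's diagnostic: the Fisher metric `G` is replaced by a synthetic positive-definite matrix with condition number $10^4$, the Euclidean baseline is modelled by the identity as its implicit metric, and the helper names are hypothetical. The paper assigns specific implicit metrics to CAA, ActAdd, and ITI, which are not reproduced here.

```python
import numpy as np

def constrained_direction(M, q, rho=1.0):
    """Minimizer of 0.5 * v^T M v subject to q^T v = rho, for positive-definite M."""
    Minv_q = np.linalg.solve(M, q)
    return rho * Minv_q / (q @ Minv_q)

def cost_ratio(G, M, q):
    """Fisher cost of the M-optimal direction divided by the Fisher-optimal cost."""
    cost = lambda v: 0.5 * v @ G @ v
    return cost(constrained_direction(M, q)) / cost(constrained_direction(G, q))

rng = np.random.default_rng(3)
d = 12
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # random orthogonal basis
G = Q @ np.diag(np.logspace(0, -4, d)) @ Q.T        # synthetic ill-conditioned metric
q = rng.normal(size=d)                              # hypothetical concept direction

ratio_euclid = cost_ratio(G, np.eye(d), q)          # implicit metric = identity
ratio_self = cost_ratio(G, G, q)                    # sanity check, equals 1
closed_form = (q @ G @ q) * (q @ np.linalg.solve(G, q)) / (q @ q) ** 2
print(ratio_euclid, ratio_self, closed_form)
```

For the identity, the ratio reduces to $(q^\top G q)(q^\top G^{-1} q)/(q^\top q)^2$, which is at least 1 by the Cauchy-Schwarz inequality. It depends only on the metrics and the concept direction, not on measured steering outcomes, which is the separation the authors describe.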

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained from standard pullback construction

The paper begins with the standard Fisher information matrix at the softmax layer and pulls it back via the Jacobian of subsequent layers to obtain the activation-space metric G(a) = J^T F J. This is a direct, first-principles definition from information geometry and is not defined in terms of any steering outcome or performance gap. The closed-form optimal direction is obtained by solving the quadratic minimization problem induced by this metric under a linear concept constraint, which is a standard Lagrange-multiplier or pseudo-inverse step and does not presuppose the final steering vector. The layer-wise decomposition and the spectral diagnostic (ratio of implicit-metric cost to Fisher-optimal cost) are post-hoc explanatory devices that rank existing Euclidean methods; the reported performance advantages are measured by independent quantities (concept probability and off-target KL divergence) rather than by the diagnostic itself. No load-bearing step reduces to a self-citation, fitted input renamed as prediction, or ansatz smuggled from prior work. The derivation chain therefore remains independent of its claimed outputs.
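
The Lagrange-multiplier step mentioned in the rationale can be written out as a short worked derivation. This is a sketch under stated assumptions: the distortion cost is $C_G(\delta) = \tfrac12\,\delta^\top G\,\delta$, the concept constraint is $q^\top \delta = \rho$, and $q$ lies in the range of $G$. The multiplier $\mu$ and the kernel component $n$ are symbols introduced here.

```latex
\begin{align*}
\mathcal{L}(\delta,\mu) &= \tfrac12\,\delta^\top G\,\delta - \mu\,(q^\top\delta - \rho) \\
\nabla_\delta \mathcal{L} = 0 \;&\Longrightarrow\; G\,\delta = \mu\,q \\
q \in \operatorname{range}(G) \;&\Longrightarrow\; \delta = \mu\,G^{+}q + n, \quad n \in \ker G \\
q^\top\delta = \rho,\;\; q^\top n = 0 \;&\Longrightarrow\; \mu = \frac{\rho}{q^\top G^{+} q} \\
\delta^{*} &= \frac{\rho\,G^{+}q}{q^\top G^{+}q} \quad (n = 0), \qquad
C_G(\delta^{*}) = \frac{\rho^{2}}{2\,q^\top G^{+} q}
\end{align*}
```

The kernel component $n$ changes neither the cost nor the constraint, so the solution is unique only up to $\ker G$; setting $n = 0$ picks the representative with minimum Euclidean norm, in line with the pseudo-inverse form the authors cite in their rebuttal.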

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on one domain assumption about the appropriateness of the Fisher metric; no explicit free parameters or new physical entities are introduced in the abstract.

axioms (1)

domain assumption: The local geometry induced by the model's output behavior is given by the Fisher information metric of the softmax layer pulled back through the Jacobian of subsequent layers.
This premise is invoked in the second sentence of the abstract as the correct replacement for the Euclidean assumption.

invented entities (1)

FishBack framework (no independent evidence)
purpose: Name for the closed-form steering procedure based on the pullback metric.
New label introduced for the derived method; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5777 in / 1513 out tokens · 81771 ms · 2026-05-20T14:49:00.266269+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

unclear: relation between the paper passage and the cited Recognition theorem.

$D_{\mathrm{KL}}(P_{\lambda_0} \,\|\, P_{f(h_0+\delta h)}) = \tfrac12\,\delta h^\top (J^\top H J)\,\delta h + O(\|\delta h\|^3)$; $\delta h^* = \rho\, G^{+} q / (q^\top G^{+} q)$

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear

unclear: relation between the paper passage and the cited Recognition theorem.

Fisher-Pythagorean excess cost identity: $C_G(\tilde\delta) - C_G(\delta^*_G) = \tfrac12\,(\tilde\delta - \delta^*_G)^\top G\,(\tilde\delta - \delta^*_G)$
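
The excess cost identity quoted above can be checked numerically. The sketch below assumes that the cost is $C_G(\delta) = \tfrac12\,\delta^\top G\,\delta$ and that both directions satisfy the same linear constraint $q^\top \delta = \rho$; the positive-definite matrix `G`, the direction `q`, and the perturbation `r` are random stand-ins chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 7
A = rng.normal(size=(d, d))
G = A.T @ A + 1e-3 * np.eye(d)              # positive-definite stand-in metric
q = rng.normal(size=d)                      # hypothetical concept direction
rho = 0.5

cost = lambda v: 0.5 * v @ G @ v
Ginv_q = np.linalg.solve(G, q)
delta_star = rho * Ginv_q / (q @ Ginv_q)    # optimal direction under G

# Any other feasible direction: add a perturbation orthogonal to q.
r = rng.normal(size=d)
r -= (r @ q) / (q @ q) * q
delta_tilde = delta_star + r

excess = cost(delta_tilde) - cost(delta_star)
diff = delta_tilde - delta_star
quadratic = 0.5 * diff @ G @ diff
print(excess, quadratic)
```

The identity holds because the cross term $\delta^{*\top} G\,(\tilde\delta - \delta^*)$ is proportional to $q^\top(\tilde\delta - \delta^*)$, which vanishes when both directions meet the same constraint.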

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

