Decomposing how prompting steers behavior

Fan L. Cheng; Nikolaus Kriegeskorte

arxiv: 2606.03093 · v1 · pith:5RK3ZP3Enew · submitted 2026-06-02 · 💻 cs.AI

Pith reviewed 2026-06-28 10:36 UTC · model grok-4.3

classification 💻 cs.AI

keywords prompting · representational geometry · large language models · vision-language models · affine transformation · activation intervention · task structure

The pith

Prompts steer LLMs and VLMs by applying affine transformations that mix dimensions to recover instructed task geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a nested geometric decomposition that aligns representations of the same stimuli under different prompts using maps of increasing expressivity, then tests each map by causally swapping activations in a single layer. It finds that translation and rigid transformations capture much of the prompt-induced change and improve behavior, but only affine transformations nearly recover the target task geometry and produce matching behavioral gains. This matters because it isolates the geometric mechanism by which instructions reorganize internal states without changing weights, across multiple models and tasks. A sympathetic reader would see this as evidence that cross-dimensional linear mixing is the operative step in prompt-driven reorganization.

Core claim

Across three LLMs, three VLMs, and six datasets, prompts reshape hidden-state geometry toward the instructed task structure. Cross-validated variance shows most activation change is captured by shape-preserving maps, with tier profiles varying by model and task. Although earlier tiers improve behavioral agreement, affine transformation is the first to nearly recover target-prompt task geometry and yields corresponding behavioral gains, indicating that cross-dimensional linear mixing is a key mechanism by which prompts reorganize representations.

What carries the argument

A nested geometric decomposition framework. Stimulus-invariant maps of increasing expressivity (translation, rigid with uniform scaling, sequential axis scaling, affine, nonlinear) are fitted to align prompt-A and prompt-B representations. Each map is then tested causally by replacing single-layer activations and measuring how much of the target geometry and behavior is recovered.
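To make the tiers concrete, here is a minimal NumPy sketch of three of them (translation, rigid with uniform scaling, affine) fitted on training stimuli and scored on held-out stimuli. Everything in it is an illustrative assumption rather than the authors' code: the function names, the synthetic data, and the score `explained_change` (the fraction of the prompt-induced change ||X_B − X_A||² that a map removes). The sequential axis-scaling and nonlinear tiers are left out because this summary does not specify them.

```python
import numpy as np

def fit_translation(XA, XB):
    """Tier 1: one offset vector shared by all stimuli."""
    muA, muB = XA.mean(axis=0), XB.mean(axis=0)
    return lambda X: (X - muA) + muB

def fit_similarity(XA, XB):
    """Tier 2: rotation/reflection + uniform scale + offset (orthogonal Procrustes)."""
    muA, muB = XA.mean(axis=0), XB.mean(axis=0)
    A, B = XA - muA, XB - muB
    U, S, Vt = np.linalg.svd(A.T @ B, full_matrices=False)
    Q = U @ Vt                       # orthogonal matrix minimizing ||A Q - B||_F
    s = S.sum() / (A ** 2).sum()     # optimal uniform scale given Q
    return lambda X: s * (X - muA) @ Q + muB

def fit_affine(XA, XB):
    """Affine tier: unconstrained linear map plus offset (ordinary least squares)."""
    muA, muB = XA.mean(axis=0), XB.mean(axis=0)
    W, *_ = np.linalg.lstsq(XA - muA, XB - muB, rcond=None)
    return lambda X: (X - muA) @ W + muB

def explained_change(XA, XB, T):
    """Assumed score: 1 - ||XB - T(XA)||^2 / ||XB - XA||^2."""
    return 1.0 - ((XB - T(XA)) ** 2).sum() / ((XB - XA) ** 2).sum()

# Synthetic stand-in data (NOT from the paper): XB is a noisy affine image of XA.
rng = np.random.default_rng(0)
n, d = 200, 5
XA = rng.normal(size=(n, d))
XB = XA @ rng.normal(size=(d, d)) + rng.normal(size=d) + 0.01 * rng.normal(size=(n, d))
train, test = slice(0, 150), slice(150, None)

held_out = {}
for name, fit in [("translation", fit_translation),
                  ("similarity", fit_similarity),
                  ("affine", fit_affine)]:
    T = fit(XA[train], XB[train])    # parameters come from training stimuli only
    held_out[name] = explained_change(XA[test], XB[test], T)
    print(f"{name:12s} held-out explained change: {held_out[name]:.3f}")
```

Each tier contains the previous one as a special case, so training error can only go down as expressivity grows; the held-out score is what indicates whether the extra parameters generalize. With real hidden states the dimension can exceed the number of stimuli, in which case the least-squares affine fit would need regularization or dimensionality reduction, which this sketch omits.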

If this is right

Prompts consistently reshape representations toward instructed task structure across text and image domains.
Much prompt-induced activation change is captured by shape-preserving maps such as translation and rigid transformation.
Tier profiles reveal model- and task-specific routing strategies across layers.
Affine transformation produces the largest gains in recovering task geometry and aligning behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same decomposition could be applied to other interventions such as fine-tuning or in-context examples to compare their geometric signatures.
If affine mixing is the critical step, prompt design might be improved by explicitly encouraging cross-dimensional alignment in the instruction.
The framework suggests testing whether similar geometric tiers explain steering in non-prompt settings such as activation editing.

Load-bearing premise

Prompt-induced changes can be fully captured by stimulus-invariant geometric maps applied to hidden states, and causally replacing a single layer's activations isolates each transformation tier without confounding effects from other layers or non-geometric factors.

What would settle it

In a new set of models or tasks, replacing activations with the affine-mapped version fails to recover target-prompt representational geometry or behavioral agreement on held-out stimuli.
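One way to quantify "recovery of representational geometry" is the correlation between representational dissimilarity matrices (RDMs), which the figure captions name as a layerwise measure. The sketch below is an assumption-laden illustration: Euclidean distance and Pearson correlation are choices made here, not details confirmed by this summary, and the data are synthetic.

```python
import numpy as np

def rdm(X):
    """Condensed RDM: pairwise Euclidean distances between stimulus rows."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    iu = np.triu_indices(len(X), k=1)          # upper triangle, no diagonal
    return np.sqrt(np.clip(d2[iu], 0.0, None))

def rdm_correlation(X, Y):
    """Pearson correlation between the condensed RDMs of X and Y."""
    return float(np.corrcoef(rdm(X), rdm(Y))[0, 1])

# Synthetic check (NOT paper data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))       # random orthogonal matrix
similar = 2.5 * X @ Q + rng.normal(size=8)         # rotate, scale uniformly, shift
stretched = X * np.array([10.0] + [1.0] * 7)       # rescale one axis only

print("shape-preserving map:", rdm_correlation(X, similar))
print("axis rescaling:      ", rdm_correlation(X, stretched))
```

Under this measure, translation and rigid maps with uniform scaling leave the RDM unchanged up to a constant factor, so they cannot move the geometry of the replaced layer toward a differently shaped target. That is consistent with the summary's statement that a more expressive tier is needed before task geometry is nearly recovered.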

Figures

Figures reproduced from arXiv: 2606.03093 by Fan L. Cheng, Nikolaus Kriegeskorte.

**Figure 1.** (a) Two prompts A (“Are there people?”) and B (“How many people?”) are presented to the same model together with a stimulus set S. We tap the hidden state at a single transformer layer ℓ (highlighted) to obtain the layer-ℓ manifolds M_A = Φ_A(S) (green) and M_B = Φ_B(S) (purple). The same forward pass continues past ℓ and yields a prompt-specific output per stimulus (“Yes/No” vs. counts). We ask whether a syst…

**Figure 2.** Prompting reshapes representational geometry toward the instructed task structure. (a) Multidimensional scaling (MDS) visualizations of prompt-conditioned representations. (Top) Layer-32 representations from Llama3-8B-Instruct for 1,920 text stories under a topic prompt (Prompt A) and an emotion prompt (Prompt B). (Bottom) Layer-27 representations from LLaVA-OneVision for 1,000 COCO images under a person-d…

**Figure 3.** Nested geometric decomposition of prompt-induced representational maps: translation ( …

**Figure 4.** Comparison of prompt-pair groups and datasets. (a) Cross-validated ΔR², averaged over datasets, models, and layers for each prompt-pair group. Specific prompt-pair groups show distinct decomposition profiles, regardless of whether the paired prompts differ across attributes or within an attribute. (b) Top: ΔR²_Ou, the additional variance explained by rotation/reflection and uniform scaling beyond transla…

**Figure 6.** Normalized prompt-induced activation change across the six prompt-pair groups, pooled …

**Figure 7.** Normalized activation change for EmotionalStory across layers for OPT-2.7B (top), Llama …

**Figure 8.** Normalized activation change for WritingStyle across layers for OPT-2.7B (top), Llama-3-…

**Figure 9.** Normalized activation change for Number across layers for OPT-2.7B (top), Llama-3-8B …

**Figure 10.** Normalized activation change for EmoSet across layers for BLIP-2 (top), LLaVA …

**Figure 11.** Normalized activation change for StyleTransfer across layers for BLIP-2 (top), LLaVA …

**Figure 12.** Normalized activation change for COCO across layers for BLIP-2 (top), LLaVA …

**Figure 13.** EmotionalStory dataset, prompt A (topic) vs. prompt B (emotion); stimuli coloured by ground-truth emotion.

**Figure 14.** WritingStyle dataset, prompt A (topic) vs. prompt B (writing style); stimuli coloured by ground-truth style.

**Figure 15.** Number dataset, prompt A (numbers mentioned) vs. prompt B (cognitive operation); stimuli coloured by ground-truth task framing.

**Figure 16.** EmoSet dataset, prompt A (image content) vs. prompt B (emotion evoked); stimuli coloured by ground-truth emotion.

**Figure 17.** StyleTransfer dataset, prompt A (scene) vs. prompt B (artistic style); stimuli coloured by ground-truth scene (top legend) and style (bottom legend).

**Figure 18.** StyleTransfer dataset (continued), prompt …

**Figure 19.** COCO dataset, prompt A (people detection) vs. prompt B (people count); stimuli coloured by ground-truth detection (top) and count bin (bottom).

**Figure 20.** Layerwise RDM correlation (left) and silhouette score (right) for EmotionalStory (1/2) …

**Figure 21.** Layerwise RDM correlation (left) and silhouette score (right) for EmotionalStory (2/2) …

**Figure 22.** Layerwise RDM correlation (left) and silhouette score (right) for WritingStyle (1/2) under …

**Figure 23.** Layerwise RDM correlation (left) and silhouette score (right) for WritingStyle (2/2) under …

**Figure 24.** Layerwise RDM correlation (left) and silhouette score (right) for Number (1/2) under …

**Figure 25.** Layerwise RDM correlation (left) and silhouette score (right) for Number (2/2) under …

**Figure 26.** Layerwise RDM correlation (left) and silhouette score (right) for EmoSet (1/2) under …

**Figure 27.** Layerwise RDM correlation (left) and silhouette score (right) for EmoSet (2/2) under …

**Figure 28.** Layerwise RDM correlation (left) and silhouette score (right) for StyleTransfer (1/2) under …

**Figure 29.** Layerwise RDM correlation (left) and silhouette score (right) for StyleTransfer (2/2) under …

**Figure 30.** Layerwise RDM correlation (left) and silhouette score (right) for COCO (1/2) under …

**Figure 31.** Layerwise RDM correlation (left) and silhouette score (right) for COCO (2/2) under …

**Figure 32.** Incremental …

**Figure 33.** Incremental …

**Figure 34.** Incremental …

**Figure 35.** Incremental …

**Figure 36.** Incremental …

**Figure 37.** Incremental …

**Figure 38.** Cumulative cross-validated R² of each transformation under prompt paraphrasing, evaluated on the canonical X_B on held-out stimuli for Llama3-8B-Instruct on EmotionalStory (source A = topic, canonical target B = emotion; three paraphrases of emotion).

**Figure 39.** Cumulative cross-validated R² of each transformation under prompt paraphrasing, evaluated on the canonical X_B on held-out stimuli for LLaVA-OneVision-7B on StyleTransfer (source A = scene, canonical target B = style; three paraphrases of style).

**Figure 40.** Out-of-distribution (OOD) generalization of the nested geometric decomposition. Top: …

**Figure 41.** Alignment dimensionality. (top) the centered cross-prompt cross-covariance at the stricter …

Original abstract

Prompting steers large language models (LLMs) and vision-language models (VLMs) without weight updates, but it remains unclear how instruction changes reshape internal representations to produce behavior. We introduce a nested geometric decomposition framework that treats prompting as a transformation of the representational geometry of the content following the prompt. For each prompt pair, we align representations of the same stimuli under two prompts using increasingly expressive stimulus-invariant maps: translation, rigid transformation with uniform scaling, sequential axis scaling, affine transformation, and nonlinear transformation. We then causally test each map by replacing a single layer's prompt-A hidden state for held-out stimuli with its mapped counterpart and measuring recovery of prompt-B representational geometry and behavior. Across three LLMs, three VLMs, and six text or image datasets spanning style, emotion, scene content, and number, prompts consistently reshape representations toward the instructed task structure. Cross-validated variance decomposition shows that much prompt-induced activation change is captured by shape-preserving maps, especially translation and rigid transformation with uniform scaling, while tier profiles reveal model- and task-specific routing strategies across layers. Crucially, although translation and rigid tiers already improve behavioral agreement, affine transformation is the first tier to nearly recover target-prompt task geometry and yields corresponding behavioral gains. This suggests that cross-dimensional linear mixing is a key mechanism by which prompts reorganize representations toward instructed task structure. Our framework decomposes prompt-induced representational change into interpretable geometric components and reveals how models route task-relevant structure to produce prompt-driven behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper decomposes prompt effects via nested geometric maps and single-layer causal swaps, but those swaps do not isolate tier contributions cleanly because downstream layers continue under the original prompt.

The main point is a framework that aligns prompt-A and prompt-B representations with a sequence of stimulus-invariant maps (translation, rigid, axis scaling, affine, nonlinear) and then tests each by swapping one layer's activations on held-out items. The claim is that affine maps recover target task geometry and behavior where earlier tiers fall short.

What is new is the combination of tiered geometric decomposition with causal replacement testing across LLMs and VLMs on style, emotion, scene, and number tasks. The variance decomposition showing that shape-preserving maps capture much of the activation shift is a concrete step beyond simple probing.

The work applies the method consistently to multiple models and datasets, which gives some breadth. The stimulus-invariant fitting and cross-validated checks are reasonable design choices.

The soft spot is the causal intervention itself. Replacing activations at one layer and letting the rest of the network run under the original prompt context means later layers can amplify, distort, or mask the mapped signal. This undercuts the isolation needed to conclude that cross-dimensional linear mixing, rather than earlier tiers or non-geometric computation, is the operative mechanism. The concern applies directly to the method as described.
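The isolation concern can be illustrated with a toy model. In the sketch below, `downstream` is a hypothetical stand-in for the layers after the swap: it reads both the replaced hidden state and a context vector that represents whatever later layers can still access from the original prompt. None of this is the paper's architecture or code; it only shows that a perfect map at the swapped layer need not reproduce prompt-B output when later computation also depends on prompt-A context.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 6, 3
W1 = rng.normal(size=(d, d))   # toy weights, identical under both prompts
C1 = rng.normal(size=(d, d))   # how later layers read the prompt context
W2 = rng.normal(size=(d, k))

def downstream(h, ctx, ctx_weight=1.0):
    """Toy later layers: output depends on hidden state h and prompt context ctx."""
    return np.tanh(h @ W1 + ctx_weight * (ctx @ C1)) @ W2

h_B = rng.normal(size=(10, d))   # layer states under prompt B (10 stimuli)
ctx_A = rng.normal(size=d)       # context left over from prompt A
ctx_B = rng.normal(size=d)       # context of a genuine prompt-B run

patched = h_B.copy()             # idealized case: the fitted map reproduces h_B exactly

out_target = downstream(h_B, ctx_B)       # true prompt-B forward pass
out_patched = downstream(patched, ctx_A)  # swapped state, prompt-A context

gap_with_context = np.abs(out_patched - out_target).max()
gap_no_context = np.abs(
    downstream(patched, ctx_A, ctx_weight=0.0) - downstream(h_B, ctx_B, ctx_weight=0.0)
).max()

print("output gap when later layers read the prompt context:", gap_with_context)
print("output gap when they do not:", gap_no_context)
```

The weights are the same in both runs; only the context differs. In this toy, any behavioral shortfall after a perfect swap is attributable to the context term, which is the confound the note describes.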

The abstract gives no numbers, error bars, or dataset sizes, so the strength of the behavioral gains remains unclear until the full results are checked.

This is for interpretability researchers who want tools to break down prompting. It deserves peer review because the geometric framing and causal angle are distinct enough to warrant scrutiny, even with the isolation issue.

Referee Report

2 major / 2 minor

Summary. The paper introduces a nested geometric decomposition framework that models prompting as stimulus-invariant transformations (translation, rigid+scaling, axis scaling, affine, nonlinear) of hidden-state geometry in LLMs and VLMs. For each prompt pair it fits these maps on training stimuli, then performs causal interventions by replacing a single layer's prompt-A activations for held-out stimuli with the mapped version and measures recovery of prompt-B representational geometry and downstream behavior. Across three LLMs, three VLMs and six datasets it reports that shape-preserving maps capture much of the activation change while affine maps are the first tier to nearly recover target task geometry and produce corresponding behavioral gains, suggesting cross-dimensional linear mixing as a key mechanism.

Significance. If the central claims hold after addressing isolation concerns, the framework supplies a concrete, testable decomposition of prompt effects into interpretable geometric tiers, with the variance-decomposition results and multi-model/multi-task evaluation providing reusable tools for mechanistic interpretability. The explicit causal-intervention design and the finding that affine maps outperform earlier tiers are potentially high-impact contributions.

major comments (2)

[Causal intervention procedure] (abstract and methods) The headline claim that affine transformation is the first tier to nearly recover target-prompt task geometry rests on single-layer replacement. Because subsequent layers continue to process the replaced activations under the original prompt-A context (the weights are identical under both prompts), recovery may be modulated by non-geometric computations downstream; this undermines the isolation needed to attribute behavioral gains specifically to cross-dimensional linear mixing rather than tier interactions or later layers.
[Results on tier profiles and behavioral recovery] The assertion that affine is 'the first tier to nearly recover' and 'yields corresponding behavioral gains' requires explicit quantitative support (effect sizes, cross-validated R² or cosine similarities with confidence intervals, and statistical comparisons across tiers) to be load-bearing for the mechanism claim; without these the routing-strategy conclusions remain under-specified.

minor comments (2)

[Abstract] Quantitative metrics, dataset sizes, number of stimuli, and exclusion criteria are absent, making it impossible to evaluate the strength of the reported findings from the summary alone.
[Methods] Notation for maps: the precise parameterization of each tier (e.g., how uniform scaling is enforced in the rigid tier, how the affine matrix is constrained to be stimulus-invariant) should be stated explicitly with equations to support reproducibility.
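For orientation, one conventional way to write three of the tiers is given below. This is a textbook parameterization offered as an assumption, not the paper's notation; the symbols b, s, Q, and W are introduced here, and the sequential axis-scaling and nonlinear tiers are omitted because this summary does not specify them.

```latex
\begin{aligned}
T_{\text{trans}}(x) &= x + b, \\
T_{\text{sim}}(x)   &= s\,Qx + b, \qquad Q^{\top}Q = I,\; s > 0, \\
T_{\text{aff}}(x)   &= Wx + b, \qquad W \in \mathbb{R}^{d \times d}.
\end{aligned}
```

Stimulus invariance in this notation means that b, s, Q, and W are fitted once per prompt pair and layer and do not depend on the individual stimulus x. Setting Q = I and s = 1 reduces the second line to the first, and setting W = sQ reduces the third to the second, which is what makes the tiers nested.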

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address the two major comments below, agreeing where the concerns are valid and outlining targeted revisions.

Referee: Causal intervention procedure (abstract and methods): the headline claim that affine transformation is the first tier to nearly recover target-prompt task geometry rests on single-layer replacement. Because subsequent layers continue to process the replaced activations under the original prompt-A context (the weights are identical under both prompts), recovery may be modulated by non-geometric computations downstream; this undermines the isolation needed to attribute behavioral gains specifically to cross-dimensional linear mixing rather than tier interactions or later layers.

Authors: We agree that single-layer replacement does not fully isolate the geometric map from downstream processing under the prompt-A context, and this limits strong causal attribution solely to cross-dimensional mixing. The design measures geometry recovery directly at the intervention layer (via cosine similarity or a comparable measure), while behavioral recovery reflects propagation through later layers; tier-wise differences still indicate that affine maps provide a better match to target geometry than earlier tiers. We will revise the methods and discussion to note this limitation explicitly and add a sentence clarifying that full isolation would require multi-layer interventions, which we flag as future work. revision: partial

Referee: Results on tier profiles and behavioral recovery: the assertion that affine is 'the first tier to nearly recover' and 'yields corresponding behavioral gains' requires explicit quantitative support (effect sizes, cross-validated R² or cosine similarities with confidence intervals, and statistical comparisons across tiers) to be load-bearing for the mechanism claim; without these the routing-strategy conclusions remain under-specified.

Authors: We accept that the current presentation of tier profiles would benefit from more explicit quantitative backing. The manuscript already reports cross-validated variance decomposition, but we will expand the results section to include per-tier effect sizes, confidence intervals on cosine similarities and behavioral metrics, and statistical comparisons (e.g., paired tests) across tiers. These additions will be made in the revision to strengthen the mechanism claims. revision: yes
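As an illustration of the kind of tier-to-tier comparison promised above, the sketch below computes a percentile bootstrap confidence interval for the mean paired difference between two tiers' scores. The function name, the choice of bootstrap over a paired test, and the example scores are all assumptions; the scores are synthetic placeholders, not results from the paper.

```python
import numpy as np

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_b - scores_a) over paired units.

    Each unit (e.g. one model/dataset/layer combination) contributes one score
    per tier, so resampling is done over units, keeping the pairing intact.
    """
    diff = np.asarray(scores_b, dtype=float) - np.asarray(scores_a, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diff.mean()), float(lo), float(hi)

# Synthetic placeholder scores (NOT results from the paper).
rng = np.random.default_rng(1)
rigid_scores = rng.uniform(0.3, 0.6, size=30)
affine_scores = rigid_scores + 0.1 + rng.normal(0.0, 0.02, size=30)

mean_gain, ci_lo, ci_hi = paired_bootstrap_ci(rigid_scores, affine_scores)
print(f"mean gain {mean_gain:.3f}, 95% CI [{ci_lo:.3f}, {ci_hi:.3f}]")
```

An interval that excludes zero would support a claim that one tier outperforms another on the chosen metric. What counts as the resampling unit is a design decision that the revision would need to state.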

Circularity Check

0 steps flagged

No significant circularity detected

The paper's framework fits stimulus-invariant geometric maps (translation through nonlinear) to align prompt-A and prompt-B representations, then measures recovery of geometry and behavior after single-layer replacement on held-out stimuli, with cross-validated variance decomposition. These steps are empirical measurements of explained variance and causal effects rather than any reduction of a claimed result to its own fitted inputs by construction, self-definitional equivalence, or load-bearing self-citation. No equations or claims in the abstract or described methods rename known patterns, import uniqueness from prior author work, or smuggle ansatzes; the central claim about affine tiers follows directly from the measured recovery metrics without circular collapse to the fitting procedure itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Based solely on the abstract, the central claim rests on the domain assumption that prompt effects are stimulus-invariant geometric transformations and on fitted parameters for each map tier; no invented entities are introduced.

free parameters (1)

map parameters per tier
Parameters of translation, rigid, scaling, and affine maps are fitted to align prompt-pair representations for each stimulus set.
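The ledger counts the map parameters as a single family, but the number of fitted values differs sharply between tiers. The sketch below gives the standard counts for a hidden size d; the function name and the example value d = 4096 are illustrative assumptions, and how the paper keeps the larger tiers from overfitting is not stated in this summary.

```python
def tier_param_counts(d):
    """Standard free-parameter counts for maps acting on d-dimensional states."""
    return {
        "translation": d,                        # offset vector b
        "similarity": d + d * (d - 1) // 2 + 1,  # offset + orthogonal matrix + scale
        "affine": d + d * d,                     # offset + full d x d matrix
    }

counts = tier_param_counts(4096)  # 4096 is only an illustrative hidden size
for name, k in counts.items():
    print(f"{name:12s} {k:,}")
```

With stimulus sets on the order of a thousand items, as in the Figure 2 caption, the affine tier has far more parameters than there are stimuli, which is one reason held-out evaluation matters for the tier comparison.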

axioms (1)

Domain assumption: prompt effects on representations can be modeled as stimulus-invariant geometric transformations
The nested decomposition framework treats all prompt-induced changes as expressible by the listed sequence of maps applied uniformly across stimuli.

Reference graph

Works this paper leans on

65 extracted references · 13 linked inside Pith

[1]

arXiv preprint arXiv:2308.10248 , year=

Steering language models with activation engineering , author=. arXiv preprint arXiv:2308.10248 , year=

Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2310.01405 , year=

Representation engineering: A top-down approach to ai transparency , author=. arXiv preprint arXiv:2310.01405 , year=

Pith/arXiv arXiv
[3]

Proceedings of the 41st International Conference on Machine Learning , articleno =

Singh, Shashwat and Ravfogel, Shauli and Herzig, Jonathan and Aharoni, Roee and Cotterell, Ryan and Kumaraguru, Ponnurangam , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

2024
[4]

2nd Workshop on Models of Human Feedback for AI Alignment , year=

Angular Steering: Behavior Control via Rotation in Activation Space , author=. 2nd Workshop on Models of Human Feedback for AI Alignment , year=
[5]

Advances in Neural Information Processing Systems , volume=

Inference-time intervention: Eliciting truthful answers from a language model , author=. Advances in Neural Information Processing Systems , volume=
[6]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Householder pseudo-rotation: A novel approach to activation editing in LLMs with direction-magnitude perspective , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[7]

arXiv preprint arXiv:2410.16314 , year=

Steering large language models using conceptors: Improving addition-based activation engineering , author=. arXiv preprint arXiv:2410.16314 , year=

arXiv
[8]

arXiv preprint arXiv:2603.02237 , year=

Concept Heterogeneity-aware Representation Steering , author=. arXiv preprint arXiv:2603.02237 , year=

Pith/arXiv arXiv
[9]

arXiv preprint arXiv:2601.19375 , year=

Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection , author=. arXiv preprint arXiv:2601.19375 , year=

arXiv
[10]

arXiv preprint arXiv:2409.05907 , year=

Programming refusal with conditional activation steering , author=. arXiv preprint arXiv:2409.05907 , year=

arXiv
[11]

arXiv preprint arXiv:2510.04309 , year=

Activation Steering with a Feedback Controller , author=. arXiv preprint arXiv:2510.04309 , year=

Pith/arXiv arXiv
[12]

arXiv preprint arXiv:2602.17560 , year=

ODESteer: A Unified ODE-Based Steering Framework for LLM Alignment , author=. arXiv preprint arXiv:2602.17560 , year=

arXiv
[13]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Sake: Steering activations for knowledge editing , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[14]

arXiv preprint arXiv:2506.07335 , year=

Improving llm reasoning through interpretable role-playing steering , author=. arXiv preprint arXiv:2506.07335 , year=

arXiv
[15]

arXiv preprint arXiv:2511.05408 , year=

Steering Language Models with Weight Arithmetic , author=. arXiv preprint arXiv:2511.05408 , year=

arXiv
[16]

arXiv e-prints , pages=

On the Non-Identifiability of Steering Vectors in Large Lan-guage Models , author=. arXiv e-prints , pages=
[17]

The Fourteenth International Conference on Learning Representations , year=

AlphaSteer: Learning Refusal Steering with Principled Null-Space Constraint , author=. The Fourteenth International Conference on Learning Representations , year=
[18]

arXiv preprint arXiv:2505.20809 , year=

Improved representation steering for language models , author=. arXiv preprint arXiv:2505.20809 , year=

arXiv
[19]

arXiv preprint arXiv:2603.09313 , year=

Curveball Steering: The Right Direction To Steer Isn't Always Linear , author=. arXiv preprint arXiv:2603.09313 , year=

arXiv
[20]

arXiv preprint arXiv:2502.19649 , year=

Taxonomy, opportunities, and challenges of representation engineering for large language models , author=. arXiv preprint arXiv:2502.19649 , year=

arXiv
[21]

Computational Linguistics , year=

The quest for the right mediator: A history, survey, and theoretical grounding of causal mediation in mechanistic interpretability , author=. Computational Linguistics , year=
[22]

ICML , year=

The linear representation hypothesis and the geometry of large language models , author=. ICML , year=
[23]

arXiv preprint arXiv:2502.08009 , year=

The geometry of prompting: Unveiling distinct mechanisms of task adaptation in language models , author=. arXiv preprint arXiv:2502.08009 , year=

arXiv
[24]

arXiv preprint arXiv:2601.22364 , year=

Context Structure Reshapes the Representational Geometry of Language Models , author=. arXiv preprint arXiv:2601.22364 , year=

arXiv
[25]

2025 , url=

Core Francisco Park and Andrew Lee and Ekdeep Singh Lubana and Yongyi Yang and Maya Okawa and Kento Nishi and Martin Wattenberg and Hidenori Tanaka , booktitle=. 2025 , url=

2025
[26]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Unifying Attention Heads and Task Vectors via Hidden State Geometry in In-Context Learning , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[27]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=

Do different prompting methods yield a common task representation in language models? , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year=
[28]

arXiv preprint arXiv:2510.19694 , year=

Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings , author=. arXiv preprint arXiv:2510.19694 , year=

arXiv
[29]

arXiv preprint arXiv:2602.20338 , year=

Emergent Manifold Separability during Reasoning in Large Language Models , author=. arXiv preprint arXiv:2602.20338 , year=

Pith/arXiv arXiv
[30]

Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

In-context learning creates task vectors , author=. Findings of the Association for Computational Linguistics: EMNLP 2023 , pages=

2023
[31]

International conference on learning representations , year=

Function vectors in large language models , author=. International conference on learning representations , year=
[32]

arXiv preprint arXiv:2509.04466 , year=

Just-in-time and distributed task representations in language models , author=. arXiv preprint arXiv:2509.04466 , year=

arXiv
[33]

arXiv preprint arXiv:2509.22518 , year=

REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model , author=. arXiv preprint arXiv:2509.22518 , year=

arXiv
[34]

arXiv preprint arXiv:2505.10571 , year=

On the failure of latent state persistence in large language models , author=. arXiv preprint arXiv:2505.10571 , year=

arXiv
[35]

arXiv preprint arXiv:2603.03308 , year=

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs , author=. arXiv preprint arXiv:2603.03308 , year=

Pith/arXiv arXiv
[36]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Steering llama 2 via contrastive activation addition , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[37]

2024 , publisher=

Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet , author=. 2024 , publisher=

2024
[38]

Zhengxuan Wu and Aryaman Arora and Zheng Wang and Atticus Geiger and Dan Jurafsky and Christopher D Manning and Christopher Potts , booktitle=. Re. 2024 , url=

2024
[39]

The Thirteenth International Conference on Learning Representations , year=

Controlling Language and Diffusion Models by Transporting Activations , author=. The Thirteenth International Conference on Learning Representations , year=
[40]

Advances in Neural Information Processing Systems , volume=

Analysing the generalisation and reliability of steering vectors , author=. Advances in Neural Information Processing Systems , volume=
[41]

arXiv preprint arXiv:2502.02716 , year=

A unified understanding and evaluation of steering methods , author=. arXiv preprint arXiv:2502.02716 , year=

arXiv
[42]

arXiv preprint arXiv:2412.09563 , year=

Does representation matter? exploring intermediate layers in large language models , author=. arXiv preprint arXiv:2412.09563 , year=

arXiv
[43] Layer by Layer: Uncovering Hidden Representations in Language Models. Forty-second International Conference on Machine Learning.
[44] Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2008.
[45] Similarity of neural network representations revisited. International Conference on Machine Learning, 2019.
[46] Generalized Shape Metrics on Neural Representations. Advances in Neural Information Processing Systems, 2021.
[47] Equivalence between representational similarity analysis, centered kernel alignment, and canonical correlations analysis. UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models.
[48] What Representational Similarity Measures Imply about Decodable Information. UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models.
[49] The topology and geometry of neural representations. Proceedings of the National Academy of Sciences, 2024.
[50] Interpreting Style-Content Parsing in Vision-Language Models. NeurIPS 2025 Workshop on CogInterp, 2025.
[51] Quantifying differences in neural population activity with shape metrics. bioRxiv, 2025.
[52] Generalized Procrustes analysis. Psychometrika, 1975.
[53] From Compression to Expression: A Layerwise Analysis of In-Context Learning. arXiv preprint arXiv:2505.17322.
[54] LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
[55] Qwen3-VL technical report. arXiv preprint arXiv:2511.21631.
[56] Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[57] BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning, 2023.
[58] The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[59] OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
[60] EmoSet: A large-scale visual emotion dataset with rich attributes. Proceedings of the IEEE/CVF International Conference on Computer Vision.
[61] The psychophysics of style. Nature Human Behaviour, 2025.
[62] Microsoft COCO: Common objects in context. European Conference on Computer Vision, 2014.
[63] The Representational Geometry of Number. arXiv preprint arXiv:2602.06843.
[64] Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
[65] Emotion concepts and their function in a large language model. arXiv preprint arXiv:2604.07729.
