pith. machine review for the scientific record.

arxiv: 2311.03658 · v2 · submitted 2023-11-07 · 💻 cs.CL · cs.AI · cs.LG · stat.ML

Recognition: no theorem link

The Linear Representation Hypothesis and the Geometry of Large Language Models

Pith reviewed 2026-05-11 21:37 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · stat.ML
keywords linear representation hypothesis · large language models · counterfactuals · causal inner product · linear probing · model steering · representation geometry · LLaMA-2

The pith

High-level concepts in large language models are linear directions under a causal inner product built from counterfactual pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper gives two formal definitions of linear representation, one in the model's output (word) space and one in its input (sentence) space, both using counterfactuals as the key device. It proves that these definitions connect to linear probing for concept detection and to model steering for behavior control, respectively. To make geometric notions well-defined, the authors identify a specific non-Euclidean inner product from the same counterfactual formalization; under this causal inner product, the paper's notions of linear representation unify into one framework. Experiments on LLaMA-2 support the existence of linear representations of concepts and show that the choice of inner product affects both interpretation and control.

Core claim

Using the language of counterfactuals, the authors formalize linear representation first in the output space, as directions that separate counterfactual word pairs, and second in the input space, as directions that separate counterfactual sentence pairs. They prove that these formalizations connect to linear probing and model steering, respectively. They then identify a causal inner product under which geometric operations respect the structure of language; under it the notions of linear representation unify, and both probes and steering vectors can be constructed from counterfactual pairs.

What carries the argument

The causal inner product, a non-Euclidean inner product on representation space derived from counterfactual pairs that makes cosine similarity and projection respect the structure of language.
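
As a rough illustration of what a non-Euclidean inner product changes, the sketch below defines <x, y>_M = x^T M y for a symmetric positive-definite matrix M and compares cosine similarity and projection under M with the Euclidean case. The choice of M (the inverse covariance of toy vectors) is an assumption made for illustration, since this page does not state the paper's formula. The names `inner`, `cosine`, `project`, and `gamma` are hypothetical, and the data are synthetic rather than LLaMA-2 embeddings.

```python
import numpy as np

def inner(x, y, M):
    """Inner product <x, y>_M = x^T M y for a symmetric positive-definite M."""
    return float(x @ M @ y)

def cosine(x, y, M):
    """Cosine similarity measured with <., .>_M instead of the dot product."""
    return inner(x, y, M) / np.sqrt(inner(x, x, M) * inner(y, y, M))

def project(x, d, M):
    """Component of x along direction d, measured with <., .>_M."""
    return (inner(x, d, M) / inner(d, d, M)) * d

rng = np.random.default_rng(0)
dim, vocab = 8, 500

# Toy stand-in for word vectors with strongly anisotropic coordinate scales.
scales = np.linspace(0.2, 5.0, dim)
gamma = rng.normal(size=(vocab, dim)) * scales

# ASSUMPTION (not taken from this page): M is the inverse covariance of the vectors.
M = np.linalg.inv(np.cov(gamma, rowvar=False))
I = np.eye(dim)  # Euclidean baseline

x, y = gamma[0], gamma[1]
print("Euclidean cosine:", cosine(x, y, I))
print("M-cosine:        ", cosine(x, y, M))
print("Euclidean projection of x on y:", project(x, y, I))
print("M-projection of x on y:        ", project(x, y, M))
```

With `M = I` every function reduces to its familiar Euclidean form, so the same code serves as the baseline when comparing geometries.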

If this is right

  • Linear probes for any concept can be built directly from counterfactual pairs without external labeled data.
  • Steering vectors that change model behavior along a concept can be constructed from the same counterfactual pairs.
  • Geometric notions such as projection and similarity become consistent across probing and steering once the causal inner product is used.
  • All existing techniques for finding linear representations become special cases of the single counterfactual framework.
  • The existence of linear representations can be tested by checking whether counterfactual pairs align along a single direction under the causal inner product.
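
The first, second, and last bullets can be sketched on synthetic data. The example below is a minimal sketch under stated assumptions, not the paper's procedure: the concept direction is estimated as the mean of counterfactual pair differences, alignment is checked leave-one-out, the probe is an inner product with that direction, and steering adds a multiple of it. The matrix `M` is set to the identity as a placeholder for whatever inner product is under test, and all names (`true_dir`, `loo_alignment`, `unit`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_pairs = 16, 40

# Hypothetical ground-truth concept direction, used only to generate toy data.
true_dir = np.zeros(dim)
true_dir[0] = 1.0

base = rng.normal(size=(n_pairs, dim))             # one side of each pair
shift = rng.uniform(1.0, 2.0, size=(n_pairs, 1))   # positive per-pair magnitude
noise = 0.1 * rng.normal(size=(n_pairs, dim))
flipped = base + shift * true_dir + noise          # counterfactual side
diffs = flipped - base

M = np.eye(dim)  # ASSUMPTION: placeholder; swap in the inner product under test

def unit(v, M):
    """Rescale v to unit length under <., .>_M."""
    return v / np.sqrt(v @ M @ v)

def loo_alignment(diffs, M):
    """Cosine (under M) of each pair difference with the mean of all the others."""
    total = diffs.sum(axis=0)
    out = []
    for d in diffs:
        rest = unit(total - d, M)
        out.append((d @ M @ rest) / np.sqrt(d @ M @ d))
    return np.array(out)

# Direction estimated from the pairs alone, with no labeled dataset.
direction = unit(diffs.mean(axis=0), M)

# Existence check: do held-out pair differences point along a single direction?
align = loo_alignment(diffs, M)
print("leave-one-out alignment: mean", align.mean(), "min", align.min())

# Probe: score each vector by its inner product with the direction.
probe_base = base @ M @ direction
probe_flipped = flipped @ M @ direction
print("pairs where the probe ranks the counterfactual higher:",
      int((probe_flipped > probe_base).sum()), "of", n_pairs)

# Steering: move each base vector a fixed step along the direction.
steered = base + 1.5 * direction
print("probe shift after steering:", (steered @ M @ direction - probe_base).mean())
```

Because the toy pairs are generated to share one direction, high alignment here only shows that the check behaves as expected; on real model representations the alignment values are the quantity of interest.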

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same counterfactual construction could be applied to detect and edit representations of safety-relevant concepts such as deception or toxicity.
  • If the causal inner product generalizes across model scales, it offers a parameter-free way to transfer probes and steering vectors between models.
  • The framework suggests a route to test whether linearity holds for relational concepts that involve multiple entities rather than single attributes.
  • Extending the construction beyond text to multimodal models would require defining counterfactual pairs across modalities.

Load-bearing premise

That counterfactual pairs for a given concept can be reliably constructed or approximated inside the model so that the resulting causal inner product correctly captures language structure.

What would settle it

A dataset of counterfactual pairs for several concepts where the directions obtained from the causal inner product produce probes whose accuracy does not match the steering effectiveness obtained from the same directions.

Original abstract

Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper formalizes the linear representation hypothesis via two counterfactual definitions—one in output (word) space linking to probing, one in input (sentence) space linking to steering—then derives a non-Euclidean 'causal' inner product under which all linear notions unify, allowing probes and steering vectors to be constructed directly from counterfactual pairs. Experiments on LLaMA-2 are cited to show existence of linear representations, their connection to interpretation/control, and the importance of the inner-product choice.

Significance. If the unification holds, the work supplies a principled, counterfactual-based geometry for LLM representations that could replace ad-hoc Euclidean assumptions in interpretability research. The parameter-free derivation from invariance properties and the explicit link between probing and steering are strengths that would make the framework useful for both theory and downstream control methods.

major comments (3)
  1. [§3] §3 (causal inner product derivation): the unification claim requires that projections and similarities under the new inner product correspond exactly to minimal interventions that flip only the target concept. The manuscript does not supply a direct verification (theoretical or empirical) that the constructed pairs satisfy this minimality condition in the model's actual geometry rather than in an idealized causal model.
  2. [Experiments] Experiments section: results on LLaMA-2 are used to support both existence of linear representations and the role of the inner product, yet the text provides no details on control conditions, baseline inner products, or statistical tests. Without these, it is unclear whether the reported effects are attributable to the causal inner product or to other model properties.
  3. [§4] §4 (counterfactual pair construction): the operational definitions of probing and steering vectors from pairs presuppose that such pairs can be reliably approximated without entanglement or non-local effects. The paper does not report diagnostics (e.g., effect-size checks or ablation on pair quality) that would confirm the pairs behave as assumed minimal interventions.
minor comments (2)
  1. Notation for the causal inner product is introduced without an explicit comparison table to the Euclidean case, making it harder to see at a glance where the two metrics diverge.
  2. [Introduction] A few sentences in the introduction repeat the informal statement of the linear representation hypothesis; tightening would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (causal inner product derivation): the unification claim requires that projections and similarities under the new inner product correspond exactly to minimal interventions that flip only the target concept. The manuscript does not supply a direct verification (theoretical or empirical) that the constructed pairs satisfy this minimality condition in the model's actual geometry rather than in an idealized causal model.

    Authors: Our derivation in §3 proceeds from the counterfactual definitions of linear representations and identifies the causal inner product as the one that respects invariance under minimal interventions on the target concept. By construction within this framework, projections and similarities under the inner product align with the effects of such interventions in the idealized causal model. We agree that a direct empirical verification against the LLM's actual geometry would strengthen the presentation; we will revise §3 to make the theoretical correspondence more explicit and to discuss the idealized assumption as a limitation. revision: partial

  2. Referee: [Experiments] Experiments section: results on LLaMA-2 are used to support both existence of linear representations and the role of the inner product, yet the text provides no details on control conditions, baseline inner products, or statistical tests. Without these, it is unclear whether the reported effects are attributable to the causal inner product or to other model properties.

    Authors: We agree that the current experimental description lacks these details. In the revised manuscript we will expand the Experiments section to describe control conditions (including random counterfactual pairs), explicit comparisons to the Euclidean inner product as a baseline, and statistical tests assessing the significance of the reported effects. revision: yes

  3. Referee: [§4] §4 (counterfactual pair construction): the operational definitions of probing and steering vectors from pairs presuppose that such pairs can be reliably approximated without entanglement or non-local effects. The paper does not report diagnostics (e.g., effect-size checks or ablation on pair quality) that would confirm the pairs behave as assumed minimal interventions.

    Authors: The definitions in §4 rely on the assumption that the constructed pairs approximate minimal interventions, consistent with the formalization developed earlier. We will add to the revised manuscript diagnostics such as effect-size measurements on the interventions and ablations on pair quality to provide empirical support for this assumption. revision: yes
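
A random-pair control of the kind discussed in the responses above could look roughly like the following. This is a hedged sketch on synthetic vectors, not the authors' protocol: the statistic (mean pairwise cosine of pair differences), the null construction (differences of randomly paired vectors), and the names `mean_pairwise_cosine` and `random_pair_diffs` are all assumptions chosen for illustration. `M` is the identity, that is, the Euclidean baseline; a different inner product can be passed in to compare geometries.

```python
import numpy as np

rng = np.random.default_rng(2)
dim, vocab, n_pairs = 16, 400, 30

emb = rng.normal(size=(vocab, dim))   # toy "word" vectors
concept = np.zeros(dim)
concept[0] = 2.0                      # hypothetical concept offset

# Toy counterfactual pairs: each chosen vector plus the offset and small noise.
idx = rng.choice(vocab, size=n_pairs, replace=False)
flipped = emb[idx] + concept + 0.1 * rng.normal(size=(n_pairs, dim))
real_diffs = flipped - emb[idx]

def mean_pairwise_cosine(diffs, M):
    """Average cosine under <x, y>_M over all distinct pairs of difference vectors."""
    gram = diffs @ M @ diffs.T
    norms = np.sqrt(np.diag(gram))
    cos = gram / np.outer(norms, norms)
    n = len(diffs)
    return (cos.sum() - np.trace(cos)) / (n * (n - 1))

def random_pair_diffs(emb, n_pairs, rng):
    """Differences between randomly paired, distinct vectors (the control)."""
    i = rng.integers(0, len(emb), size=n_pairs)
    j = (i + rng.integers(1, len(emb), size=n_pairs)) % len(emb)  # guarantees j != i
    return emb[j] - emb[i]

M = np.eye(dim)  # Euclidean baseline; replace to test another inner product

observed = mean_pairwise_cosine(real_diffs, M)
null = np.array([
    mean_pairwise_cosine(random_pair_diffs(emb, n_pairs, rng), M)
    for _ in range(500)
])

# Empirical one-sided p-value with the usual +1 correction.
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
print("observed alignment:", observed)
print("null mean / max:   ", null.mean(), null.max())
print("empirical p-value: ", p_value)
```

Running the same comparison once per candidate inner product gives a like-for-like baseline, which is the kind of detail the referee's second comment asks for.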

Circularity Check

0 steps flagged

No significant circularity; derivations follow from counterfactual formalizations and vector-space axioms

full rationale

The paper begins with explicit counterfactual-based definitions of linear representation (one in output space, one in input space), then uses standard linear-algebraic arguments to connect them to probing and steering. The causal inner product is constructed precisely to satisfy the invariance properties stated in those definitions, so the subsequent unification and pair-based construction of probes/steering vectors are direct consequences rather than independent predictions. No parameter is fitted to the target result, no uniqueness theorem is imported from the authors' own prior work, and no ansatz is smuggled via citation. The experimental section on LLaMA-2 supplies separate empirical checks that do not enter the formal chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on standard linear algebra and counterfactual semantics from causal inference. No free parameters are introduced in the core theory. The causal inner product is derived rather than fitted. No new physical entities are postulated.

axioms (2)
  • [standard math] Representation spaces are vector spaces over the reals
    Invoked throughout the formalizations of linear representations and inner products.
  • [domain assumption] Counterfactual interventions on inputs and outputs are well-defined for the model
    Central to both formalizations; appears in the definitions connecting to probing and steering.
invented entities (1)
  • causal inner product (no independent evidence)
    Purpose: to define geometry (angles, projections) that respects language structure under counterfactual interventions.
    Derived from the requirement that the inner product be invariant under certain language-preserving transformations; no independent empirical evidence is given beyond the unification it enables.

pith-pipeline@v0.9.0 · 5507 in / 1518 out tokens · 43819 ms · 2026-05-11T21:37:51.886037+00:00 · methodology

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

  3. SLAM: Structural Linguistic Activation Marking for Language Models

    cs.CL 2026-05 unverdicted novelty 8.0

    SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.

  4. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  5. Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

    cs.LG 2026-05 unverdicted novelty 7.0

    Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.

  6. The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It

    cs.LG 2026-05 unverdicted novelty 7.0

    Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.

  7. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  8. Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

    cs.LG 2026-04 conditional novelty 7.0

    Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

  9. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  10. Steering Language Models With Activation Engineering

    cs.CL 2023-08 unverdicted novelty 7.0

    Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

  11. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  12. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  13. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.

  14. A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

  15. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  16. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...

  17. Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...

  18. Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer

    cs.LG 2026-05 unverdicted novelty 6.0

    Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.

  19. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

  20. Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes

    cs.LG 2026-05 unverdicted novelty 6.0

    Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...

  21. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...

  22. Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

    cs.LG 2026-04 unverdicted novelty 6.0

    Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...

  23. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  24. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  25. Rhetorical Questions in LLM Representations: A Linear Probing Study

    cs.CL 2026-04 unverdicted novelty 6.0

    Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.

  26. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

    cs.CL 2026-04 unverdicted novelty 6.0

    Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...

  27. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  28. When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.

  29. The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...

  30. Steering Llama 2 via Contrastive Activation Addition

    cs.CL 2023-12 unverdicted novelty 6.0

    Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

  31. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

  32. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  33. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

  34. H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.

  35. From Weights to Activations: Is Steering the Next Frontier of Adaptation?

    cs.CL 2026-04 unverdicted novelty 4.0

    Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.

  36. There Will Be a Scientific Theory of Deep Learning

    stat.ML 2026-04 unverdicted novelty 2.0

    A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...
