The Linear Representation Hypothesis and the Geometry of Large Language Models
Pith reviewed 2026-05-11 21:37 UTC · model grok-4.3
The pith
High-level concepts in large language models are linear directions under a causal inner product built from counterfactual pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using the language of counterfactuals, the authors formalize linear representation first in the output space as directions that separate counterfactual word pairs, and second in the input space as directions that separate counterfactual sentence pairs. They prove these formalizations recover linear probing and steering, respectively. They then construct a causal inner product under which all geometric operations respect the counterfactual structure of language; this single object unifies probing, steering, and all earlier linear-representation techniques, and it permits direct construction of both probes and steering vectors from counterfactual pairs alone.
What carries the argument
The causal inner product, a non-Euclidean inner product on representation space derived from counterfactual pairs that makes cosine similarity and projection respect the structure of language.
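As a concrete anchor, here is a minimal numpy sketch of one instantiation the paper discusses: take the inner-product matrix M to be the inverse covariance of the unembedding vectors, so that ⟨u, v⟩_C = uᵀMv. The synthetic `gamma` below is a stand-in for a real unembedding matrix (e.g., LLaMA-2's output head); everything downstream is an assumption layered on that choice, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 64, 5000

# Stand-in for the unembedding matrix gamma, one row per vocabulary item;
# in practice this would be read from the model (e.g., LLaMA-2's output head).
gamma = rng.standard_normal((vocab, d)) @ rng.standard_normal((d, d))

# One instantiation discussed in the paper: take M to be the inverse
# covariance of the unembedding vectors, so <u, v>_C = u^T M v.
# Whitening by a square root of M then reduces causal geometry to
# ordinary Euclidean geometry.
cov = np.cov(gamma, rowvar=False)
M = np.linalg.inv(cov + 1e-6 * np.eye(d))   # small ridge for numerical stability

def causal_inner(u, v):
    return u @ M @ v

def causal_cosine(u, v):
    return causal_inner(u, v) / np.sqrt(causal_inner(u, u) * causal_inner(v, v))

u, v = rng.standard_normal(d), rng.standard_normal(d)
print(f"causal cosine of two random directions: {causal_cosine(u, v):.3f}")
```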
If this is right
- Linear probes for any concept can be built directly from counterfactual pairs without external labeled data.
- Steering vectors that change model behavior along a concept can be constructed from the same counterfactual pairs.
- Geometric notions such as projection and similarity become consistent across probing and steering once the causal inner product is used.
- All existing techniques for finding linear representations become special cases of the single counterfactual framework.
- The existence of linear representations can be tested by checking whether counterfactual pairs align along a single direction under the causal inner product (a sketch of this construction and test follows the list).
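A minimal sketch of the first, second, and fifth points under stated assumptions: the concept direction, the probe, and the steering vector all come from the same counterfactual pairs, and the alignment of individual pair differences is the linearity test. The pair data here are synthetic stand-ins (`true_dir`, the noise scale, and the identity choice for `M` are assumptions); with a real model one would substitute pair representations and the causal inner-product matrix from the sketch above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 64, 200

# Hypothetical counterfactual pairs for one concept: each row of `pos`
# differs from the matching row of `neg` only in the target concept
# (synthetic stand-ins; `true_dir` plays the role of the concept direction).
true_dir = rng.standard_normal(d)
neg = rng.standard_normal((n, d))
pos = neg + true_dir + 0.1 * rng.standard_normal((n, d))

M = np.eye(d)  # swap in a causal inner-product matrix for real use

# Concept direction and probe threshold, from the pairs alone.
diffs = pos - neg
concept = diffs.mean(axis=0)
mu = 0.5 * (pos.mean(axis=0) + neg.mean(axis=0))

def probe(x):             # linear probe: sign of the M-projection onto the direction
    return (x - mu) @ M @ concept > 0

def steer(x, alpha=1.0):  # steering: step along the same direction
    return x + alpha * concept

# Linearity test: do individual pair differences align with one direction?
L = np.linalg.cholesky(M)   # M = L @ L.T, so <u, v>_M = (u @ L) . (v @ L)
z, c = diffs @ L, concept @ L
cosines = (z @ c) / (np.linalg.norm(z, axis=1) * np.linalg.norm(c))

acc = 0.5 * (probe(pos).mean() + (~probe(neg)).mean())
flip = probe(steer(neg)).mean()   # steered negatives now read as positive
print(f"probe accuracy {acc:.2f}, steering flip rate {flip:.2f}, "
      f"mean pair alignment {cosines.mean():.2f}")
```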
Where Pith is reading between the lines
- The same counterfactual construction could be applied to detect and edit representations of safety-relevant concepts such as deception or toxicity.
- If the causal inner product generalizes across model scales, it offers a parameter-free way to transfer probes and steering vectors between models.
- The framework suggests a route to test whether linearity holds for relational concepts that involve multiple entities rather than single attributes.
- Extending the construction beyond text to multimodal models would require defining counterfactual pairs across modalities.
Load-bearing premise
That counterfactual pairs for a given concept can be reliably constructed or approximated inside the model so that the resulting causal inner product correctly captures language structure.
What would settle it
A dataset of counterfactual pairs for several concepts on which the directions obtained from the causal inner product yield probes whose accuracy systematically diverges from the steering effectiveness of those same directions; such a divergence would break the claimed unification.
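As a protocol, that settling experiment could take the following shape: estimate one direction per concept from its counterfactual pairs, score the same direction both as a probe and as a steering vector, and ask whether the two scores co-vary. Everything below is a synthetic stand-in (the noise sweep plays the role of varying pair quality across concepts; `concept_scores` is a hypothetical helper, not the authors' code). The framework predicts a strongly positive correlation; a real dataset where it breaks would be the refuting observation.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64

def concept_scores(noise, n_pairs=8, n_eval=500):
    """Probe accuracy and steering effectiveness for one synthetic concept.

    `noise` controls how far the counterfactual pairs are from ideal
    minimal interventions; both scores come from the SAME estimated
    direction, as the unified framework prescribes.
    """
    w = rng.standard_normal(d)                 # ground-truth concept direction
    base = rng.standard_normal((n_pairs, d))
    pos = base + w + noise * rng.standard_normal((n_pairs, d))
    direction = (pos - base).mean(axis=0)
    direction /= np.linalg.norm(direction)

    # Probe accuracy on held-out examples.
    ev = rng.standard_normal((n_eval, d))
    ev_pos = ev + w + noise * rng.standard_normal((n_eval, d))
    mu = 0.5 * (ev_pos.mean(axis=0) + ev.mean(axis=0))
    acc = 0.5 * (((ev_pos - mu) @ direction > 0).mean()
                 + ((ev - mu) @ direction <= 0).mean())

    # Steering effectiveness: per-unit-step change in the ground-truth
    # concept readout x @ w / |w| when moving along the estimated direction.
    effect = direction @ w / np.linalg.norm(w)
    return acc, effect

scores = np.array([concept_scores(nz) for nz in np.linspace(0.1, 4.0, 15)])
r = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
print(f"probe accuracy vs steering effectiveness across concepts: r = {r:.2f}")
```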
Original abstract
Informally, the 'linear representation hypothesis' is the idea that high-level concepts are represented linearly as directions in some representation space. In this paper, we address two closely related questions: What does "linear representation" actually mean? And, how do we make sense of geometric notions (e.g., cosine similarity or projection) in the representation space? To answer these, we use the language of counterfactuals to give two formalizations of "linear representation", one in the output (word) representation space, and one in the input (sentence) space. We then prove these connect to linear probing and model steering, respectively. To make sense of geometric notions, we use the formalization to identify a particular (non-Euclidean) inner product that respects language structure in a sense we make precise. Using this causal inner product, we show how to unify all notions of linear representation. In particular, this allows the construction of probes and steering vectors using counterfactual pairs. Experiments with LLaMA-2 demonstrate the existence of linear representations of concepts, the connection to interpretation and control, and the fundamental role of the choice of inner product.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes the linear representation hypothesis via two counterfactual definitions—one in output (word) space linking to probing, one in input (sentence) space linking to steering—then derives a non-Euclidean 'causal' inner product under which all linear notions unify, allowing probes and steering vectors to be constructed directly from counterfactual pairs. Experiments on LLaMA-2 are cited to show existence of linear representations, their connection to interpretation/control, and the importance of the inner-product choice.
Significance. If the unification holds, the work supplies a principled, counterfactual-based geometry for LLM representations that could replace ad-hoc Euclidean assumptions in interpretability research. The parameter-free derivation from invariance properties and the explicit link between probing and steering are strengths that would make the framework useful for both theory and downstream control methods.
major comments (3)
- [§3, causal inner product derivation] The unification claim requires that projections and similarities under the new inner product correspond exactly to minimal interventions that flip only the target concept. The manuscript does not supply a direct verification (theoretical or empirical) that the constructed pairs satisfy this minimality condition in the model's actual geometry rather than in an idealized causal model.
- [Experiments] Results on LLaMA-2 are used to support both the existence of linear representations and the role of the inner product, yet the text provides no details on control conditions, baseline inner products, or statistical tests. Without these, it is unclear whether the reported effects are attributable to the causal inner product or to other model properties.
- [§4, counterfactual pair construction] The operational definitions of probing and steering vectors from pairs presuppose that such pairs can be reliably approximated without entanglement or non-local effects. The paper does not report diagnostics (e.g., effect-size checks or ablation on pair quality) that would confirm the pairs behave as assumed minimal interventions.
minor comments (2)
- Notation for the causal inner product is introduced without an explicit comparison table to the Euclidean case, making it harder to see at a glance where the two metrics diverge.
- [Introduction] A few sentences in the introduction repeat the informal statement of the linear representation hypothesis; tightening would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment point by point below and indicate the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [§3, causal inner product derivation] The unification claim requires that projections and similarities under the new inner product correspond exactly to minimal interventions that flip only the target concept. The manuscript does not supply a direct verification (theoretical or empirical) that the constructed pairs satisfy this minimality condition in the model's actual geometry rather than in an idealized causal model.
  Authors: Our derivation in §3 proceeds from the counterfactual definitions of linear representations and identifies the causal inner product as the one that respects invariance under minimal interventions on the target concept. By construction within this framework, projections and similarities under the inner product align with the effects of such interventions in the idealized causal model. We agree that a direct empirical verification against the LLM's actual geometry would strengthen the presentation; we will revise §3 to make the theoretical correspondence more explicit and to discuss the idealized assumption as a limitation. Revision: partial.
- Referee: [Experiments] Results on LLaMA-2 are used to support both the existence of linear representations and the role of the inner product, yet the text provides no details on control conditions, baseline inner products, or statistical tests. Without these, it is unclear whether the reported effects are attributable to the causal inner product or to other model properties.
  Authors: We agree that the current experimental description lacks these details. In the revised manuscript we will expand the Experiments section to describe control conditions (including random counterfactual pairs), explicit comparisons to the Euclidean inner product as a baseline, and statistical tests assessing the significance of the reported effects. Revision: yes. (One possible shape of such a comparison is sketched after these responses.)
- Referee: [§4, counterfactual pair construction] The operational definitions of probing and steering vectors from pairs presuppose that such pairs can be reliably approximated without entanglement or non-local effects. The paper does not report diagnostics (e.g., effect-size checks or ablation on pair quality) that would confirm the pairs behave as assumed minimal interventions.
  Authors: The definitions in §4 rely on the assumption that the constructed pairs approximate minimal interventions, consistent with the formalization developed earlier. We will add to the revised manuscript diagnostics such as effect-size measurements on the interventions and ablations on pair quality to provide empirical support for this assumption. Revision: yes.
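For concreteness, one shape the promised comparison could take, as a hedged sketch rather than the authors' protocol: measure how strongly pair differences align under each inner product, with random pairs as the control and a label-permutation test for significance. The pair differences, the SPD stand-in for the causal matrix, and the permutation count below are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 32, 60

# Synthetic stand-ins: differences from "real" counterfactual pairs share
# a direction; differences from the random-pair control do not.
w = rng.standard_normal(d)
real_diffs = w + 0.5 * rng.standard_normal((n, d))
ctrl_diffs = rng.standard_normal((n, d))

def alignment(diffs, M):
    """Mean pairwise cosine of pair differences under inner-product matrix M."""
    z = diffs @ np.linalg.cholesky(M)       # map M-geometry to Euclidean
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    cos = z @ z.T
    return cos[np.triu_indices(len(z), k=1)].mean()

M_euclidean = np.eye(d)
A = rng.standard_normal((d, d))             # SPD stand-in for the causal matrix;
M_causal = A @ A.T / d + np.eye(d)          # the paper's choice is the inverse
                                            # covariance of the unembedding vectors

for name, M in [("euclidean", M_euclidean), ("causal (stand-in)", M_causal)]:
    obs = alignment(real_diffs, M) - alignment(ctrl_diffs, M)
    # Permutation test: shuffle real/control labels to build the null.
    pooled, null = np.vstack([real_diffs, ctrl_diffs]), []
    for _ in range(500):
        perm = rng.permutation(2 * n)
        null.append(alignment(pooled[perm[:n]], M)
                    - alignment(pooled[perm[n:]], M))
    p = (np.asarray(null) >= obs).mean()
    print(f"{name}: alignment gap = {obs:.3f}, permutation p = {p:.3f}")
```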
Circularity Check
No significant circularity; derivations follow from counterfactual formalizations and vector-space axioms
full rationale
The paper begins with explicit counterfactual-based definitions of linear representation (one in output space, one in input space), then uses standard linear-algebraic arguments to connect them to probing and steering. The causal inner product is constructed precisely to satisfy the invariance properties stated in those definitions, so the subsequent unification and pair-based construction of probes/steering vectors are direct consequences rather than independent predictions. No parameter is fitted to the target result, no uniqueness theorem is imported from the authors' own prior work, and no ansatz is smuggled via citation. The experimental section on LLaMA-2 supplies separate empirical checks that do not enter the formal chain.
Axiom & Free-Parameter Ledger
axioms (2)
- [standard math] Representation spaces are vector spaces over the reals
- [domain assumption] Counterfactual interventions on inputs and outputs are well-defined for the model
invented entities (1)
- causal inner product (no independent evidence)
Forward citations
Cited by 36 Pith papers
- REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
  REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
- SLAM: Structural Linguistic Activation Marking for Language Models
  SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
- SLAM: Structural Linguistic Activation Marking for Language Models
  SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
- Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
  Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
  Transformers encode counts correctly internally but fail to read them out due to misalignment with digit output directions, fixable by updating 37k output parameters or small LoRA on attention.
- Cell-Based Representation of Relational Binding in Language Models
  Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
  Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- Steering Language Models With Activation Engineering
  Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
- Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
  A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
- The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations
  Temporal knowledge drift is encoded as a geometrically orthogonal direction in LLM residual streams, independent of correctness and uncertainty.
- A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases
  LLMs exhibit three geometric phases in next-token prediction (seeding multiplexing, hoisting overriding, and focal convergence) where predictive subspaces rise, stabilize, and converge across layers.
- Tool Calling is Linearly Readable and Steerable in Language Models
  Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature structures by combining activation coverage with a new reconstruction condition, outperforming prior SAEs on hierarchical pair detection while matching state-of-the-art benchmark p...
- Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders
  Tree SAE learns hierarchical feature pairs in sparse autoencoders by combining activation coverage with a new reconstruction condition, outperforming prior methods on hierarchy detection while remaining competitive on...
- Emergent Symbolic Structure in Health Foundation Models: Extraction, Alignment, and Cross-Modal Transfer
  Health foundation model embeddings contain an interpretable symbolic organization shared across modalities that supports cross-domain transfer without joint training.
- Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
  VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.
- Pairwise matrices for sparse autoencoders: single-feature inspection mislabels causal axes
  Pairwise matrices for SAEs demonstrate that single-feature inspection mislabels causal axes, with joint suppression and matched-geometry controls revealing distinct output regimes not captured by single-feature or ran...
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
  Harmful intent is linearly separable in LLM residual streams across 12 models and multiple architectures, reaching mean AUROC 0.982 while showing protocol-dependent directions and strong generalization to held-out har...
- Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
  Harmful intent is geometrically recoverable as a linear direction or angular deviation in LLM residual streams, with high AUROC across 12 models, stable under alignment variants including abliterated ones, and transfe...
- LLM Safety From Within: Detecting Harmful Content with Internal Representations
  SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
- Characterizing Model-Native Skills
  Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
- Rhetorical Questions in LLM Representations: A Linear Probing Study
  Linear probes show rhetorical questions are encoded via multiple dataset-specific directions in LLM representations, with low cross-probe agreement on the same data.
- Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
  Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...
- Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
  DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
- When Safety Geometry Collapses: Fine-Tuning Vulnerabilities in Agentic Guard Models
  Benign fine-tuning collapses safety geometry in guard models like Granite Guardian, dropping refusal to 0%, but Fisher-Weighted Safety Subspace Regularization restores it to 75% while improving robustness.
- The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
  The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
- Steering Llama 2 via Contrastive Activation Addition
  Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
  Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
- Negative Before Positive: Asymmetric Valence Processing in Large Language Models
  Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
- Semantic Structure of Feature Space in Large Language Models
  LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
- H-Probes: Extracting Hierarchical Structures From Latent Representations of Language Models
  H-probes locate low-dimensional subspaces encoding hierarchy in LLM activations for synthetic tree tasks, show causal importance and generalization, and detect weaker signals in mathematical reasoning traces.
- From Weights to Activations: Is Steering the Next Frontier of Adaptation?
  Steering is positioned as a distinct adaptation paradigm that uses targeted activation interventions for local, reversible behavioral changes without parameter updates.
- There Will Be a Scientific Theory of Deep Learning
  A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...