pith. machine review for the scientific record.

arxiv: 2310.15154 · v1 · submitted 2023-10-23 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Linear Representations of Sentiment in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sentiment · linear representation · activation space · causal intervention · attention heads · summarization · large language models · Stanford Sentiment Treebank

The pith

Sentiment in large language models is captured by one direction in activation space, with positive and negative at opposite poles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that across multiple models, sentiment appears as a linear feature: a single direction in the internal activations separates positive from negative cases on varied tasks. Causal interventions confirm this direction drives behavior on both controlled toy problems and real benchmarks such as the Stanford Sentiment Treebank. The direction is not limited to emotionally charged tokens; models also compute and store summarized sentiment at neutral sites like commas and proper names. A small set of attention heads and neurons carries most of the signal. Removing the direction erases most of the model's above-chance accuracy on zero-shot sentiment classification.

Core claim

Sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks, with one extreme for positive and the other for negative. Causal interventions isolate this direction and demonstrate it is causally relevant in toy tasks and on the Stanford Sentiment Treebank. A small subset of attention heads and neurons implements the direction. Sentiment is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names; in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance accuracy disappears when the direction is ablated, and roughly half of that loss (36 percentage points of above-chance accuracy) traces to the summarized sentiment at comma positions.

What carries the argument

The sentiment direction: a single vector in the model's activation space whose positive and negative extremes correspond to sentiment polarity and that can be read or edited via linear interventions.
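
A minimal sketch of the two operations this claim rests on, assuming the direction is the mean-difference vector described in the simulated rebuttal below; pos_acts and neg_acts are hypothetical tensors of cached activations, not the paper's exact pipeline.

import torch

def sentiment_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    # Assumed recipe: difference of class means, normalized to unit length.
    d = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return d / d.norm()

def read_sentiment(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Reading the feature: scalar projection onto the direction;
    # the sign indicates polarity.
    return acts @ direction

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Editing the feature: remove the component along the direction
    # (projection onto its orthogonal complement).
    return acts - (acts @ direction).unsqueeze(-1) * direction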

If this is right

  • Ablating the direction removes 76% of above-chance accuracy on Stanford Sentiment Treebank zero-shot classification (worked arithmetic follows this list).
  • Ablating the direction only at comma tokens, where sentiment has been summarized, removes 36% of above-chance accuracy, nearly half of the full effect.
  • A small number of attention heads and neurons are sufficient to implement the direction.
  • The same direction works across toy tasks and multiple real-world sentiment datasets.
  • Sentiment is computed and stored at neutral tokens rather than residing only on emotionally loaded words.
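
To make the headline numbers concrete, the bookkeeping works as follows; the 0.90 clean accuracy is an assumed illustrative value, not a figure from the paper.

# Illustrative arithmetic only; 0.90 is a hypothetical clean accuracy.
chance = 0.50                         # binary zero-shot classification
clean_acc = 0.90
above_chance = clean_acc - chance     # 0.40

# Full ablation removes 76% of above-chance accuracy:
full_ablation_acc = chance + (1 - 0.76) * above_chance    # 0.596

# Comma-only ablation removes 36% of above-chance accuracy,
# i.e. roughly half of the full effect (0.36 / 0.76 ≈ 0.47):
comma_ablation_acc = chance + (1 - 0.36) * above_chance   # 0.756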

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear-probe plus intervention method could be applied to locate directions for other high-level attributes such as truthfulness or toxicity (a generic probe sketch follows this list).
  • If many abstract features turn out to be linear, targeted editing of model outputs becomes a practical engineering tool.
  • The summarization motif suggests that models systematically aggregate information at punctuation for later use in downstream decisions.
  • Linear sentiment directions might be portable across model families, enabling transfer of interpretability findings.
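
A hedged reading of the first bullet above: the generic recipe is a supervised linear probe over cached activations, whose weight vector becomes a candidate direction for causal testing. Everything below is a stand-in; the arrays are synthetic, and the paper's own direction is a mean difference rather than a trained probe.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for [n_examples, d_model] cached activations and a
# binary attribute label (e.g. toxic vs. non-toxic); real use would cache
# these from a model on labeled text.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 512))
labels = rng.integers(0, 2, size=200)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# `direction` is only a candidate: the intervention step (ablate it and
# measure the behavioral drop) is what separates a causal feature from a
# mere correlate.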

Load-bearing premise

The extracted direction is the main stable carrier of sentiment rather than one of several correlated vectors that happen to align on the tested models and datasets.

What would settle it

Ablating the identified direction produces no measurable drop in sentiment classification accuracy on a held-out distribution of text that was never used to locate the direction.
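
As a protocol, that test might be sketched as below; find_direction and accuracy are hypothetical helpers, and the constraint that matters is that the held-out corpus plays no role in locating the direction.

def falsification_test(model, locate_corpus, held_out_corpus,
                       find_direction, accuracy):
    # Hypothetical helpers: `find_direction` extracts the sentiment
    # direction using only `locate_corpus`; `accuracy` runs zero-shot
    # sentiment classification, optionally with a direction projected out.
    direction = find_direction(model, locate_corpus)
    clean = accuracy(model, held_out_corpus, ablate=None)
    ablated = accuracy(model, held_out_corpus, ablate=direction)
    # The load-bearing premise fails if `ablated` matches `clean`
    # within noise on the held-out distribution.
    return clean, ablated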

read the original abstract

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that sentiment is represented linearly in large language models as a single direction in activation space, with one pole for positive and the other for negative sentiment. This direction is identified via a contrast over labeled data and shown to be causally relevant through interventions on toy tasks and the Stanford Sentiment Treebank (SST), where ablating it removes 76% of above-chance zero-shot classification accuracy; nearly half of that effect (36% of above-chance accuracy) is attributable to ablating the direction only at comma positions. The work also identifies a small set of attention heads and neurons involved and introduces the 'summarization motif' whereby sentiment is aggregated at punctuation and other neutral tokens.

Significance. If the central claim is robust, the paper supplies concrete causal evidence for linear feature representations in LLMs and introduces a mechanistic account of information aggregation via the summarization motif. The interventions on a real-world dataset (SST) move beyond purely correlational probes and could inform downstream interpretability and editing techniques.

major comments (3)
  1. [Abstract / Results] The reported 76% loss of above-chance accuracy upon ablating the sentiment direction is a load-bearing quantitative claim, yet the manuscript provides no error bars, no number of random seeds, and no explicit description of how the direction vector is computed (e.g., whether it is the mean-difference vector, a normalized difference, or the result of a supervised probe).
  2. [Causal Interventions] The claim that a single direction 'mostly captures' the feature requires evidence that this vector is primary rather than one of several correlated directions. The paper should compare intervention effect sizes for the chosen direction against (i) other principal components of the same positive-negative contrast and (ii) directions derived from unrelated tasks or random vectors of the same dimensionality.
  3. [Summarization Motif] The attribution of 36% of above-chance accuracy to comma positions is interesting, but the manuscript does not test whether the same direction remains dominant or whether the motif persists when the input distribution is shifted away from the SST identification data (e.g., on other sentiment corpora or out-of-domain text).
minor comments (2)
  1. [Methods] Clarify the precise mathematical definition of the sentiment direction (including any centering, normalization, or projection steps) in the Methods section.
  2. [Figures] All figures showing ablation or intervention results should report standard errors or confidence intervals and state the number of independent runs.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for your detailed review. We have carefully considered each major comment and made revisions to improve the clarity and robustness of our claims. Below we respond point by point.

read point-by-point responses
  1. Referee: [Abstract / Results] The reported 76% loss of above-chance accuracy upon ablating the sentiment direction is a load-bearing quantitative claim, yet the manuscript provides no error bars, no number of random seeds, and no explicit description of how the direction vector is computed (e.g., whether it is the mean-difference vector, a normalized difference, or the result of a supervised probe).

    Authors: We thank the referee for pointing this out. The direction vector is computed as the difference in mean activations between positive and negative labeled examples from our contrast dataset. We will include an explicit description of this computation in the Methods section. Additionally, we will report the 76% figure with error bars computed over multiple random seeds for the ablation experiments in the revised manuscript. revision: yes

  2. Referee: [Causal Interventions] The claim that a single direction 'mostly captures' the feature requires evidence that this vector is primary rather than one of several correlated directions. The paper should compare intervention effect sizes for the chosen direction against (i) other principal components of the same positive-negative contrast and (ii) directions derived from unrelated tasks or random vectors of the same dimensionality.

    Authors: We agree that demonstrating the primacy of this direction strengthens the claim. In the original manuscript, we show that ablating this specific direction has a large causal effect on sentiment classification. To address this, we will add comparisons showing that the intervention effect of our sentiment direction exceeds that of random vectors and other principal components from the contrast set in the revised version (a sketch of the random-vector baseline follows these responses). Directions from unrelated tasks are not directly comparable without new experiments, but we note that our direction is derived specifically from sentiment contrast. revision: partial

  3. Referee: [Summarization Motif] The attribution of 36% of above-chance accuracy to comma positions is interesting, but the manuscript does not test whether the same direction remains dominant or whether the motif persists when the input distribution is shifted away from the SST identification data (e.g., on other sentiment corpora or out-of-domain text).

    Authors: The summarization motif was identified through analysis on the SST dataset, which is a standard benchmark. We observed the motif consistently across different models. While we did not perform experiments on additional corpora in this work, the causal interventions on SST demonstrate the motif's relevance in a real-world setting. We believe this provides a solid foundation, and testing on out-of-domain text is an important direction for future work but beyond the scope of the current manuscript. revision: no
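
The random-vector control promised in response 2 could look like the following sketch; evaluate_with_ablation is a hypothetical callable returning the accuracy drop from ablating a given unit direction, not anything the paper defines.

import torch

def random_direction_baseline(evaluate_with_ablation, sentiment_dir: torch.Tensor,
                              n_controls: int = 20):
    # Compare the sentiment direction's ablation effect against random
    # unit vectors of the same dimensionality (the referee's control (ii)).
    target_drop = evaluate_with_ablation(sentiment_dir / sentiment_dir.norm())
    drops = []
    for _ in range(n_controls):
        v = torch.randn_like(sentiment_dir)
        drops.append(evaluate_with_ablation(v / v.norm()))
    controls = torch.tensor(drops)
    print(f"sentiment direction drop: {target_drop:.3f}")
    print(f"random-control drops:     {controls.mean():.3f} ± {controls.std():.3f}")
    return target_drop, controls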

Circularity Check

0 steps flagged

No circularity: direction extracted from contrast data and validated via independent causal interventions

full rationale

The paper extracts a sentiment direction by averaging activations on positive vs. negative examples from labeled datasets (SST etc.) and then performs causal ablations and interventions on held-out or differently distributed text. No equation defines the direction in terms of the same quantity it later predicts; the central claim is supported by out-of-sample causal effect sizes rather than by construction or self-citation chains. The summarization-motif analysis at commas is an additional empirical observation, not a definitional loop. The derivation chain is validated against external benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on the assumption that a single linear direction extracted via some contrast or probing method captures the dominant sentiment signal; this direction is treated as an empirical discovery rather than derived from first principles.

free parameters (1)
  • sentiment direction vector
    Extracted from model activations on chosen positive/negative examples; its precise construction method is not detailed in the abstract.
axioms (1)
  • domain assumption: Linear representation hypothesis for high-level features in transformer activations
    The study assumes that sentiment, like other features studied in prior interpretability work, admits a linear encoding.

pith-pipeline@v0.9.0 · 5518 in / 1270 out tokens · 37629 ms · 2026-05-15T12:37:30.988087+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  2. Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

  3. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI 2026-05 unverdicted novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

  4. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  5. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

    cs.LG 2026-05 unverdicted novelty 7.0

    PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

  6. Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.

  7. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  8. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  9. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  10. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  11. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  12. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

  13. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  14. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  15. Linear Representations of Hierarchical Concepts in Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.

  16. Steering Llama 2 via Contrastive Activation Addition

    cs.CL 2023-12 unverdicted novelty 6.0

    Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

  17. Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

    cs.LG 2026-05 unverdicted novelty 5.0

    Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.

  18. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  19. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 19 Pith papers · 1 internal anchor
