pith. machine review for the scientific record.

arxiv: 2310.15154 · v1 · submitted 2023-10-23 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

Linear Representations of Sentiment in Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords sentiment · linear representation · activation space · causal intervention · attention heads · summarization · large language models · Stanford Sentiment Treebank

The pith

Sentiment in large language models is captured by one direction in activation space, with positive and negative at opposite poles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that across multiple models, sentiment appears as a linear feature: a single direction in the internal activations separates positive from negative cases on varied tasks. Causal interventions confirm this direction drives behavior on both controlled toy problems and real benchmarks such as the Stanford Sentiment Treebank. The direction is not limited to emotionally charged tokens; models also compute and store summarized sentiment at neutral sites like commas and proper names. A small set of attention heads and neurons carries most of the signal. Removing the direction erases most of the model's above-chance accuracy on zero-shot sentiment classification.

Core claim

Sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks, with one extreme for positive and the other for negative. Causal interventions isolate this direction and demonstrate it is causally relevant in toy tasks and on the Stanford Sentiment Treebank. A small subset of attention heads and neurons implements the direction. Sentiment is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names; in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance accuracy disappears when the direction is ablated, and roughly half of that loss (36 percentage points of above-chance accuracy) traces to the summarized sentiment at comma positions.

What carries the argument

The sentiment direction: a single vector in the model's activation space whose positive and negative extremes correspond to sentiment polarity and that can be read or edited via linear interventions.
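
A minimal sketch of the two operations this claim rests on, assuming the direction is the mean-difference vector described in the simulated rebuttal below; pos_acts and neg_acts are hypothetical tensors of cached activations, not the paper's exact pipeline.

import torch

def sentiment_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    # Assumed recipe: difference of class means, normalized to unit length.
    d = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return d / d.norm()

def read_sentiment(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Reading the feature: scalar projection onto the direction;
    # the sign indicates polarity.
    return acts @ direction

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Editing the feature: remove the component along the direction
    # (projection onto its orthogonal complement).
    return acts - (acts @ direction).unsqueeze(-1) * direction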

If this is right

  • Ablating the direction removes 76% of above-chance accuracy on Stanford Sentiment Treebank zero-shot classification (worked arithmetic follows this list).
  • Ablating the direction only at comma tokens, where sentiment has been summarized, removes 36% of above-chance accuracy, nearly half of the full effect.
  • A small number of attention heads and neurons are sufficient to implement the direction.
  • The same direction works across toy tasks and multiple real-world sentiment datasets.
  • Sentiment is computed and stored at neutral tokens rather than residing only on emotionally loaded words.
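
To make the headline numbers concrete, the bookkeeping works as follows; the 0.90 clean accuracy is an assumed illustrative value, not a figure from the paper.

# Illustrative arithmetic only; 0.90 is a hypothetical clean accuracy.
chance = 0.50                         # binary zero-shot classification
clean_acc = 0.90
above_chance = clean_acc - chance     # 0.40

# Full ablation removes 76% of above-chance accuracy:
full_ablation_acc = chance + (1 - 0.76) * above_chance    # 0.596

# Comma-only ablation removes 36% of above-chance accuracy,
# i.e. roughly half of the full effect (0.36 / 0.76 ≈ 0.47):
comma_ablation_acc = chance + (1 - 0.36) * above_chance   # 0.756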

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same linear-probe plus intervention method could be applied to locate directions for other high-level attributes such as truthfulness or toxicity (a generic probe sketch follows this list).
  • If many abstract features turn out to be linear, targeted editing of model outputs becomes a practical engineering tool.
  • The summarization motif suggests that models systematically aggregate information at punctuation for later use in downstream decisions.
  • Linear sentiment directions might be portable across model families, enabling transfer of interpretability findings.
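
A hedged reading of the first bullet above: the generic recipe is a supervised linear probe over cached activations, whose weight vector becomes a candidate direction for causal testing. Everything below is a stand-in; the arrays are synthetic, and the paper's own direction is a mean difference rather than a trained probe.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for [n_examples, d_model] cached activations and a
# binary attribute label (e.g. toxic vs. non-toxic); real use would cache
# these from a model on labeled text.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 512))
labels = rng.integers(0, 2, size=200)

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# `direction` is only a candidate: the intervention step (ablate it and
# measure the behavioral drop) is what separates a causal feature from a
# mere correlate.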

Load-bearing premise

The extracted direction is the main stable carrier of sentiment rather than one of several correlated vectors that happen to align on the tested models and datasets.

What would settle it

Ablating the identified direction produces no measurable drop in sentiment classification accuracy on a held-out distribution of text that was never used to locate the direction.
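
As a protocol, that test might be sketched as below; find_direction and accuracy are hypothetical helpers, and the constraint that matters is that the held-out corpus plays no role in locating the direction.

def falsification_test(model, locate_corpus, held_out_corpus,
                       find_direction, accuracy):
    # Hypothetical helpers: `find_direction` extracts the sentiment
    # direction using only `locate_corpus`; `accuracy` runs zero-shot
    # sentiment classification, optionally with a direction projected out.
    direction = find_direction(model, locate_corpus)
    clean = accuracy(model, held_out_corpus, ablate=None)
    ablated = accuracy(model, held_out_corpus, ablate=direction)
    # The load-bearing premise fails if `ablated` matches `clean`
    # within noise on the held-out distribution.
    return clean, ablated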

read the original abstract

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that sentiment is represented linearly in large language models as a single direction in activation space, with one pole for positive and the other for negative sentiment. This direction is identified via a contrast over labeled data and shown to be causally relevant through interventions on toy tasks and the Stanford Sentiment Treebank (SST), where ablating it removes 76% of above-chance zero-shot classification accuracy; nearly half of that effect (36% of above-chance accuracy) is attributable to ablating the direction only at comma positions. The work also identifies a small set of attention heads and neurons involved and introduces the 'summarization motif' whereby sentiment is aggregated at punctuation and other neutral tokens.

Significance. If the central claim is robust, the paper supplies concrete causal evidence for linear feature representations in LLMs and introduces a mechanistic account of information aggregation via the summarization motif. The interventions on a real-world dataset (SST) move beyond purely correlational probes and could inform downstream interpretability and editing techniques.

major comments (3)
  1. [Abstract / Results] The reported 76% loss of above-chance accuracy upon ablating the sentiment direction is a load-bearing quantitative claim, yet the manuscript provides no error bars, no number of random seeds, and no explicit description of how the direction vector is computed (e.g., whether it is the mean-difference vector, a normalized difference, or the result of a supervised probe).
  2. [Causal Interventions] The claim that a single direction 'mostly captures' the feature requires evidence that this vector is primary rather than one of several correlated directions. The paper should compare intervention effect sizes for the chosen direction against (i) other principal components of the same positive-negative contrast and (ii) directions derived from unrelated tasks or random vectors of the same dimensionality.
  3. [Summarization Motif] The attribution of 36% of above-chance accuracy to comma positions is interesting, but the manuscript does not test whether the same direction remains dominant or whether the motif persists when the input distribution is shifted away from the SST identification data (e.g., on other sentiment corpora or out-of-domain text).
minor comments (2)
  1. [Methods] Clarify the precise mathematical definition of the sentiment direction (including any centering, normalization, or projection steps) in the Methods section.
  2. [Figures] All figures showing ablation or intervention results should report standard errors or confidence intervals and state the number of independent runs.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

Thank you for your detailed review. We have carefully considered each major comment and made revisions to improve the clarity and robustness of our claims. Below we respond point by point.

read point-by-point responses
  1. Referee: [Abstract / Results] The reported 76% loss of above-chance accuracy upon ablating the sentiment direction is a load-bearing quantitative claim, yet the manuscript provides no error bars, no number of random seeds, and no explicit description of how the direction vector is computed (e.g., whether it is the mean-difference vector, a normalized difference, or the result of a supervised probe).

    Authors: We thank the referee for pointing this out. The direction vector is computed as the difference in mean activations between positive and negative labeled examples from our contrast dataset. We will include an explicit description of this computation in the Methods section. Additionally, we will report the 76% figure with error bars computed over multiple random seeds for the ablation experiments in the revised manuscript. revision: yes

  2. Referee: [Causal Interventions] The claim that a single direction 'mostly captures' the feature requires evidence that this vector is primary rather than one of several correlated directions. The paper should compare intervention effect sizes for the chosen direction against (i) other principal components of the same positive-negative contrast and (ii) directions derived from unrelated tasks or random vectors of the same dimensionality.

    Authors: We agree that demonstrating the primacy of this direction strengthens the claim. In the original manuscript, we show that ablating this specific direction has a large causal effect on sentiment classification. To address this, we will add comparisons showing that the intervention effect of our sentiment direction exceeds that of random vectors and other principal components from the contrast set in the revised version (a sketch of the random-vector baseline follows these responses). Directions from unrelated tasks are not directly comparable without new experiments, but we note that our direction is derived specifically from sentiment contrast. revision: partial

  3. Referee: [Summarization Motif] The attribution of 36% of above-chance accuracy to comma positions is interesting, but the manuscript does not test whether the same direction remains dominant or whether the motif persists when the input distribution is shifted away from the SST identification data (e.g., on other sentiment corpora or out-of-domain text).

    Authors: The summarization motif was identified through analysis on the SST dataset, which is a standard benchmark. We observed the motif consistently across different models. While we did not perform experiments on additional corpora in this work, the causal interventions on SST demonstrate the motif's relevance in a real-world setting. We believe this provides a solid foundation, and testing on out-of-domain text is an important direction for future work but beyond the scope of the current manuscript. revision: no
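
The random-vector control promised in response 2 could look like the following sketch; evaluate_with_ablation is a hypothetical callable returning the accuracy drop from ablating a given unit direction, not anything the paper defines.

import torch

def random_direction_baseline(evaluate_with_ablation, sentiment_dir: torch.Tensor,
                              n_controls: int = 20):
    # Compare the sentiment direction's ablation effect against random
    # unit vectors of the same dimensionality (the referee's control (ii)).
    target_drop = evaluate_with_ablation(sentiment_dir / sentiment_dir.norm())
    drops = []
    for _ in range(n_controls):
        v = torch.randn_like(sentiment_dir)
        drops.append(evaluate_with_ablation(v / v.norm()))
    controls = torch.tensor(drops)
    print(f"sentiment direction drop: {target_drop:.3f}")
    print(f"random-control drops:     {controls.mean():.3f} ± {controls.std():.3f}")
    return target_drop, controls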

Circularity Check

0 steps flagged

No circularity: direction extracted from contrast data and validated via independent causal interventions

full rationale

The paper extracts a sentiment direction by averaging activations on positive vs. negative examples from labeled datasets (SST etc.) and then performs causal ablations and interventions on held-out or differently distributed text. No equation defines the direction in terms of the same quantity it later predicts; the central claim is supported by out-of-sample causal effect sizes rather than by construction or self-citation chains. The summarization-motif analysis at commas is an additional empirical observation, not a definitional loop. The derivation chain is validated against external benchmarks rather than closing on itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The paper relies on the assumption that a single linear direction extracted via some contrast or probing method captures the dominant sentiment signal; this direction is treated as an empirical discovery rather than derived from first principles.

free parameters (1)
  • sentiment direction vector
    Extracted from model activations on chosen positive/negative examples; its precise construction method is not detailed in the abstract.
axioms (1)
  • domain assumption: Linear representation hypothesis for high-level features in transformer activations
    The study assumes that sentiment, like other features studied in prior interpretability work, admits a linear encoding.

pith-pipeline@v0.9.0 · 5518 in / 1270 out tokens · 37629 ms · 2026-05-15T12:37:30.988087+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  2. Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

    cs.LG 2026-05 unverdicted novelty 7.0

    Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

  3. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

    cs.AI 2026-05 unverdicted novelty 7.0

    Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

  4. Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

  5. PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

    cs.LG 2026-05 unverdicted novelty 7.0

    PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

  6. Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

    cs.CL 2026-04 unverdicted novelty 7.0

    Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.

  7. Cell-Based Representation of Relational Binding in Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...

  8. Emotion Concepts and their Function in a Large Language Model

    cs.AI 2026-04 unverdicted novelty 7.0

    Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

  9. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  10. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  11. Tool Calling is Linearly Readable and Steerable in Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

  12. Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

  13. LLM Safety From Within: Detecting Harmful Content with Internal Representations

    cs.AI 2026-04 unverdicted novelty 6.0

    SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

  14. Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

    cs.LG 2026-04 unverdicted novelty 6.0

    DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...

  15. Linear Representations of Hierarchical Concepts in Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.

  16. Steering Llama 2 via Contrastive Activation Addition

    cs.CL 2023-12 unverdicted novelty 6.0

    Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

  17. Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

    cs.LG 2026-05 unverdicted novelty 5.0

    Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.

  18. Negative Before Positive: Asymmetric Valence Processing in Large Language Models

    cs.CL 2026-05 unverdicted novelty 5.0

    Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

  19. Semantic Structure of Feature Space in Large Language Models

    cs.CL 2026-04 unverdicted novelty 5.0

    LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

Reference graph

Works this paper leans on

122 extracted references · 122 canonical work pages · cited by 19 Pith papers · 1 internal anchor
