Linear Representations of Sentiment in Large Language Models
Pith reviewed 2026-05-15 12:37 UTC · model grok-4.3
The pith
Sentiment in large language models is captured by one direction in activation space, with positive and negative at opposite poles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks, with one extreme for positive and the other for negative. Causal interventions isolate this direction and demonstrate it is causally relevant in toy tasks and on the Stanford Sentiment Treebank. A small subset of attention heads and neurons implements the direction. Sentiment is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names; in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance accuracy disappears when the direction is ablated, and roughly half of that loss (36% of above-chance accuracy) traces to the summarized sentiment at comma positions alone.
What carries the argument
The sentiment direction: a single vector in the model's activation space whose positive and negative extremes correspond to sentiment polarity and that can be read or edited via linear interventions.
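The rebuttal below describes the direction as a difference of mean activations between positive and negative examples. A minimal sketch of that construction, and of reading or ablating along it, assuming residual-stream activations of shape [n_examples, d_model]; the function names and the projection-based ablation are illustrative rather than the paper's exact procedure:

```python
import torch

def sentiment_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Mean-difference direction between positive and negative activations.

    pos_acts, neg_acts: [n_examples, d_model] residual-stream activations
    collected at a chosen layer and token position (assumed shapes).
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

def read_sentiment(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Scalar projection onto the direction: more positive ~ more positive sentiment.
    return acts @ direction

def ablate_direction(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # One simple form of directional ablation: remove the component along the direction.
    return acts - (acts @ direction).unsqueeze(-1) * direction
```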
If this is right
- Ablating the direction removes 76% of above-chance accuracy on Stanford Sentiment Treebank zero-shot classification.
- Roughly half of that loss (36% of above-chance accuracy) comes from ablating the direction only at comma tokens where sentiment has been summarized (see the sketch after this list).
- A small number of attention heads and neurons are sufficient to implement the direction.
- The same direction works across toy tasks and multiple real-world sentiment datasets.
- Sentiment is computed and stored at neutral tokens rather than residing only on emotionally loaded words.
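A hedged sketch of the position-restricted ablation implied by the comma result above, assuming a boolean mask over token positions; `ablate_at_positions` and the mask construction are illustrative names, not the paper's code:

```python
import torch

def ablate_at_positions(resid: torch.Tensor, direction: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Project out `direction` only where `mask` is True.

    resid:     [batch, seq, d_model] residual-stream activations (assumed shape)
    direction: [d_model], unit norm
    mask:      [batch, seq] bool, True at positions to ablate (e.g. comma tokens)
    """
    proj = (resid @ direction).unsqueeze(-1) * direction  # component along the direction
    return torch.where(mask.unsqueeze(-1), resid - proj, resid)
```

Comparing zero-shot accuracy with the mask covering all positions versus only comma positions is one way to separate the summarized contribution from the rest.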
Where Pith is reading between the lines
- The same linear-probe plus intervention method could be applied to locate directions for other high-level attributes such as truthfulness or toxicity.
- If many abstract features turn out to be linear, targeted editing of model outputs becomes a practical engineering tool.
- The summarization motif implies that models systematically aggregate information at punctuation for later use in downstream decisions.
- Linear sentiment directions might be portable across model families, enabling transfer of interpretability findings.
Load-bearing premise
The extracted direction is the main stable carrier of sentiment rather than one of several correlated vectors that happen to align on the tested models and datasets.
What would settle it
Ablating the identified direction produces no measurable drop in sentiment classification accuracy on a held-out distribution of text that was never used to locate the direction.
Original abstract
Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that sentiment is represented linearly in large language models as a single direction in activation space, with one pole for positive and the other for negative sentiment. This direction is identified via contrast on labeled data and shown to be causally relevant through interventions on toy tasks and the Stanford Sentiment Treebank (SST), where ablating it removes 76% of above-chance zero-shot classification accuracy, nearly half of which (36%) is attributable to ablating the direction only at comma positions. The work also identifies a small set of attention heads and neurons involved and introduces the 'summarization motif', whereby sentiment is aggregated at punctuation and other neutral tokens.
Significance. If the central claim is robust, the paper supplies concrete causal evidence for linear feature representations in LLMs and introduces a mechanistic account of information aggregation via the summarization motif. The interventions on a real-world dataset (SST) move beyond purely correlational probes and could inform downstream interpretability and editing techniques.
major comments (3)
- [Abstract / Results] Abstract and Results sections: the reported 76% loss of above-chance accuracy upon ablating the sentiment direction is a load-bearing quantitative claim, yet the manuscript provides no error bars, no number of random seeds, and no explicit description of how the direction vector is computed (e.g., whether it is the mean-difference vector, a normalized difference, or the result of a supervised probe).
- [Causal Interventions] Causal Interventions and Direction Identification: the claim that a single direction 'mostly captures' the feature requires evidence that this vector is primary rather than one of several correlated directions. The paper should compare intervention effect sizes for the chosen direction against (i) other principal components of the same positive-negative contrast and (ii) directions derived from unrelated tasks or random vectors of the same dimensionality (a minimal comparison sketch follows these comments).
- [Summarization Motif] Summarization Motif: the attribution of 36% of the accuracy loss to comma positions is interesting, but the manuscript does not test whether the same direction remains dominant or whether the motif persists when the input distribution is shifted away from the SST identification data (e.g., on other sentiment corpora or out-of-domain text).
minor comments (2)
- [Methods] Clarify the precise mathematical definition of the sentiment direction (including any centering, normalization, or projection steps) in the Methods section.
- [Figures] All figures showing ablation or intervention results should report standard errors or confidence intervals and state the number of independent runs.
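One way to run the baseline comparison requested in the second major comment is to measure the fraction of above-chance accuracy lost for the sentiment direction versus random unit vectors of the same dimensionality. This is a sketch under assumptions: `run_zero_shot_sst` is a hypothetical evaluation hook, not an API from the paper, and chance accuracy is assumed to be 0.5 for binary SST.

```python
import torch

def effect_size(run_zero_shot_sst, direction, chance: float = 0.5) -> float:
    """Fraction of above-chance accuracy lost when `direction` is ablated.

    run_zero_shot_sst(direction) is assumed to return classification accuracy
    with directional ablation applied (or no ablation when direction is None).
    """
    baseline = run_zero_shot_sst(None)
    ablated = run_zero_shot_sst(direction)
    return (baseline - ablated) / (baseline - chance)

def random_baseline(run_zero_shot_sst, d_model: int, n: int = 20) -> torch.Tensor:
    # Effect sizes for random unit vectors, for comparison against the sentiment direction.
    effects = []
    for _ in range(n):
        v = torch.randn(d_model)
        effects.append(effect_size(run_zero_shot_sst, v / v.norm()))
    return torch.tensor(effects)
```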
Simulated Author's Rebuttal
Thank you for your detailed review. We have carefully considered each major comment and made revisions to improve the clarity and robustness of our claims. Below we respond point by point.
Point-by-point responses
-
Referee: [Abstract / Results] Abstract and Results sections: the reported 76% loss of above-chance accuracy upon ablating the sentiment direction is a load-bearing quantitative claim, yet the manuscript provides no error bars, no number of random seeds, and no explicit description of how the direction vector is computed (e.g., whether it is the mean-difference vector, a normalized difference, or the result of a supervised probe).
Authors: We thank the referee for pointing this out. The direction vector is computed as the difference in mean activations between positive and negative labeled examples from our contrast dataset. We will include an explicit description of this computation in the Methods section. Additionally, we will report the 76% figure with error bars computed over multiple random seeds for the ablation experiments in the revised manuscript. revision: yes
-
Referee: [Causal Interventions] Causal Interventions and Direction Identification: the claim that a single direction 'mostly captures' the feature requires evidence that this vector is primary rather than one of several correlated directions. The paper should compare intervention effect sizes for the chosen direction against (i) other principal components of the same positive-negative contrast and (ii) directions derived from unrelated tasks or random vectors of the same dimensionality.
Authors: We agree that demonstrating the primacy of this direction strengthens the claim. In the original manuscript, we show that ablating this specific direction has a large causal effect on sentiment classification. To address this, we will add comparisons showing that the intervention effect of our sentiment direction exceeds that of random vectors and other principal components from the contrast set in the revised version. Directions from unrelated tasks are not directly comparable without new experiments, but we note that our direction is derived specifically from sentiment contrast. revision: partial
-
Referee: [Summarization Motif] Summarization Motif: the attribution of 36% of the accuracy loss to comma positions is interesting, but the manuscript does not test whether the same direction remains dominant or whether the motif persists when the input distribution is shifted away from the SST identification data (e.g., on other sentiment corpora or out-of-domain text).
Authors: The summarization motif was identified through analysis on the SST dataset, which is a standard benchmark. We observed the motif consistently across different models. While we did not perform experiments on additional corpora in this work, the causal interventions on SST demonstrate the motif's relevance in a real-world setting. We believe this provides a solid foundation, and testing on out-of-domain text is an important direction for future work but beyond the scope of the current manuscript. revision: no
Circularity Check
No circularity: direction extracted from contrast data and validated via independent causal interventions
full rationale
The paper extracts a sentiment direction by averaging activations on positive vs. negative examples from labeled datasets (SST etc.) and then performs causal ablations and interventions on held-out or differently distributed text. No equation defines the direction in terms of the same quantity it later predicts; the central claim is supported by out-of-sample causal effect sizes rather than by construction or self-citation chains. The summarization-motif analysis at commas is an additional empirical observation, not a definitional loop. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- sentiment direction vector
axioms (1)
- domain assumption: the linear representation hypothesis for high-level features in transformer activations
Forward citations
Cited by 19 Pith papers
-
Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
-
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions
Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.
-
How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
-
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs
LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.
-
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
-
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
-
Cell-Based Representation of Relational Binding in Language Models
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the...
-
Emotion Concepts and their Function in a Large Language Model
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
-
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.
-
LLM Safety From Within: Detecting Harmful Content with Internal Representations
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
-
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and...
-
Linear Representations of Hierarchical Concepts in Language Models
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
-
Steering Llama 2 via Contrastive Activation Addition
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
-
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
-
Negative Before Positive: Asymmetric Valence Processing in Large Language Models
Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
-
Semantic Structure of Feature Space in Large Language Models
LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.