hub Canonical reference

Linear Representations of Sentiment in Large Language Models

Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, Neel Nanda · 2023 · cs.LG · arXiv 2310.15154

Canonical reference. 100% of citing Pith papers cite this work as background.

23 Pith papers citing it

Background 100% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. Through causal interventions, we isolate this direction and show it is causally relevant in both toy tasks and real world datasets such as Stanford Sentiment Treebank. Through this case study we model a thorough investigation of what a single direction means on a broad data distribution. We further uncover the mechanisms that involve this direction, highlighting the roles of a small subset of attention heads and neurons. Finally, we discover a phenomenon which we term the summarization motif: sentiment is not solely represented on emotionally charged words, but is additionally summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in Stanford Sentiment Treebank zero-shot classification, 76% of above-chance classification accuracy is lost when ablating the sentiment direction, nearly half of which (36%) is due to ablating the summarized sentiment direction exclusively at comma positions.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

How LLMs Are Persuaded: A Few Attention Heads, Rerouted

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.

Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.

PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.

Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.

Cell-Based Representation of Relational Binding in Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.

Emotion Concepts and their Function in a Large Language Model

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Tool Calling is Linearly Readable and Steerable in Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.

Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.

LLM Safety From Within: Detecting Harmful Content with Internal Representations

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.

Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.

Linear Representations of Hierarchical Concepts in Language Models

cs.CL · 2026-04-09 · unverdicted · novelty 6.0

Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.

Graph-Based Alternatives to LLMs for Human Simulation

cs.CL · 2025-11-03 · conditional · novelty 6.0

GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.

Steering Llama 2 via Contrastive Activation Addition

cs.CL · 2023-12-09 · unverdicted · novelty 6.0

Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift

cs.HC · 2026-05-14 · unverdicted · novelty 5.0

Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.

Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity

cs.LG · 2026-05-13 · unverdicted · novelty 5.0

Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.

Negative Before Positive: Asymmetric Valence Processing in Large Language Models

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.

Semantic Structure of Feature Space in Large Language Models

cs.CL · 2026-04-29 · unverdicted · novelty 5.0

LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.

How to use and interpret activation patching

cs.LG · 2024-04-23 · accept · novelty 5.0

Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.

Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought

cs.LG · 2025-10-28

citing papers explorer

Showing 23 of 23 citing papers.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens cs.LG · 2026-04-03 · accept · none · ref 25 · internal anchor
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions cs.LG · 2026-05-11 · unverdicted · none · ref 24 · internal anchor
Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.
How LLMs Are Persuaded: A Few Attention Heads, Rerouted cs.AI · 2026-05-10 · unverdicted · none · ref 16 · internal anchor
Persuasion in LLMs works by redirecting a small set of attention heads to copy the target option token instead of reasoning over evidence, via a rank-one routing feature that can be directly edited or removed.
Repeated-Token Counting Reveals a Dissociation Between Representations and Outputs cs.CL · 2026-05-10 · unverdicted · none · ref 10 · internal anchor
LLMs encode repeated token counts correctly in residual streams but a format-triggered MLP at 88-93% depth overwrites it with an incorrect fixed value.
PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction cs.LG · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
PLOT localizes causal variables in neural networks by fitting optimal transport couplings between abstract and neural intervention effect geometries, enabling fast handles or guided search.
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 19 · internal anchor
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
Cell-Based Representation of Relational Binding in Language Models cs.CL · 2026-04-21 · unverdicted · none · ref 39 · internal anchor
Large language models encode relational bindings via a cell-based representation: a low-dimensional linear subspace in which each cell corresponds to an entity-relation index pair and attributes are retrieved from the matching cell.
Emotion Concepts and their Function in a Large Language Model cs.AI · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
Claude Sonnet 4.5 exhibits functional emotions via abstract internal representations of emotion concepts that causally influence its preferences and misaligned behaviors without implying subjective experience.
Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 190 · internal anchor
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 84 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Tool Calling is Linearly Readable and Steerable in Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 70 · internal anchor
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models cs.CV · 2026-05-08 · unverdicted · none · ref 37 · internal anchor
VLMs possess a latent 3D scene topology subspace corresponding to Laplacian eigenmaps that can be causally shaped via Dirichlet energy regularization to improve spatial task performance by up to 12.1%.
LLM Safety From Within: Detecting Harmful Content with Internal Representations cs.AI · 2026-04-20 · unverdicted · none · ref 75 · internal anchor
SIREN identifies safety neurons via linear probing on internal LLM layers and combines them with adaptive weighting to detect harm, outperforming prior guard models with 250x fewer parameters.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 98 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
Linear Representations of Hierarchical Concepts in Language Models cs.CL · 2026-04-09 · unverdicted · none · ref 26 · internal anchor
Language models encode concept hierarchies as linear transformations that are domain-specific yet structurally similar across domains.
Graph-Based Alternatives to LLMs for Human Simulation cs.CL · 2025-11-03 · conditional · none · ref 73 · internal anchor
GEMS formulates close-ended human-behavior simulation as link prediction on a heterogeneous graph and matches or exceeds LLM performance with three orders of magnitude fewer parameters across three datasets and three evaluation settings.
Steering Llama 2 via Contrastive Activation Addition cs.CL · 2023-12-09 · unverdicted · none · ref 22 · internal anchor
Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Multi-Turn Neural Transparency: Surfacing Neural Activations Improves User Calibration to LLM Behavioral Drift cs.HC · 2026-05-14 · unverdicted · none · ref 44 · internal anchor
Multi-turn neural transparency using behavioral vectors and dynamic visualizations improves user anticipation and evaluation of LLM trait expression while reducing overconfidence, per a randomized study with 246 participants.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity cs.LG · 2026-05-13 · unverdicted · none · ref 30 · internal anchor
Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
Negative Before Positive: Asymmetric Valence Processing in Large Language Models cs.CL · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
Negative valence localizes to early layers and positive valence to mid-to-late layers in LLMs, with the directions being causally steerable.
Semantic Structure of Feature Space in Large Language Models cs.CL · 2026-04-29 · unverdicted · none · ref 15 · internal anchor
LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
How to use and interpret activation patching cs.LG · 2024-04-23 · accept · none · ref 26 · internal anchor
Activation patching provides evidence about neural network circuits when the choice of metric is aligned with the hypothesis and common interpretation errors are avoided.
Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought cs.LG · 2025-10-28 · unreviewed · ref 27 · internal anchor

Linear Representations of Sentiment in Large Language Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer