Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison
Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3
The pith
Generation-based extraction separates emotions more cleanly than comprehension-based methods in small language models, with signals localizing at middle layers across architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007, with a large effect size). Emotion representations localize at middle transformer layers along a U-shaped curve that remains architecture-invariant from 124M to 3B parameters. Steering experiments identify three regimes separated by architecture rather than scale, and an external classifier confirms causal effects at a 92 percent success rate.
What carries the argument
Emotion vectors extracted by contrasting generation-based versus comprehension-based methods, localized by layer-wise probing in transformer blocks.
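The contrastive step can be sketched as a mean difference of layer activations. This is a minimal numpy illustration with synthetic stand-in activations, not the paper's pipeline; the function name, prompt counts, and 768-dimensional width are assumptions for the sketch.

```python
import numpy as np

def extract_emotion_vector(emotion_acts, neutral_acts):
    """Contrastive extraction: mean difference between hidden states
    collected on emotion-laden vs. neutral prompts at one layer."""
    v = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit norm so steering scales stay comparable

# Toy stand-in activations: 50 prompts x 768-dim hidden states.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(50, 768))
offset = 0.5 * rng.normal(size=768)   # planted "emotion direction"
joy = neutral + offset                # emotion prompts share the offset
v_joy = extract_emotion_vector(joy, neutral)
```

In a real run the activation matrices would come from a forward hook on a middle transformer block, where the paper reports the signal is strongest.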
If this is right
- Middle-layer localization holds across GPT-2, Gemma, Qwen, Llama, and Mistral families, suggesting a shared structural pattern.
- Steering success splits into surgical transformation, repetitive collapse, or explosive degradation depending on architecture.
- Instruction tuning modulates the advantage of generation-based extraction.
- Cross-lingual emotion entanglement appears in Qwen models, activating aligned tokens in other languages.
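The steering behind these observations is, in its standard form, activation addition in the style of Turner et al. [16]: a scaled emotion direction is added to a layer's hidden states during the forward pass. The sketch below uses synthetic stand-ins; the scale `alpha` and shapes are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def steer(hidden, emotion_vec, alpha=4.0):
    """Activation-addition steering: add a scaled emotion direction
    to a layer's hidden states (illustrative toy version)."""
    return hidden + alpha * emotion_vec

rng = np.random.default_rng(2)
v = rng.normal(size=768)
v /= np.linalg.norm(v)                 # unit emotion direction
h = rng.normal(size=(10, 768))         # stand-in hidden states for 10 tokens
h_steered = steer(h, v)
shift = (h_steered - h) @ v            # projection of the change onto the axis
```

Because `v` is unit-norm, each token's state moves by exactly `alpha` along the emotion axis, which is what makes the intervention's strength controllable.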
Where Pith is reading between the lines
- If the middle-layer pattern generalizes, emotion control techniques developed on one family may transfer to others with minimal retuning.
- The architecture-dependent steering regimes imply that safety filters tuned on one model type may leave gaps in others of similar size.
- The U-shaped localization curve could be used to prune or compress models while preserving emotion-handling capacity.
Load-bearing premise
The extracted vectors and steering interventions reflect genuine causal emotion states rather than mere statistical correlations, and the external classifier validates them without inheriting the same biases as the extraction process.
What would settle it
A controlled test in which steering with the extracted vectors fails to shift the predictions of a fully independent emotion classifier on held-out prompts while keeping perplexity stable.
Original abstract
Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen's d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes -- surgical (coherent text transformation), repetitive collapse, and explosive (text degradation) -- quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts the first comparative study of emotion vector extraction methods in small language models (SLMs) ranging from 124M to 3B parameters across five architectural families. It evaluates generation-based versus comprehension-based extraction for 20 emotions, finding generation-based superior with Mann-Whitney p = 0.007 and Cohen's d = -107.5. Emotion representations are shown to localize at middle layers (~50% depth) in a U-shaped pattern invariant to architecture. Steering interventions demonstrate three behavioral regimes differentiated by architecture, with external validation via an emotion classifier achieving 92% success; the manuscript also documents cross-lingual entanglement in Qwen models.
Significance. Should the reported statistical superiority be confirmed after addressing the effect-size computation, the work provides practical methodological guidelines for emotion representation research in open-weight SLMs. The combination of internal probing, steering for causal effects, and external classifier validation offers a robust empirical framework that bridges representational analysis with behavioral outcomes, contributing to safety considerations in multilingual deployments.
major comments (1)
- [Results section (Mann-Whitney test and effect size)] The reported Cohen's d = -107.5 accompanying the Mann-Whitney p = 0.007 for generation-based extraction superiority is implausibly large. In high-dimensional embedding spaces, realistic class separations yield |Cohen's d| values typically below 5; a value of 107.5 implies either degenerate vectors with near-zero variance or an error in the effect-size formula (such as omitting the pooled standard deviation or incorrect scaling). This metric is central to the claim of statistical superiority and the architecture-invariance conclusions, so the computation method must be detailed and corrected if erroneous.
minor comments (2)
- [Abstract and Methods] Provide more explicit details on data splits, exact baseline constructions for anisotropy, and how potential confounds like vector normalization are handled to strengthen reproducibility.
- [Steering experiments] Clarify the criteria for classifying the three regimes (surgical, repetitive collapse, explosive) and report the perplexity ratio thresholds used.
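One concrete way to meet the second minor comment is a threshold rule on the perplexity ratio. The cutoffs below are hypothetical placeholders (the manuscript does not report its own, which is the point of the comment); the direction of each regime follows the usual behavior of degenerate text, where repetition drives perplexity down and incoherence drives it up.

```python
def classify_steering_regime(ppl_steered, ppl_base,
                             collapse_below=0.5, explode_above=5.0):
    """Illustrative regime rule on the perplexity ratio.
    Thresholds are hypothetical stand-ins, not the paper's values."""
    ratio = ppl_steered / ppl_base
    if ratio < collapse_below:
        return "repetitive collapse"   # degenerate loops push perplexity down
    if ratio > explode_above:
        return "explosive"             # incoherent text pushes perplexity up
    return "surgical"                  # coherent transformation, ratio near 1

print(classify_steering_regime(12.0, 10.0))   # ratio 1.2 → surgical
```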
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review. We address the single major comment below and have prepared revisions to the manuscript accordingly.
Point-by-point responses
Referee: [Results section (Mann-Whitney test and effect size)] The reported Cohen's d = -107.5 accompanying the Mann-Whitney p = 0.007 for generation-based extraction superiority is implausibly large. In high-dimensional embedding spaces, realistic class separations yield |Cohen's d| values typically below 5; a value of 107.5 implies either degenerate vectors with near-zero variance or an error in the effect-size formula (such as omitting the pooled standard deviation or incorrect scaling). This metric is central to the claim of statistical superiority and the architecture-invariance conclusions, so the computation method must be detailed and corrected if erroneous.
Authors: We agree that a Cohen's d of 107.5 is implausibly large and indicates an error in our effect-size computation. The value most likely arose from an incorrect implementation that omitted division by the pooled standard deviation or applied improper scaling to the high-dimensional vectors. In the revised manuscript we will (1) state the exact formula employed, (2) recompute Cohen's d using the standard definition d = (μ_gen - μ_comp) / σ_pooled on the per-emotion separation scores, and (3) report the corrected, realistic effect size together with the associated confidence interval. The direction of the Mann-Whitney result remains unchanged, but the magnitude will be brought into the expected range for embedding-space separations. We will also add a short methods subsection describing the computation and release the corresponding code snippet for reproducibility.

Revision: yes.
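The corrected computation the authors commit to can be sketched as follows; the separation scores are hypothetical toy inputs, not the paper's data.

```python
import numpy as np

def cohens_d(a, b):
    """Standard Cohen's d: mean difference divided by the pooled
    standard deviation, d = (mean_a - mean_b) / s_pooled."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    na, nb = len(a), len(b)
    s_pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                       / (na + nb - 2))
    return (a.mean() - b.mean()) / s_pooled

# Hypothetical per-emotion separation scores for the two methods.
# Omitting s_pooled (or dividing by a near-zero spread) is exactly the
# kind of slip that inflates |d| into the hundreds.
gen  = [0.82, 0.74, 0.90, 0.78, 0.83]
comp = [0.55, 0.70, 0.52, 0.68, 0.60]
d = cohens_d(gen, comp)
```

With realistic spread in the scores, d lands in the single digits, consistent with the referee's expectation for embedding-space separations.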
Circularity Check
No significant circularity; purely empirical measurements with external validation
full rationale
The paper reports direct empirical results from activation extraction, statistical tests (Mann-Whitney p-values and Cohen's d), layer localization curves, and steering interventions across multiple models. All load-bearing claims rest on observable outputs from the models themselves and an independent external classifier (92% success rate), with no equations, fitted parameters renamed as predictions, self-citations forming the central premise, or derivations that reduce to inputs by construction. The analysis is self-contained against external benchmarks and contains no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer layer activations can be treated as linearly extractable feature vectors that correspond to human-interpretable concepts such as emotions.
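This assumption is testable: if a concept really is a linear direction in activation space, a simple least-squares probe recovers it. The sketch below plants a direction in synthetic stand-in activations and checks that the probe finds it; all quantities are toy assumptions, not model activations.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 64
true_dir = rng.normal(size=d)          # planted "emotion" direction
acts = rng.normal(size=(n, d))         # stand-in layer activations
labels = np.sign(acts @ true_dir)      # +1 emotion-laden, -1 neutral

# Least-squares linear probe: regress labels on activations.
w, *_ = np.linalg.lstsq(acts, labels, rcond=None)
probe_acc = np.mean(np.sign(acts @ w) == labels)
```

If activations encoded emotions only nonlinearly, this probe's accuracy would fall toward chance, which is one way the axiom could fail empirically.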
Forward citations
Cited by 1 Pith paper
- Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds
  Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an…
Reference graph
Works this paper leans on
[1] Anthropic. (2026). Emotion Concepts and Their Function in a Large Language Model. Transformer Circuits.
[2] Bloom, J. (2024). SAELens: A Library for Training and Analyzing Sparse Autoencoders. GitHub.
[3] Burnell, R., et al. (2023). Rethink Reporting of Evaluation Results in AI. Science.
[4] Cunningham, H., et al. (2023). Sparse Autoencoders Find Highly Interpretable Directions in Language Models.
[5] Ekman, P. (1992). An Argument for Basic Emotions. Cognition & Emotion, 6(3-4), 169–200.
[6] Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? EMNLP.
[7] Izard, C. E. (2007). Basic Emotions, Natural Kinds, Emotion Schemas, and a New Paradigm. Perspectives on Psychological Science, 2(3), 260–280.
[8]
[9] Jeong, J. (2026). Neural-MRI: A Diagnostic Scanner for Language Model Internal States. GitHub. https://github.com/JihoonJeong/Neural-MRI
[10]
[11] Jiralerspong, T. & Bricken, T. (2026). A “Diff” Tool for AI. arXiv:2602.11729.
[12] Li, K., et al. (2024). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
[13] Nanda, N., et al. (2023). Progress Measures for Grokking via Mechanistic Interpretability. ICLR.
[14] Russell, J. A. (1980). A Circumplex Model of Affect. J. Personality and Social Psychology, 39(6), 1161–1178.
Serapio-García, G., et al. (2025). Personality Traits in Large Language Models. Nature Machine Intelligence.
[15] Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
[16] Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization.
[17] Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency.

Appendix A. Comprehension Text Passage Validation: External classifier validation of the 60 comprehension passages using j-hartmann/emotion-english-distilroberta-base: overall match rate 27/60 (45%). Basic emotions (happy, sad, angry, afraid, calm): 14/15 = 93%. Nuanced emot…
discussion (0)