pith. machine review for the scientific record.

arxiv: 2604.04064 · v1 · submitted 2026-04-05 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison

Pith reviewed 2026-05-13 17:08 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords emotion representations · small language models · vector extraction · steering · transformer layers · generation-based methods · model architecture

The pith

Generation-based extraction separates emotions more cleanly than comprehension-based methods in small language models, with signals localizing at middle layers across architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests two methods for extracting emotion vectors from models between 124 million and 3 billion parameters. Generation-based extraction, which uses the model to produce emotion-laden text, yields statistically stronger separation than comprehension-based extraction that relies on reading emotion in prompts. These vectors concentrate around the middle of the transformer stack, producing a U-shaped strength curve that stays the same across five different model families. Steering the models with the extracted vectors produces three distinct behavioral regimes, divided by architecture rather than model size. An external classifier confirms the steering changes in most cases and also reveals cross-lingual effects in one model family.

Core claim

Generation-based extraction produces statistically superior emotion separation, with a Mann-Whitney p-value of 0.007 and a large reported effect size. Emotion representations localize at middle transformer layers, following a U-shaped curve that remains architecture-invariant from 124M to 3B parameters. Steering experiments identify three regimes separated by architecture rather than scale, and an external classifier confirms causal effects at a 92 percent success rate.

What carries the argument

Emotion vectors extracted by contrasting generation-based versus comprehension-based methods, localized by layer-wise probing in transformer blocks.
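
One common way to obtain such vectors, shown here only as a minimal sketch, is a difference of means: average the activations collected for an emotion at one layer and subtract a neutral baseline. The helper names, the toy activations, and the separation score (1 minus mean pairwise cosine similarity) are all assumptions for illustration; this page does not state the paper's exact extraction recipe or metric.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def emotion_vector(emotion_acts, neutral_acts):
    """Difference of means: emotion activations minus a neutral baseline."""
    mu_e = mean_vector(emotion_acts)
    mu_n = mean_vector(neutral_acts)
    return [a - b for a, b in zip(mu_e, mu_n)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def separation(vectors):
    """Hypothetical score: 1 minus the mean pairwise cosine similarity."""
    names = sorted(vectors)
    sims = [cosine(vectors[a], vectors[b])
            for i, a in enumerate(names) for b in names[i + 1:]]
    return 1.0 - sum(sims) / len(sims)

# Toy activations for a single layer (invented numbers, 3 dimensions).
neutral = [[0.0, 0.0, 0.0], [0.2, 0.0, -0.2]]
acts = {
    "happy": [[1.0, 0.1, 0.0], [1.2, -0.1, 0.0]],
    "sad": [[-1.0, 0.0, 0.1], [-1.2, 0.0, -0.1]],
}
vectors = {name: emotion_vector(a, neutral) for name, a in acts.items()}
score = separation(vectors)
```

Repeating the same computation at every layer and plotting the score against depth is one way a layer-wise localization curve could be produced under these assumptions.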

If this is right

  • Middle-layer localization holds across GPT-2, Gemma, Qwen, Llama, and Mistral families, suggesting a shared structural pattern.
  • Steering success splits into surgical transformation, repetitive collapse, or explosive degradation depending on architecture.
  • Instruction tuning modulates the advantage of generation-based extraction.
  • Cross-lingual emotion entanglement appears in Qwen models, activating aligned tokens in other languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the middle-layer pattern generalizes, emotion control techniques developed on one family may transfer to others with minimal retuning.
  • The architecture-dependent steering regimes imply that safety filters tuned on one model type may leave gaps in others of similar size.
  • The U-shaped localization curve could be used to prune or compress models while preserving emotion-handling capacity.

Load-bearing premise

The extracted vectors and steering interventions reflect genuine causal emotion states rather than mere statistical correlations, and the external classifier validates them without inheriting the same biases as the extraction process.

What would settle it

A controlled test in which steering with the extracted vectors fails to shift the predictions of a fully independent emotion classifier on held-out prompts while keeping perplexity stable.
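
The abstract reports the external-classifier check as 37 successes in 40 scenarios. As a rough gauge of how much sampling uncertainty that count carries, the sketch below computes a Wilson score interval. The choice of interval and the 95% level are assumptions made for illustration; the paper is not described as reporting one.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

# 37 of 40 steering scenarios confirmed, as stated in the abstract.
low, high = wilson_interval(37, 40)
```

With 40 scenarios the interval is fairly wide (roughly 0.80 to 0.97), which is worth keeping in mind when reading the headline 92% figure.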

Figures

Figures reproduced from arXiv: 2604.04064 by Jihoon Jeong.

Figure 1: Emotion vector separation by layer depth (SmolLM2-1.7B-Instruct). The U-shaped …
Figure 2: Extraction method × model type interaction. Generation (left) shows wide variance across models; comprehension (right) converges to a narrow band (0.59–0.67). All lines slope downward, confirming the universal generation advantage.
Figure 3: Dose-response curve for GPT-2 (Aggressive …
Figure 4: Three steering regimes visualized by mean activation delta vs. perplexity ratio. Surgical …
Original abstract

Small language models (SLMs) in the 100M-10B parameter range increasingly power production systems, yet whether they possess the internal emotion representations recently discovered in frontier models remains unknown. We present the first comparative analysis of emotion vector extraction methods for SLMs, evaluating 9 models across 5 architectural families (GPT-2, Gemma, Qwen, Llama, Mistral) using 20 emotions and two extraction methods (generation-based and comprehension-based). Generation-based extraction produces statistically superior emotion separation (Mann-Whitney p = 0.007; Cohen's d = -107.5), with the advantage modulated by instruction tuning and architecture. Emotion representations localize at middle transformer layers (~50% depth), following a U-shaped curve that is architecture-invariant from 124M to 3B parameters. We validate these findings against representational anisotropy baselines across 4 models and confirm causal behavioral effects through steering experiments, independently verified by an external emotion classifier (92% success rate, 37/40 scenarios). Steering reveals three regimes -- surgical (coherent text transformation), repetitive collapse, and explosive (text degradation) -- quantified by perplexity ratios and separated by model architecture rather than scale. We document cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens that RLHF does not suppress, raising safety concerns for multilingual deployment. This work provides methodological guidelines for emotion research on open-weight models and contributes to the Model Medicine series by bridging external behavioral profiling with internal representational analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts the first comparative study of emotion vector extraction methods in small language models (SLMs) ranging from 124M to 3B parameters across five architectural families. It evaluates generation-based versus comprehension-based extraction for 20 emotions and finds generation-based extraction superior, with Mann-Whitney p=0.007 and Cohen's d=-107.5. Emotion representations are shown to localize at middle layers (~50% depth) in a U-shaped pattern invariant to architecture. Steering interventions reveal three behavioral regimes differentiated by architecture; an external emotion classifier validates the steering effects with a 92% success rate, and the manuscript notes cross-lingual entanglement in Qwen models.

Significance. Should the reported statistical superiority be confirmed after addressing the effect-size computation, the work provides practical methodological guidelines for emotion representation research in open-weight SLMs. The combination of internal probing, steering for causal effects, and external classifier validation offers a robust empirical framework that bridges representational analysis with behavioral outcomes, contributing to safety considerations in multilingual deployments.

major comments (1)
  1. [Results section (Mann-Whitney test and effect size)] The reported Cohen's d = -107.5 accompanying the Mann-Whitney p = 0.007 for generation-based extraction superiority is implausibly large. In high-dimensional embedding spaces, realistic class separations yield |Cohen's d| values typically below 5; a value of 107.5 implies either degenerate vectors with near-zero variance or an error in the effect-size formula (such as omitting the pooled standard deviation or incorrect scaling). This metric is central to the claim of statistical superiority and the architecture-invariance conclusions, so the computation method must be detailed and corrected if erroneous.
minor comments (2)
  1. [Abstract and Methods] Provide more explicit details on data splits, exact baseline constructions for anisotropy, and how potential confounds like vector normalization are handled to strengthen reproducibility.
  2. [Steering experiments] Clarify the criteria for classifying the three regimes (surgical, repetitive collapse, explosive) and report the perplexity ratio thresholds used.
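
For reference, a perplexity ratio of the kind the second minor comment asks about can be sketched as follows. This assumes perplexity is the exponential of the mean negative log-likelihood per token and that the ratio is steered over baseline; the function names and log-probabilities are invented, and no regime thresholds are assumed because none are reported on this page.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def perplexity_ratio(steered_logprobs, baseline_logprobs):
    """Perplexity of steered text divided by perplexity of unsteered text."""
    return perplexity(steered_logprobs) / perplexity(baseline_logprobs)

# Toy per-token log-probabilities (invented numbers).
baseline = [-2.0, -2.5, -1.5, -2.0]   # mean log-prob -2.0
steered = [-2.2, -2.6, -1.8, -2.2]    # mean log-prob -2.2
ratio = perplexity_ratio(steered, baseline)
```

A ratio near 1 would indicate that steering left fluency roughly intact, while a very large ratio would indicate degraded text; where the paper draws those lines is exactly what the comment requests.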

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful and constructive review. We address the single major comment below and have prepared revisions to the manuscript accordingly.

Point-by-point responses
  1. Referee: [Results section (Mann-Whitney test and effect size)] The reported Cohen's d = -107.5 accompanying the Mann-Whitney p = 0.007 for generation-based extraction superiority is implausibly large. In high-dimensional embedding spaces, realistic class separations yield |Cohen's d| values typically below 5; a value of 107.5 implies either degenerate vectors with near-zero variance or an error in the effect-size formula (such as omitting the pooled standard deviation or incorrect scaling). This metric is central to the claim of statistical superiority and the architecture-invariance conclusions, so the computation method must be detailed and corrected if erroneous.

    Authors: We agree that a Cohen's d of 107.5 is implausibly large and indicates an error in our effect-size computation. The value most likely arose from an incorrect implementation that omitted division by the pooled standard deviation or applied improper scaling to the high-dimensional vectors. In the revised manuscript we will (1) state the exact formula employed, (2) recompute Cohen's d using the standard definition d = (μ_gen - μ_comp) / σ_pooled on the per-emotion separation scores, and (3) report the corrected, realistic effect size together with the associated confidence interval. The direction of the Mann-Whitney result remains unchanged, but the magnitude will be brought into the expected range for embedding-space separations. We will also add a short methods subsection describing the computation and release the corresponding code snippet for reproducibility. revision: yes
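
The standard definition cited in the response can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the score lists are invented, and the pooled standard deviation uses the usual sample-variance weighting.

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d = (mean(a) - mean(b)) / pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Invented per-model scores with ordinary spread.
gen = [0.30, 0.35, 0.40, 0.45, 0.50]
comp = [0.59, 0.61, 0.63, 0.65, 0.67]
d = cohens_d(gen, comp)

# Same mean gap, but almost no within-group variance (also invented).
gen_tight = [0.400, 0.401, 0.402]
comp_tight = [0.630, 0.631, 0.632]
d_tight = cohens_d(gen_tight, comp_tight)
```

In this toy setup the first pair gives a d of a few units, while the second pair, with the same gap between means, exceeds 100 in magnitude. That is one of the two mechanisms the referee names (near-zero variance); it does not establish which mechanism produced the paper's value.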

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements with external validation

full rationale

The paper reports direct empirical results from activation extraction, statistical tests (Mann-Whitney p-values and Cohen's d), layer localization curves, and steering interventions across multiple models. All load-bearing claims rest on observable outputs from the models themselves and an independent external classifier (92% success rate), with no equations, fitted parameters renamed as predictions, self-citations forming the central premise, or derivations that reduce to inputs by construction. The analysis is self-contained against external benchmarks and contains no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The study relies on standard mechanistic interpretability assumptions without introducing new free parameters, axioms beyond domain conventions, or invented entities.

axioms (1)
  • domain assumption: Transformer layer activations can be treated as linearly extractable feature vectors that correspond to human-interpretable concepts such as emotions.
    Implicit foundation for both generation-based and comprehension-based extraction methods.
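
Under this linearity assumption, steering is typically implemented as activation addition: a scaled copy of the emotion vector is added to a hidden state. The sketch below is illustrative only; the function names, the unit normalization of the direction, and the toy numbers are assumptions, and the paper's actual intervention layer and scaling are not given on this page.

```python
import math

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction) for one hidden-state vector."""
    norm = math.sqrt(sum(x * x for x in direction))
    return [h + alpha * x / norm for h, x in zip(hidden, direction)]

def projection(hidden, direction):
    """Scalar projection of hidden onto the unit-normalized direction."""
    norm = math.sqrt(sum(x * x for x in direction))
    return sum(h * x for h, x in zip(hidden, direction)) / norm

hidden = [1.0, 2.0]       # toy hidden state at one token position
direction = [3.0, 4.0]    # toy emotion vector (norm 5)
steered = steer(hidden, direction, alpha=5.0)
shift = projection(steered, direction) - projection(hidden, direction)
```

Because the direction is normalized here, the projection onto it moves by exactly `alpha`, which makes a sweep over `alpha` a natural way to build a dose-response curve.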

pith-pipeline@v0.9.0 · 5565 in / 1275 out tokens · 47925 ms · 2026-05-13T17:08:21.843293+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds

    cs.CL 2026-04 unverdicted novelty 6.0

    Mature small language models share nearly identical 21-emotion geometries across architectures with Spearman correlations 0.74-0.92 despite opposite behavioral profiles, while immature models restructure under RLHF an...

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper

  1. [1]

    Anthropic. (2026). Emotion Concepts and Their Function in a Large Language Model. Transformer Circuits

  2. [2]

    Bloom, J. (2024). SAELens: A Library for Training and Analyzing Sparse Autoencoders. GitHub

  3. [3]

    Burnell, R., et al. (2023). Rethink Reporting of Evaluation Results in AI. Science

  4. [4]

    Cunningham, H., et al. (2023). Sparse Autoencoders Find Highly Interpretable Directions in Language Models

  5. [5]

    Ekman, P. (1992). An Argument for Basic Emotions. Cognition & Emotion, 6(3-4), 169–200

  6. [6]

    Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? EMNLP

  7. [7]

    Izard, C. E. (2007). Basic Emotions, Natural Kinds, Emotion Schemas, and a New Paradigm. Perspectives on Psychological Science, 2(3), 260–280

  8. [8]

    Jeong, J. (2026). MTI: A Behavior-Based Temperament Profiling System for AI Agents. arXiv:2604.02145

  9. [9]

    Jeong, J. (2026). Neural-MRI: A Diagnostic Scanner for Language Model Internal States. GitHub. https://github.com/JihoonJeong/Neural-MRI

  10. [10]

    Jeong, J. (2026). Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models. arXiv:2603.04722

  11. [11]

    Jiralerspong, T. & Bricken, T. (2026). A “Diff” Tool for AI. arXiv:2602.11729

  12. [12]

    Li, K., et al. (2024). Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

  13. [13]

    Nanda, N., et al. (2023). Progress Measures for Grokking via Mechanistic Interpretability. ICLR.

  14. [14]

    Russell, J. A. (1980). A Circumplex Model of Affect. J. Personality and Social Psychology, 39(6), 1161–1178.

    Serapio-García, G., et al. (2025). Personality Traits in Large Language Models. Nature Machine Intelligence

  15. [15]

    Templeton, A., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

  16. [16]

    Turner, A., et al. (2023). Activation Addition: Steering Language Models Without Optimization

  17. [17]

    Zou, A., et al. (2023). Representation Engineering: A Top-Down Approach to AI Transparency.

    A Comprehension Text Passage Validation

    External classifier validation of the 60 comprehension passages using j-hartmann/emotion-english-distilroberta-base: overall match rate 27/60 (45%). Basic emotions (happy, sad, angry, afraid, calm): 14/15 = 93%. Nuanced emot...