AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models
Pith reviewed 2026-05-08 06:19 UTC · model grok-4.3
The pith
A 480-item battery of narrative vignettes tests emotion in language models without using any emotion keywords.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AIPsy-Affect provides 480 clinical items: 192 keyword-free narrative vignettes covering Plutchik's eight primary emotions, 192 matched neutral controls, and additional moderate-intensity and discriminant-validity splits. Linear probes, activation patching, SAE feature analysis, causal ablation, and steering-vector extraction can then be applied under the guarantee that observed distinctions arise from situational affect rather than emotion-keyword presence.
What carries the argument
The matched-pair structure of keyword-free affect-evoking vignettes and neutral controls that share characters, setting, length, and surface structure.
If this is right
- Linear probes and activation patches can now isolate representations of specific emotions without lexical confounds.
- Steering vectors extracted from the battery target genuine situational affect rather than word triggers.
- Sparse autoencoder features can be tested for encoding emotion categories independently of surface vocabulary.
- The fourfold increase in item count relative to the prior 96-item battery supports larger-scale causal experiments.
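The probing and steering recipes listed above share one primitive: a direction in activation space computed from matched pairs. A minimal pure-Python sketch of a difference-of-means direction follows; the 4-dimensional "activations" are invented toy values, not outputs of any real model.

```python
# Minimal sketch: a difference-of-means direction over matched pairs.
# Each vector stands in for a model's hidden state on a clinical vignette
# or its matched neutral control (toy values, not real activations).

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def diff_of_means(clinical, neutral):
    """Direction pointing from neutral-control activations toward
    affect-vignette activations; usable as a probe or steering vector."""
    mc, mn = mean(clinical), mean(neutral)
    return [c - n for c, n in zip(mc, mn)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy matched pairs: clinical items shifted along a shared "affect" axis.
clinical = [[1.0, 0.2, 0.9, 0.1], [1.1, 0.1, 1.0, 0.0]]
neutral  = [[0.1, 0.2, 0.1, 0.1], [0.2, 0.1, 0.2, 0.0]]

direction = diff_of_means(clinical, neutral)

# Project held-out items onto the direction: the clinical item should
# score higher than its neutral counterpart.
score_clin = dot([1.0, 0.15, 0.95, 0.05], direction)
score_neut = dot([0.15, 0.15, 0.15, 0.05], direction)
print(score_clin > score_neut)  # True
```

In a real experiment the vectors would be residual-stream activations collected at a fixed layer and token position, but the pairing logic, and the reason matched controls matter, is exactly this subtraction.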
Where Pith is reading between the lines
- The same matched-pair logic could be applied to other abstract internal states such as moral judgment or intent without keyword cues.
- If models show no reliable distinction between the pairs, the result would indicate that current LLMs lack robust situational emotion understanding.
- The battery's open MIT release allows direct comparison of emotion circuitry across model families and training regimes.
Load-bearing premise
The narrative situations reliably produce the intended emotional states inside the target models rather than some other uncontrolled linguistic or structural difference.
What would settle it
A demonstration that a bag-of-words model can distinguish the clinical vignettes from their matched neutrals at high accuracy, or that a contextual classifier fails to detect affect presence while still succeeding on a keyword-rich control set.
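The bag-of-words half of that test can be sketched in a few lines. The texts below are invented stand-ins for a matched pair, and the cosine check only illustrates the logic; a real test would train a classifier over the full battery.

```python
# Sketch of the bag-of-words falsification test: if word counts alone can
# separate clinical vignettes from matched neutrals, the keyword-free
# guarantee is broken. A well-matched pair should look nearly identical
# to a bag-of-words view. Example texts are invented.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

# Hypothetical matched pair: same character, setting, and length.
clinical = "Mara stared at the letter her hands would not stop shaking"
neutral  = "Mara stared at the letter her hands resting on the table"

# High lexical similarity means a bag-of-words model has little signal
# beyond situational vocabulary.
print(round(cosine(bow(clinical), bow(neutral)), 2))
```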
Original abstract
Mechanistic interpretability research on emotion in large language models -- linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction -- depends on stimuli that contain the words for the emotions they test. When a probe fires on "I am furious", it is unclear whether the model has detected anger or detected the word "furious". The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik's eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery -- bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier -- confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p < 10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AIPsy-Affect, a 480-item stimulus battery for mechanistic interpretability of emotion in LLMs. It comprises 192 keyword-free narrative vignettes evoking Plutchik's eight primary emotions, 192 matched neutral controls sharing characters/setting/length/structure with affect removed, plus moderate-intensity and discriminant-validity splits. The matched-pair design provides a methodological guarantee that any model distinction between paired items cannot stem from emotion-keyword presence. Validation uses a three-method NLP battery: bag-of-words sentiment and emotion lexicons detect only situational vocabulary, while a contextual transformer classifier detects affect presence (p < 10^-15) but achieves only 5.2% top-1 category accuracy (vs. 82.5% on keyword-rich controls). This extends a prior 96-item set by 4x and is released under MIT license.
Significance. If the construction and defense metrics hold, the battery offers a practical advance for interpretability work by removing a pervasive lexical confound. It directly enables cleaner linear probing, activation patching, SAE feature analysis, causal ablation, and steering experiments where observed differences can be attributed to situational affect rather than keyword detection. The open release, explicit no-keyword guarantee, and empirical defense (especially the category-identification failure) make it immediately usable and falsifiable. This strengthens downstream claims about emotion circuits and features.
major comments (2)
- [Abstract] Abstract and validation section: The central guarantee is secured by keyword-free construction plus the reported defense battery, but the paper does not report per-category accuracies or a confusion matrix for the contextual classifier on the AIPsy-Affect items. Without this, it remains possible that the 5.2% top-1 figure masks uneven detection across the eight emotions, which would weaken the claim that category information is absent.
- [Methods] Methods (vignette construction): The matched neutral controls are described as having affect 'surgically removed' while preserving characters, setting, length, and surface structure. No quantitative metrics (e.g., edit distance, lexical overlap scores, or human ratings of structural similarity) are provided to verify that the pairs differ only in emotional content and not in other uncontrolled linguistic features that could drive model distinctions.
minor comments (3)
- [Introduction] The abstract cites Plutchik's eight primary emotions; the introduction should include the original reference (Plutchik 1980) and a brief justification for choosing this taxonomy over alternatives such as Ekman or dimensional models.
- [Abstract] The release statement mentions MIT license and extension from arXiv:2603.22295; include a direct link or DOI to the prior battery and the new dataset repository in the abstract or data-availability statement.
- [Results] Table or figure presenting the NLP defense results should report exact p-values, sample sizes, and baseline accuracies rather than only the summary statistics given in the abstract.
Simulated Author's Rebuttal
We thank the referee for the supportive review and constructive comments. We address the two major points below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] Abstract and validation section: The central guarantee is secured by keyword-free construction plus the reported defense battery, but the paper does not report per-category accuracies or a confusion matrix for the contextual classifier on the AIPsy-Affect items. Without this, it remains possible that the 5.2% top-1 figure masks uneven detection across the eight emotions, which would weaken the claim that category information is absent.
Authors: We agree that the aggregate 5.2% top-1 accuracy alone leaves open the possibility of uneven per-category performance. We will add both the full confusion matrix and per-category accuracies to the validation section. These additional results will be computed on the same contextual classifier and AIPsy-Affect items already described. revision: yes
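The per-category breakdown promised here is straightforward to compute. A minimal sketch, using invented (gold, predicted) label pairs rather than the paper's actual classifier outputs:

```python
# Build a confusion matrix and per-emotion accuracy from (gold, predicted)
# label pairs. The pairs below are invented for illustration.
from collections import defaultdict

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]

def confusion_and_accuracy(pairs):
    matrix = defaultdict(int)      # (gold, predicted) -> count
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in pairs:
        matrix[(gold, pred)] += 1
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    per_cat = {e: correct[e] / total[e] for e in total}
    return matrix, per_cat

pairs = [("fear", "fear"), ("fear", "sadness"),
         ("joy", "trust"), ("anger", "anger")]
matrix, per_cat = confusion_and_accuracy(pairs)
print(per_cat["fear"])  # 0.5
```

An aggregate 5.2% top-1 figure is consistent with many different `per_cat` profiles, which is exactly why the full matrix is the more informative report.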
-
Referee: [Methods] Methods (vignette construction): The matched neutral controls are described as having affect 'surgically removed' while preserving characters, setting, length, and surface structure. No quantitative metrics (e.g., edit distance, lexical overlap scores, or human ratings of structural similarity) are provided to verify that the pairs differ only in emotional content and not in other uncontrolled linguistic features that could drive model distinctions.
Authors: We acknowledge that the original submission did not include quantitative similarity metrics between matched pairs. The pairs were constructed via manual editing to excise affective content while preserving characters, setting, length, and surface structure. We will add Levenshtein edit distance, Jaccard token overlap, and human structural-similarity ratings (with inter-rater reliability) to the methods section to document the closeness of the controls. revision: yes
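Two of the pair-similarity metrics named in this response can be sketched directly; the example pair below is invented, not drawn from the battery.

```python
# Sketch of two automatic pair-similarity metrics: Levenshtein edit
# distance and token-level Jaccard overlap between a clinical vignette
# and its matched neutral. Example texts are invented.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Token-level Jaccard overlap: |A intersect B| / |A union B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

clinical = "the nurse read the chart twice and her voice caught"
neutral  = "the nurse read the chart twice and filed it away"
print(levenshtein(clinical, neutral), round(jaccard(clinical, neutral), 2))
```

Well-matched pairs should show small edit distance relative to length and high Jaccard overlap; outlier pairs on either metric flag uncontrolled structural differences.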
Circularity Check
No significant circularity detected
full rationale
The paper introduces a constructed stimulus battery whose core methodological guarantee—that distinctions between affect vignettes and matched neutrals cannot arise from emotion-keyword presence—follows directly from the explicit design rules (keyword-free narrative situations plus surgical removal of affect in controls) and is corroborated by independent external NLP classifiers. No equations, parameter fits, predictions, or uniqueness theorems appear; the single self-citation to prior work is merely a note on dataset scale and carries no load-bearing role in the central claim. The contribution is therefore self-contained as an empirical artifact whose validity rests on construction and external validation rather than any reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Plutchik's eight primary emotions form a suitable and exhaustive basis for constructing the stimulus categories.
Reference graph
Works this paper leans on
- [1] Michael Keeman. Whether, not which: Mechanistic interpretability reveals dissociable affect reception and emotion categorization in LLMs. arXiv preprint arXiv:2603.22295, March 2026. doi: 10.48550/arXiv.2603.22295. URL https://arxiv.org/abs/2603.22295
- [2] Enrica Troiano, Laura Oberländer, and Roman Klinger. Dimensional modeling of emotions in text with appraisal theories: Corpus creation, annotation reliability, and prediction. arXiv preprint arXiv:2206.05238, June 2022. doi: 10.48550/arXiv.2206.05238. URL https://arxiv.org/abs/2206.05238
- [3] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547, June 2020. doi: 10.48550/arXiv.2005.00547. URL https://arxiv.org/abs/2005.00547
- [4] Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. arXiv preprint arXiv:2604.07729, April 2026. doi: 10.48550/arXiv.2604.07729. URL https://arxiv.org/abs/2604.07729
- [5] Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, and Xiuying Chen. Do LLMs “feel”? Emotion circuits discovery and control. arXiv preprint arXiv:2510.11328, October 2025. doi: 10.48550/arXiv.2510.11328. URL https://arxiv.org/abs/2510.11328
- [6] Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, July 2025. doi: 10.48550/arXiv.2507.21509. URL https://arxiv.org/abs/2507.21509
- [7] Xiutian Zhao, Björn Schuller, and Berrak Sisman. Discovering and causally validating emotion-sensitive neurons in large audio-language models. arXiv preprint arXiv:2601.03115, January 2026. doi: 10.48550/arXiv.2601.03115. URL https://arxiv.org/abs/2601.03115
- [8] Enric Junqué de Fortuny. The SuperEmotion dataset. arXiv preprint arXiv:2505.15348, May 2025. doi: 10.48550/arXiv.2505.15348. URL https://arxiv.org/abs/2505.15348
- [9] Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. Mechanistic interpretability of emotion inference in large language models. arXiv preprint arXiv:2502.05489, February 2025. doi: 10.48550/arXiv.2502.05489. URL https://arxiv.org/abs/2502.05489
- [10] Curt Tigges, Oskar Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in transformer language models. arXiv preprint, 2023.
- [11] Benjamin Reichman, Adar Avsian, and Larry Heck. Emotions where art thou: Understanding and characterizing the emotional latent space of large language models. In International Conference on Learning Representations (ICLR), April 2026. doi: 10.48550/arXiv.2510.22042. URL https://arxiv.org/abs/2510.22042
- [12] Elizabeth A. Phelps and Joseph E. LeDoux. Contributions of the amygdala to emotion processing: From animal models to human behavior. Neuron, 48(2):175–187, October 2005. doi: 10.1016/j.neuron.2005.09.025. URL https://doi.org/10.1016/j.neuron.2005.09.025
- [13] Joseph E. LeDoux. Emotion circuits in the brain. Annual Review of Neuroscience, 23:155–184, March 2000. doi: 10.1146/annurev.neuro.23.1.155. URL https://doi.org/10.1146/annurev.neuro.23.1.155
- [14] Klaus R. Scherer. The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion, 23(7):1307–1351, September 2009. doi: 10.1080/02699930902928969. URL https://doi.org/10.1080/02699930902928969
- [15]
- [16] Saif M. Mohammad and Peter D. Turney. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436–465, September 2012. doi: 10.1111/j.1467-8640.2012.00460.x. URL https://doi.org/10.1111/j.1467-8640.2012.00460.x
- [17] C. J. Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14). AAAI Press, June 2014. doi: 10.1609/icwsm.v8i1.14550. URL https://doi.org/10.1609/icwsm.v8i1.14550