AIPsy-Affect: A Keyword-Free Clinical Stimulus Battery for Mechanistic Interpretability of Emotion in Language Models
Pith reviewed 2026-05-08 06:19 UTC · model grok-4.3
The pith
A 480-item battery of narrative vignettes tests emotion in language models without using any emotion keywords.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AIPsy-Affect provides 480 clinical items: 192 keyword-free narrative vignettes covering Plutchik's eight primary emotions, 192 matched neutral controls, and additional moderate-intensity and discriminant-validity splits. Linear probes, activation patching, SAE feature analysis, causal ablation, and steering-vector extraction can then be applied under the guarantee that observed distinctions arise from situational affect rather than emotion-keyword presence.
What carries the argument
The matched-pair structure of keyword-free affect-evoking vignettes and neutral controls that share characters, setting, length, and surface structure.
If this is right
- Linear probes and activation patches can now isolate representations of specific emotions without lexical confounds.
- Steering vectors extracted from the battery target genuine situational affect rather than word triggers.
- Sparse autoencoder features can be tested for encoding emotion categories independently of surface vocabulary.
- The fourfold increase in item count relative to the prior 96-item battery supports larger-scale causal experiments.
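The probing and steering recipes listed above share one primitive: a direction in activation space computed from matched pairs. A minimal pure-Python sketch of a difference-of-means direction follows; the 4-dimensional "activations" are invented toy values, not outputs of any real model.

```python
# Minimal sketch: a difference-of-means direction over matched pairs.
# Each vector stands in for a model's hidden state on a clinical vignette
# or its matched neutral control (toy values, not real activations).

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def diff_of_means(clinical, neutral):
    """Direction pointing from neutral-control activations toward
    affect-vignette activations; usable as a probe or steering vector."""
    mc, mn = mean(clinical), mean(neutral)
    return [c - n for c, n in zip(mc, mn)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy matched pairs: clinical items shifted along a shared "affect" axis.
clinical = [[1.0, 0.2, 0.9, 0.1], [1.1, 0.1, 1.0, 0.0]]
neutral  = [[0.1, 0.2, 0.1, 0.1], [0.2, 0.1, 0.2, 0.0]]

direction = diff_of_means(clinical, neutral)

# Project held-out items onto the direction: the clinical item should
# score higher than its neutral counterpart.
score_clin = dot([1.0, 0.15, 0.95, 0.05], direction)
score_neut = dot([0.15, 0.15, 0.15, 0.05], direction)
print(score_clin > score_neut)  # True
```

In a real experiment the vectors would be residual-stream activations collected at a fixed layer and token position, but the pairing logic, and the reason matched controls matter, is exactly this subtraction.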
Where Pith is reading between the lines
- The same matched-pair logic could be applied to other abstract internal states such as moral judgment or intent without keyword cues.
- If models show no reliable distinction between the pairs, the result would indicate that current LLMs lack robust situational emotion understanding.
- The battery's open MIT release allows direct comparison of emotion circuitry across model families and training regimes.
Load-bearing premise
The narrative situations reliably produce the intended emotional states inside the target models rather than some other uncontrolled linguistic or structural difference.
What would settle it
A demonstration that a bag-of-words model can distinguish the clinical vignettes from their matched neutrals at high accuracy, or that a contextual classifier fails to detect affect presence while still succeeding on a keyword-rich control set.
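The bag-of-words half of that test can be sketched in a few lines. The texts below are invented stand-ins for a matched pair, and the cosine check only illustrates the logic; a real test would train a classifier over the full battery.

```python
# Sketch of the bag-of-words falsification test: if word counts alone can
# separate clinical vignettes from matched neutrals, the keyword-free
# guarantee is broken. A well-matched pair should look nearly identical
# to a bag-of-words view. Example texts are invented.
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (sum(v * v for v in a.values()) ** 0.5) * \
          (sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

# Hypothetical matched pair: same character, setting, and length.
clinical = "Mara stared at the letter her hands would not stop shaking"
neutral  = "Mara stared at the letter her hands resting on the table"

# High lexical similarity means a bag-of-words model has little signal
# beyond situational vocabulary.
print(round(cosine(bow(clinical), bow(neutral)), 2))
```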
Original abstract
Mechanistic interpretability research on emotion in large language models -- linear probing, activation patching, sparse autoencoder (SAE) feature analysis, causal ablation, steering vector extraction -- depends on stimuli that contain the words for the emotions they test. When a probe fires on "I am furious", it is unclear whether the model has detected anger or detected the word "furious". The two readings have very different consequences for every downstream claim about emotion circuits, features, and interventions. We release AIPsy-Affect, a 480-item clinical stimulus battery that removes the confound at the stimulus level: 192 keyword-free vignettes evoking each of Plutchik's eight primary emotions through narrative situation alone, 192 matched neutral controls that share characters, setting, length, and surface structure with the affect surgically removed, plus moderate-intensity and discriminant-validity splits. The matched-pair structure supports linear probing, activation patching, SAE feature analysis, causal ablation, and steering vector extraction under a strong methodological guarantee: any internal representation that distinguishes a clinical item from its matched neutral cannot be doing so on the basis of emotion-keyword presence. A three-method NLP defense battery -- bag-of-words sentiment, an emotion-category lexicon, and a contextual transformer classifier -- confirms the property: bag-of-words methods see only situational vocabulary, and a contextual classifier detects affect (p < 10^-15) but cannot identify the category (5.2% top-1 vs. 82.5% on a keyword-rich control). AIPsy-Affect extends our earlier 96-item battery (arXiv:2603.22295) by a factor of four and is released openly under MIT license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AIPsy-Affect, a 480-item stimulus battery for mechanistic interpretability of emotion in LLMs. It comprises 192 keyword-free narrative vignettes evoking Plutchik's eight primary emotions, 192 matched neutral controls sharing characters/setting/length/structure with affect removed, plus moderate-intensity and discriminant-validity splits. The matched-pair design provides a methodological guarantee that any model distinction between paired items cannot stem from emotion-keyword presence. Validation uses a three-method NLP battery: bag-of-words sentiment and emotion lexicons detect only situational vocabulary, while a contextual transformer classifier detects affect presence (p < 10^-15) but achieves only 5.2% top-1 category accuracy (vs. 82.5% on keyword-rich controls). This extends a prior 96-item set by 4x and is released under MIT license.
Significance. If the construction and defense metrics hold, the battery offers a practical advance for interpretability work by removing a pervasive lexical confound. It directly enables cleaner linear probing, activation patching, SAE feature analysis, causal ablation, and steering experiments where observed differences can be attributed to situational affect rather than keyword detection. The open release, explicit no-keyword guarantee, and empirical defense (especially the category-identification failure) make it immediately usable and falsifiable. This strengthens downstream claims about emotion circuits and features.
major comments (2)
- [Abstract] Abstract and validation section: The central guarantee is secured by keyword-free construction plus the reported defense battery, but the paper does not report per-category accuracies or a confusion matrix for the contextual classifier on the AIPsy-Affect items. Without this, it remains possible that the 5.2% top-1 figure masks uneven detection across the eight emotions, which would weaken the claim that category information is absent.
- [Methods] Methods (vignette construction): The matched neutral controls are described as having affect 'surgically removed' while preserving characters, setting, length, and surface structure. No quantitative metrics (e.g., edit distance, lexical overlap scores, or human ratings of structural similarity) are provided to verify that the pairs differ only in emotional content and not in other uncontrolled linguistic features that could drive model distinctions.
minor comments (3)
- [Introduction] The abstract cites Plutchik's eight primary emotions; the introduction should include the original reference (Plutchik 1980) and a brief justification for choosing this taxonomy over alternatives such as Ekman or dimensional models.
- [Abstract] The release statement mentions MIT license and extension from arXiv:2603.22295; include a direct link or DOI to the prior battery and the new dataset repository in the abstract or data-availability statement.
- [Results] Table or figure presenting the NLP defense results should report exact p-values, sample sizes, and baseline accuracies rather than only the summary statistics given in the abstract.
Simulated Author's Rebuttal
We thank the referee for the supportive review and constructive comments. We address the two major points below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] Abstract and validation section: The central guarantee is secured by keyword-free construction plus the reported defense battery, but the paper does not report per-category accuracies or a confusion matrix for the contextual classifier on the AIPsy-Affect items. Without this, it remains possible that the 5.2% top-1 figure masks uneven detection across the eight emotions, which would weaken the claim that category information is absent.
Authors: We agree that the aggregate 5.2% top-1 accuracy alone leaves open the possibility of uneven per-category performance. We will add both the full confusion matrix and per-category accuracies to the validation section. These additional results will be computed on the same contextual classifier and AIPsy-Affect items already described. revision: yes
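The per-category breakdown promised here is straightforward to compute. A minimal sketch, using invented (gold, predicted) label pairs rather than the paper's actual classifier outputs:

```python
# Build a confusion matrix and per-emotion accuracy from (gold, predicted)
# label pairs. The pairs below are invented for illustration.
from collections import defaultdict

EMOTIONS = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]

def confusion_and_accuracy(pairs):
    matrix = defaultdict(int)      # (gold, predicted) -> count
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in pairs:
        matrix[(gold, pred)] += 1
        total[gold] += 1
        if gold == pred:
            correct[gold] += 1
    per_cat = {e: correct[e] / total[e] for e in total}
    return matrix, per_cat

pairs = [("fear", "fear"), ("fear", "sadness"),
         ("joy", "trust"), ("anger", "anger")]
matrix, per_cat = confusion_and_accuracy(pairs)
print(per_cat["fear"])  # 0.5
```

An aggregate 5.2% top-1 figure is consistent with many different `per_cat` profiles, which is exactly why the full matrix is the more informative report.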
-
Referee: [Methods] Methods (vignette construction): The matched neutral controls are described as having affect 'surgically removed' while preserving characters, setting, length, and surface structure. No quantitative metrics (e.g., edit distance, lexical overlap scores, or human ratings of structural similarity) are provided to verify that the pairs differ only in emotional content and not in other uncontrolled linguistic features that could drive model distinctions.
Authors: We acknowledge that the original submission did not include quantitative similarity metrics between matched pairs. The pairs were constructed via manual editing to excise affective content while preserving characters, setting, length, and surface structure. We will add Levenshtein edit distance, Jaccard token overlap, and human structural-similarity ratings (with inter-rater reliability) to the methods section to document the closeness of the controls. revision: yes
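Two of the pair-similarity metrics named in this response can be sketched directly; the example pair below is invented, not drawn from the battery.

```python
# Sketch of two automatic pair-similarity metrics: Levenshtein edit
# distance and token-level Jaccard overlap between a clinical vignette
# and its matched neutral. Example texts are invented.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Token-level Jaccard overlap: |A intersect B| / |A union B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

clinical = "the nurse read the chart twice and her voice caught"
neutral  = "the nurse read the chart twice and filed it away"
print(levenshtein(clinical, neutral), round(jaccard(clinical, neutral), 2))
```

Well-matched pairs should show small edit distance relative to length and high Jaccard overlap; outlier pairs on either metric flag uncontrolled structural differences.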
Circularity Check
No significant circularity detected
full rationale
The paper introduces a constructed stimulus battery whose core methodological guarantee—that distinctions between affect vignettes and matched neutrals cannot arise from emotion-keyword presence—follows directly from the explicit design rules (keyword-free narrative situations plus surgical removal of affect in controls) and is corroborated by independent external NLP classifiers. No equations, parameter fits, predictions, or uniqueness theorems appear; the single self-citation to prior work is merely a note on dataset scale and carries no load-bearing role in the central claim. The contribution is therefore self-contained as an empirical artifact whose validity rests on construction and external validation rather than any reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Plutchik's eight primary emotions form a suitable and exhaustive basis for constructing the stimulus categories.
Reference graph
Works this paper leans on
- [1] Michael Keeman. Whether, not which: Mechanistic interpretability reveals dissociable affect reception and emotion categorization in LLMs. arXiv preprint arXiv:2603.22295, March 2026. doi: 10.48550/arXiv.2603.22295. URL https://arxiv.org/abs/2603.22295
- [2] Enrica Troiano, Laura Oberländer, and Roman Klinger. Dimensional modeling of emotions in text with appraisal theories: Corpus creation, annotation reliability, and prediction. arXiv preprint arXiv:2206.05238, June 2022. doi: 10.48550/arXiv.2206.05238. URL https://arxiv.org/abs/2206.05238
- [3] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547, June 2020. doi: 10.48550/arXiv.2005.00547. URL https://arxiv.org/abs/2005.00547
- [4] Nicholas Sofroniew, Isaac Kauvar, William Saunders, Runjin Chen, Tom Henighan, Sasha Hydrie, Craig Citro, Adam Pearce, Julius Tarng, Wes Gurnee, Joshua Batson, Sam Zimmerman, Kelley Rivoire, Kyle Fish, Chris Olah, and Jack Lindsey. Emotion concepts and their function in a large language model. arXiv preprint arXiv:2604.07729, April 2026. doi: 10.48550/arXiv.2604.07729. URL https://arxiv.org/abs/2604.07729
- [5] Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, and Xiuying Chen. Do LLMs “feel”? Emotion circuits discovery and control. arXiv preprint arXiv:2510.11328, October 2025. doi: 10.48550/arXiv.2510.11328. URL https://arxiv.org/abs/2510.11328
- [6] Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, July 2025. doi: 10.48550/arXiv.2507.21509. URL https://arxiv.org/abs/2507.21509
- [7] Xiutian Zhao, Björn Schuller, and Berrak Sisman. Discovering and causally validating emotion-sensitive neurons in large audio-language models. arXiv preprint arXiv:2601.03115, January 2026. doi: 10.48550/arXiv.2601.03115. URL https://arxiv.org/abs/2601.03115
- [8] Enric Junqué de Fortuny. The SuperEmotion dataset. arXiv preprint arXiv:2505.15348, May 2025. doi: 10.48550/arXiv.2505.15348. URL https://arxiv.org/abs/2505.15348
- [9] Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. Mechanistic interpretability of emotion inference in large language models. arXiv preprint arXiv:2502.05489, February 2025. doi: 10.48550/arXiv.2502.05489. URL https://arxiv.org/abs/2502.05489
- [10] Curt Tigges, Oskar Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in transformer language models. arXiv preprint, 2023.
- [11] Benjamin Reichman, Adar Avsian, and Larry Heck. Emotions where art thou: Understanding and characterizing the emotional latent space of large language models. In International Conference on Learning Representations (ICLR), April 2026. doi: 10.48550/arXiv.2510.22042. URL https://arxiv.org/abs/2510.22042
- [12] Elizabeth A. Phelps and Joseph E. LeDoux. Contributions of the amygdala to emotion processing: From animal models to human behavior. Neuron, 48(2):175–187, October 2005. doi: 10.1016/j.neuron.2005.09.025. URL https://doi.org/10.1016/j.neuron.2005.09.025
- [13] Joseph E. LeDoux. Emotion circuits in the brain. Annual Review of Neuroscience, 23:155–184, March 2000. doi: 10.1146/annurev.neuro.23.1.155. URL https://doi.org/10.1146/annurev.neuro.23.1.155
- [14] Klaus R. Scherer. The dynamic architecture of emotion: Evidence for the component process model. Cognition and Emotion, 23(7):1307–1351, September 2009. doi: 10.1080/02699930902928969. URL https://doi.org/10.1080/02699930902928969
- [15]
- [16] Saif M. Mohammad and Peter D. Turney. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436–465, September 2012. doi: 10.1111/j.1467-8640.2012.00460.x. URL https://doi.org/10.1111/j.1467-8640.2012.00460.x
- [17] C. J. Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media (ICWSM-14). AAAI Press, June 2014. doi: 10.1609/icwsm.v8i1.14550. URL https://doi.org/10.1609/icwsm.v8i1.14550