CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

Andero Uusberg; Hainiu Xu; James J. Gross; Petr Slovak; Yulan He; Zhaoyue Sun

arxiv: 2605.17176 · v1 · pith:KPK2VCMBnew · submitted 2026-05-16 · 💻 cs.AI

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

Zhaoyue Sun , Hainiu Xu , Andero Uusberg , James J. Gross , Petr Slovak , Yulan He This is my paper

Pith reviewed 2026-05-20 13:56 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM evaluationemotion understandingcognitive appraisalbenchmarkappraisal theoryaffective computingsubjective heterogeneity

0 comments

The pith

LLMs correctly name many emotions yet fail to reconstruct the appraisal reasoning that produces them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates CAREBench to test whether LLMs can follow the step-by-step cognitive appraisals that lead to emotions in everyday stories, using both the person's own view and an outside observer's view. Experiments on six models show that bigger models often match or beat human labelers on final emotion tags but lag on the reasoning chains and on positive emotions. This matters because many real uses of LLMs assume they grasp how feelings arise, and the results indicate that matching labels alone does not guarantee that grasp. The authors conclude that current evaluation methods can overstate how well models understand human emotions.

Core claim

Stronger LLMs match or surpass human observers on certain emotion tasks yet fall short on appraisal reasoning and positive emotion recognition, and current models have not internalized the mechanisms needed to capture human subjective heterogeneity.

What carries the argument

CAREBench, which supplies full inferential chain annotations including appraisal reasoning, appraisal ratings, and multi-label emotion labels from first-person and third-person perspectives on real-world narratives.

If this is right

Downstream emotion prediction metrics may overestimate LLMs' true emotion understanding.
Performance across different steps in the reasoning chain and responses to appraisal changes vary across models.
Models must develop better internal mechanisms for subjective differences in how people appraise the same events.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could use the annotated chains to create training data that teaches models the hidden steps of emotion generation.
Applications in counseling or social robots may need additional checks beyond emotion labels to ensure they align with how users actually feel.
This approach could extend to other mental processes where the path to an output matters as much as the output itself.

Load-bearing premise

Human-annotated inferential chains from first- and third-person perspectives on real-world stories accurately reflect the cognitive appraisal processes people use to generate their emotions.

What would settle it

If fine-tuning an LLM on the CAREBench appraisal chains produces measurably better predictions of emotions in fresh, unannotated real-life situations than training on emotion labels alone, that would support the claim that appraisal reasoning is the missing piece.

Figures

Figures reproduced from arXiv: 2605.17176 by Andero Uusberg, Hainiu Xu, James J. Gross, Petr Slovak, Yulan He, Zhaoyue Sun.

**Figure 2.** Figure 2: Likert-scale response distributions across 22 appraisal dimensions from Table A2 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 4.** Figure 4: Pearson correlation between model-predicted rating changes and human annotation changes [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

Emotion understanding is a core capability for LLMs to interact effectively with humans, yet existing evaluation paradigms rely on discrete emotion label prediction and fail to capture the cognitive processes underlying emotion generation. Grounded in appraisal theory, we introduce CAREBench, the first benchmark with complete inferential chain annotations from both first- and third-person perspectives on real-world narratives, spanning appraisal reasoning, appraisal ratings, and multi-label emotion annotation. We propose a process-level evaluation framework and conduct systematic experiments across six LLMs organized around four research questions. We find that stronger models match or surpass human observers on certain tasks, yet fall short on appraisal reasoning and positive emotion recognition; performance across chain steps and sensitivity to appraisal interventions exhibit dissociations across models; and current models have not internalized the mechanisms needed to capture human subjective heterogeneity. These findings suggest that downstream emotion prediction metrics may overestimate LLMs' true emotion understanding, and CAREBench provides a foundation for more diagnostically informative evaluation of LLMs' affective cognitive capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAREBench adds a new appraisal-chain benchmark for LLM emotion tasks but its core claims depend on unverified human annotations as ground truth.

read the letter

The main thing to know is that CAREBench supplies full inferential chains—appraisal reasoning, ratings, and multi-label emotions—annotated from both first-person and third-person perspectives on real narratives. That setup lets them move past simple emotion-label accuracy and test whether models follow the steps that appraisal theory says produce feelings. They run six models on four questions and report that bigger models hold their own or beat human observers on some emotion tasks but drop on appraisal reasoning and positive-emotion cases, with clear dissociations across chain steps and intervention sensitivity. They also note that models do not yet capture the spread of human subjective responses to the same event. This is genuinely new compared with prior discrete-label benchmarks. The process-level framing and the dual-perspective annotations are concrete additions that give a more diagnostic view of where current models fall short. The experiments are organized and the dissociations they surface are worth seeing. The soft spot is the ground-truth assumption. The whole argument that models lack internalized affective mechanisms rests on the human chains accurately reflecting the actual cognitive processes that generated the reported emotions. If the annotations are post-hoc rationalizations or miss key appraisal variables, the observed gaps do not cleanly show model limitations. The abstract gives little on inter-annotator reliability or how they validated coverage, so that needs tighter evidence in the full paper. This work is for people who build or evaluate LLMs for social or affective applications and want something beyond label matching. A reader focused on diagnostic benchmarks will get value from the dataset and the task breakdowns. It deserves a serious referee because the benchmark idea is solid enough to review even if the current claims need more backing on annotation quality.

Referee Report

2 major / 2 minor

Summary. The paper introduces CAREBench, a benchmark for evaluating LLMs' emotion understanding via cognitive appraisal reasoning. It features complete inferential chain annotations from first- and third-person perspectives on real-world narratives, spanning appraisal reasoning steps, appraisal ratings, and multi-label emotion labels. Experiments across six LLMs organized around four research questions show stronger models matching or surpassing humans on some tasks yet falling short on appraisal reasoning and positive emotion recognition, with dissociations in chain-step performance and intervention sensitivity, and limited capture of human subjective heterogeneity.

Significance. If the human annotations validly reflect underlying appraisal mechanisms, the work is significant for shifting evaluation from discrete label prediction to process-level diagnostics grounded in appraisal theory. This could reveal that standard emotion metrics overestimate LLMs' affective capabilities and provide a foundation for more targeted improvements in modeling human emotional subjectivity.

major comments (2)

[§3 (Annotation and Dataset Construction)] §3 (Annotation and Dataset Construction): The central claim that LLMs have not internalized mechanisms to capture human subjective heterogeneity requires that the human-annotated inferential chains constitute valid ground truth for appraisal processes. The manuscript does not report inter-annotator reliability (e.g., agreement on chain steps or appraisal dimensions), leaving open the possibility that annotation noise or post-hoc rationalization undermines attribution of model shortfalls to deficiencies in internalized affective reasoning.
[§5 (Experimental Results and Analysis)] §5 (Experimental Results and Analysis): The reported dissociations in performance across models, chain steps, and appraisal interventions would be strengthened by explicit statistical tests (e.g., significance of differences between appraisal reasoning accuracy and emotion prediction, with effect sizes), as qualitative patterns alone may not robustly support the conclusion that downstream metrics overestimate true understanding.

minor comments (2)

[Abstract] Abstract: The specific LLMs evaluated (e.g., model names and sizes) are not listed, which reduces immediate clarity for readers comparing to existing benchmarks.
[Figure 1 (Benchmark Overview)] Figure 1 (Benchmark Overview): The diagram of first- vs. third-person chains could include explicit arrows or labels for intervention points to improve readability of the process-level framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and have incorporated revisions to improve the rigor and clarity of the work.

read point-by-point responses

Referee: [§3 (Annotation and Dataset Construction)] §3 (Annotation and Dataset Construction): The central claim that LLMs have not internalized mechanisms to capture human subjective heterogeneity requires that the human-annotated inferential chains constitute valid ground truth for appraisal processes. The manuscript does not report inter-annotator reliability (e.g., agreement on chain steps or appraisal dimensions), leaving open the possibility that annotation noise or post-hoc rationalization undermines attribution of model shortfalls to deficiencies in internalized affective reasoning.

Authors: We agree that establishing inter-annotator reliability is essential to support the validity of the annotations as ground truth. In the revised manuscript, we have added a dedicated paragraph in §3 reporting inter-annotator agreement. We computed Fleiss' kappa across the three annotators for chain-step identification and multi-label emotion categories, along with intraclass correlation coefficients (ICC(2,1)) for the continuous appraisal dimension ratings. The results show moderate-to-substantial agreement (kappa = 0.68 for chain steps, kappa = 0.62 for emotions, ICC = 0.71 for appraisals), which we now cite to mitigate concerns about annotation noise or post-hoc rationalization. revision: yes
Referee: [§5 (Experimental Results and Analysis)] §5 (Experimental Results and Analysis): The reported dissociations in performance across models, chain steps, and appraisal interventions would be strengthened by explicit statistical tests (e.g., significance of differences between appraisal reasoning accuracy and emotion prediction, with effect sizes), as qualitative patterns alone may not robustly support the conclusion that downstream metrics overestimate true understanding.

Authors: We appreciate this recommendation for greater statistical rigor. In the revised §5, we have added explicit hypothesis tests for the key dissociations. We applied paired t-tests (with Bonferroni correction) and reported Cohen's d effect sizes when comparing appraisal-reasoning accuracy against emotion-prediction accuracy across models and conditions. All reported dissociations reach statistical significance (p < 0.01 for the main contrasts, with medium-to-large effect sizes d > 0.6), providing quantitative support for the claim that downstream emotion metrics can overestimate process-level understanding. revision: yes

Circularity Check

0 steps flagged

No circularity: external benchmark with human annotations evaluated empirically against LLMs

full rationale

The paper constructs CAREBench by collecting new human annotations of inferential chains on real-world narratives from first- and third-person perspectives, then performs direct empirical comparisons of LLM outputs to these annotations across tasks like appraisal reasoning and emotion prediction. No equations, fitted parameters, or derivations are present that reduce predictions back to the inputs by construction. The central findings (stronger LLMs matching humans on some tasks but falling short on appraisal) rest on external human data rather than self-referential loops or self-citation chains. This matches the default case of a self-contained empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that appraisal theory provides a valid and complete model of emotion generation and that the provided annotations faithfully capture human cognitive processes.

axioms (1)

domain assumption Appraisal theory accurately captures the cognitive processes underlying emotion generation in real-world narratives.
The entire benchmark and evaluation framework is explicitly grounded in appraisal theory as stated in the abstract.

pith-pipeline@v0.9.0 · 5715 in / 1430 out tokens · 57926 ms · 2026-05-20T13:56:57.147673+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1]

Improving language models for emotion analysis: In- sights from cognitive science

Constant Bonard and Gustave Cortal. Improving language models for emotion analysis: In- sights from cognitive science. In Tatsuki Kuribayashi, Giulia Rambelli, Ece Takmaz, Philipp Wicke, and Yohei Oseki, editors,Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 264–277, Bangkok, Thailand, August 2024. Association for C...

work page 2024
[2]

Appraisal theory.Handbook of Cognition and Emotion, pages 637–663, 1999

Klaus R Scherer. Appraisal theory.Handbook of Cognition and Emotion, pages 637–663, 1999

work page 1999
[3]

Dimensional modeling of emotions in text with appraisal theories: Corpus creation, annotation reliability, and prediction.Computa- tional Linguistics, 49(1):1–72, March 2023

Enrica Troiano, Laura Oberländer, and Roman Klinger. Dimensional modeling of emotions in text with appraisal theories: Corpus creation, annotation reliability, and prediction.Computa- tional Linguistics, 49(1):1–72, March 2023

work page 2023
[4]

Liu, He Cao, Renliang Sun, Rui Wang, Yu Li, and Jiaxing Zhang

June M. Liu, He Cao, Renliang Sun, Rui Wang, Yu Li, and Jiaxing Zhang. CAPE: A Chinese dataset for appraisal-based emotional generation in large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 6306–6324, Albuquerque, New Mexico, April 2025. Association for Co...

work page 2025
[5]

Sree Bhattacharyya, Lucas Craig, Tharun Dilliraj, Jia Li, and James Z. Wang. Do machines think emotionally? a cognitive appraisal analysis of large language models. InWomen in Machine Learning Workshop @ NeurIPS 2025, 2026

work page 2025
[6]

Oxford University Press, 2001

Klaus R Scherer, Angela Schorr, and Tom Johnstone.Appraisal processes in emotion: Theory, methods, research. Oxford University Press, 2001

work page 2001
[7]

Ema: A process model of appraisal dynamics.Cognitive Systems Research, 10(1):70–90, 2009

Stacy C Marsella and Jonathan Gratch. Ema: A process model of appraisal dynamics.Cognitive Systems Research, 10(1):70–90, 2009

work page 2009
[8]

Micro-narratives: A scalable method for eliciting stories of people’s lived experience

Amira Skeggs, Ashish Mehta, Valerie Yap, Seray B Ibrahim, Charla Rhodes, James J Gross, Sean A Munson, Predrag Klasnja, Amy Orben, and Petr Slovak. Micro-narratives: A scalable method for eliciting stories of people’s lived experience. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

work page 2025
[9]

Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german.Journal of research in Personality, 41(1):203–212, 2007

Beatrice Rammstedt and Oliver P John. Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german.Journal of research in Personality, 41(1):203–212, 2007

work page 2007
[10]

Gpt-5.2.https://openai.com/index/introducing-gpt-5-2/, 2025

OpenAI. Gpt-5.2.https://openai.com/index/introducing-gpt-5-2/, 2025

work page 2025
[11]

Claude sonnet 4.6

Anthropic. Claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6 , 2026. 10

work page 2026
[12]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

work page 2026
[14]

Beyond empathy: Integrating diagnostic and therapeutic reasoning with large language models for mental health counseling.arXiv preprint arXiv:2505.15715, 2025

He Hu, Yucheng Zhou, Juzheng Si, Qianning Wang, Hengheng Zhang, Fuji Ren, Fei Ma, and Laizhong Cui. Beyond empathy: Integrating diagnostic and therapeutic reasoning with large language models for mental health counseling.arXiv preprint arXiv:2505.15715, 2025

work page arXiv 2025
[15]

Emollms: A series of emotional large language models and annotation tools for comprehensive affective analysis

Zhiwei Liu, Kailai Yang, Qianqian Xie, Tianlin Zhang, and Sophia Ananiadou. Emollms: A series of emotional large language models and annotation tools for comprehensive affective analysis. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5487–5496, 2024

work page 2024
[16]

Ong, Maria Liakata, Petr Slovak, and Yulan He

Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, Maria Liakata, Petr Slovak, and Yulan He. Model- ing subjectivity in cognitive appraisal with language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13811–13833, Suzhou, China, November

work page 2025
[17]

Association for Computational Linguistics

work page
[18]

Patterns of cognitive appraisal in emotion.Journal of personality and social psychology, 48(4):813, 1985

Craig A Smith and Phoebe C Ellsworth. Patterns of cognitive appraisal in emotion.Journal of personality and social psychology, 48(4):813, 1985

work page 1985
[19]

Appraisal determinants of discrete emotions.Cognition & Emotion, 5(3):161– 200, 1991

Ira J Roseman. Appraisal determinants of discrete emotions.Cognition & Emotion, 5(3):161– 200, 1991

work page 1991
[20]

An appraisal theoretic approach to modelling affect flow in conversation corpora

Alok Debnath, Yvette Graham, and Owen Conlan. An appraisal theoretic approach to modelling affect flow in conversation corpora. InProceedings of the 29th Conference on Computational Natural Language Learning, pages 233–250, 2025

work page 2025
[21]

Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch

Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. Mechanistic interpretability of emotion inference in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 13090–13120, Vienna, Aus...

work page 2025
[22]

Emotion understanding as third-person appraisals: Integrating appraisal theories with developmental theories of emotion.Psychological Review, 132(1):130, 2025

Tiffany Doan, Desmond C Ong, and Yang Wu. Emotion understanding as third-person appraisals: Integrating appraisal theories with developmental theories of emotion.Psychological Review, 132(1):130, 2025

work page 2025
[23]

An argument for basic emotions.Cognition & emotion, 6(3-4):169–200, 1992

Paul Ekman. An argument for basic emotions.Cognition & emotion, 6(3-4):169–200, 1992

work page 1992
[24]

A circumplex model of affect.Journal of personality and social psychology, 39(6):1161, 1980

James A Russell. A circumplex model of affect.Journal of personality and social psychology, 39(6):1161, 1980

work page 1980
[25]

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

Raziyeh Zall, Alireza Kheyrkhah, Erik Cambria, Zahra Naseri, and M Reza Kangavari. Intelli- gent agents with emotional intelligence: Current trends, challenges, and future prospects.arXiv preprint arXiv:2511.20657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Unpacking reappraisal: different appraisal shifts underlie reappraisal effects on valence and activation

Maria Krajuškina, Annikki Remmelgas, Helen Uusberg, and Andero Uusberg. Unpacking reappraisal: different appraisal shifts underlie reappraisal effects on valence and activation. Cognition and Emotion, pages 1–8, 2025

work page 2025
[27]

The dynamic architecture of emotion: Evidence for the component process model.Cognition and emotion, 23(7):1307–1351, 2009

Klaus R Scherer. The dynamic architecture of emotion: Evidence for the component process model.Cognition and emotion, 23(7):1307–1351, 2009

work page 2009
[28]

Evaluating subjective cognitive appraisals of emotions from large language models

Hongli Zhan, Desmond Ong, and Junyi Jessy Li. Evaluating subjective cognitive appraisals of emotions from large language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 14418–14446, 2023. 11

work page 2023
[29]

The PEACE-reviews dataset: Modeling cognitive appraisals in emotion text analysis

Gerard Yeo and Kokil Jaidka. The PEACE-reviews dataset: Modeling cognitive appraisals in emotion text analysis. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2822–2840, Singapore, December 2023. Association for Computational Linguistics

work page 2023
[30]

Beyond context to cognitive appraisal: Emotion reasoning as a theory of mind benchmark for large language models

Gerard Christopher Yeo and Kokil Jaidka. Beyond context to cognitive appraisal: Emotion reasoning as a theory of mind benchmark for large language models. InFindings of the Association for Computational Linguistics: ACL 2025, pages 26517–26525, 2025

work page 2025
[31]

Assessing the reliability and validity of gpt-4 in annotating emotion appraisal ratings

Deniss Ruder, Andero Uusberg, and Kairit Sirts. Assessing the reliability and validity of gpt-4 in annotating emotion appraisal ratings. InProceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025), pages 1–11, 2025

work page 2025
[32]

Gpt-4 emulates average-human emotional cognition from a third-person perspective

Ala N Tak and Jonathan Gratch. Gpt-4 emulates average-human emotional cognition from a third-person perspective. In2024 12th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 337–345. IEEE, 2024

work page 2024
[33]

Human-like affective cognition in foundation models.arXiv preprint arXiv:2409.11733, 2024

Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C Ong, and Noah D Goodman. Human-like affective cognition in foundation models.arXiv preprint arXiv:2409.11733, 2024

work page arXiv 2024
[34]

When the event happened, how much did it matter to you? Why?

Nutchanon Yongsatianchot, Parisa Ghanad Torshizi, and Stacy Marsella. Investigating large language models’ perception of emotion using appraisal theory. In2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pages 1–8. IEEE, 2023. A Technical Appendices A.1 Related Work Appraisal Theory and Emo...

work page 2023
[35]

If you did not feel any positive / negative emotions, respond only with “None”

work page
[36]

If you experienced any of the above, list all applicable groups exactly as they are written, separated by a semicolon (;). 17

work page
[37]

Do not provide any introductory text, explanation, or punctuation outside of the list. My Answer: A.5 Supplementary Results A.5.1 Supplementary Results for RQ1 Human Evaluation of Appraisal ReasoningTo complement automatic metrics for appraisal reasoning evaluation, we conducted a human evaluation on a sample of 100 scenarios. For each core appraisal dime...

work page arXiv 1910

[1] [1]

Improving language models for emotion analysis: In- sights from cognitive science

Constant Bonard and Gustave Cortal. Improving language models for emotion analysis: In- sights from cognitive science. In Tatsuki Kuribayashi, Giulia Rambelli, Ece Takmaz, Philipp Wicke, and Yohei Oseki, editors,Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 264–277, Bangkok, Thailand, August 2024. Association for C...

work page 2024

[2] [2]

Appraisal theory.Handbook of Cognition and Emotion, pages 637–663, 1999

Klaus R Scherer. Appraisal theory.Handbook of Cognition and Emotion, pages 637–663, 1999

work page 1999

[3] [3]

Dimensional modeling of emotions in text with appraisal theories: Corpus creation, annotation reliability, and prediction.Computa- tional Linguistics, 49(1):1–72, March 2023

Enrica Troiano, Laura Oberländer, and Roman Klinger. Dimensional modeling of emotions in text with appraisal theories: Corpus creation, annotation reliability, and prediction.Computa- tional Linguistics, 49(1):1–72, March 2023

work page 2023

[4] [4]

Liu, He Cao, Renliang Sun, Rui Wang, Yu Li, and Jiaxing Zhang

June M. Liu, He Cao, Renliang Sun, Rui Wang, Yu Li, and Jiaxing Zhang. CAPE: A Chinese dataset for appraisal-based emotional generation in large language models. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 6306–6324, Albuquerque, New Mexico, April 2025. Association for Co...

work page 2025

[5] [5]

Sree Bhattacharyya, Lucas Craig, Tharun Dilliraj, Jia Li, and James Z. Wang. Do machines think emotionally? a cognitive appraisal analysis of large language models. InWomen in Machine Learning Workshop @ NeurIPS 2025, 2026

work page 2025

[6] [6]

Oxford University Press, 2001

Klaus R Scherer, Angela Schorr, and Tom Johnstone.Appraisal processes in emotion: Theory, methods, research. Oxford University Press, 2001

work page 2001

[7] [7]

Ema: A process model of appraisal dynamics.Cognitive Systems Research, 10(1):70–90, 2009

Stacy C Marsella and Jonathan Gratch. Ema: A process model of appraisal dynamics.Cognitive Systems Research, 10(1):70–90, 2009

work page 2009

[8] [8]

Micro-narratives: A scalable method for eliciting stories of people’s lived experience

Amira Skeggs, Ashish Mehta, Valerie Yap, Seray B Ibrahim, Charla Rhodes, James J Gross, Sean A Munson, Predrag Klasnja, Amy Orben, and Petr Slovak. Micro-narratives: A scalable method for eliciting stories of people’s lived experience. InProceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20, 2025

work page 2025

[9] [9]

Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german.Journal of research in Personality, 41(1):203–212, 2007

Beatrice Rammstedt and Oliver P John. Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german.Journal of research in Personality, 41(1):203–212, 2007

work page 2007

[10] [10]

Gpt-5.2.https://openai.com/index/introducing-gpt-5-2/, 2025

OpenAI. Gpt-5.2.https://openai.com/index/introducing-gpt-5-2/, 2025

work page 2025

[11] [11]

Claude sonnet 4.6

Anthropic. Claude sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6 , 2026. 10

work page 2026

[12] [12]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

Qwen Team. Qwen3.5: Accelerating productivity with native multimodal agents, February 2026

work page 2026

[14] [14]

Beyond empathy: Integrating diagnostic and therapeutic reasoning with large language models for mental health counseling.arXiv preprint arXiv:2505.15715, 2025

He Hu, Yucheng Zhou, Juzheng Si, Qianning Wang, Hengheng Zhang, Fuji Ren, Fei Ma, and Laizhong Cui. Beyond empathy: Integrating diagnostic and therapeutic reasoning with large language models for mental health counseling.arXiv preprint arXiv:2505.15715, 2025

work page arXiv 2025

[15] [15]

Emollms: A series of emotional large language models and annotation tools for comprehensive affective analysis

Zhiwei Liu, Kailai Yang, Qianqian Xie, Tianlin Zhang, and Sophia Ananiadou. Emollms: A series of emotional large language models and annotation tools for comprehensive affective analysis. InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5487–5496, 2024

work page 2024

[16] [16]

Ong, Maria Liakata, Petr Slovak, and Yulan He

Yuxiang Zhou, Hainiu Xu, Desmond C. Ong, Maria Liakata, Petr Slovak, and Yulan He. Model- ing subjectivity in cognitive appraisal with language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Findings of the Association for Computational Linguistics: EMNLP 2025, pages 13811–13833, Suzhou, China, November

work page 2025

[17] [17]

Association for Computational Linguistics

work page

[18] [18]

Patterns of cognitive appraisal in emotion.Journal of personality and social psychology, 48(4):813, 1985

Craig A Smith and Phoebe C Ellsworth. Patterns of cognitive appraisal in emotion.Journal of personality and social psychology, 48(4):813, 1985

work page 1985

[19] [19]

Appraisal determinants of discrete emotions.Cognition & Emotion, 5(3):161– 200, 1991

Ira J Roseman. Appraisal determinants of discrete emotions.Cognition & Emotion, 5(3):161– 200, 1991

work page 1991

[20] [20]

An appraisal theoretic approach to modelling affect flow in conversation corpora

Alok Debnath, Yvette Graham, and Owen Conlan. An appraisal theoretic approach to modelling affect flow in conversation corpora. InProceedings of the 29th Conference on Computational Natural Language Learning, pages 233–250, 2025

work page 2025

[21] [21]

Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch

Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, and Jonathan Gratch. Mechanistic interpretability of emotion inference in large language models. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Findings of the Association for Computational Linguistics: ACL 2025, pages 13090–13120, Vienna, Aus...

work page 2025

[22] [22]

Emotion understanding as third-person appraisals: Integrating appraisal theories with developmental theories of emotion.Psychological Review, 132(1):130, 2025

Tiffany Doan, Desmond C Ong, and Yang Wu. Emotion understanding as third-person appraisals: Integrating appraisal theories with developmental theories of emotion.Psychological Review, 132(1):130, 2025

work page 2025

[23] [23]

An argument for basic emotions.Cognition & emotion, 6(3-4):169–200, 1992

Paul Ekman. An argument for basic emotions.Cognition & emotion, 6(3-4):169–200, 1992

work page 1992

[24] [24]

A circumplex model of affect.Journal of personality and social psychology, 39(6):1161, 1980

James A Russell. A circumplex model of affect.Journal of personality and social psychology, 39(6):1161, 1980

work page 1980

[25] [25]

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

Raziyeh Zall, Alireza Kheyrkhah, Erik Cambria, Zahra Naseri, and M Reza Kangavari. Intelli- gent agents with emotional intelligence: Current trends, challenges, and future prospects.arXiv preprint arXiv:2511.20657, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Unpacking reappraisal: different appraisal shifts underlie reappraisal effects on valence and activation

Maria Krajuškina, Annikki Remmelgas, Helen Uusberg, and Andero Uusberg. Unpacking reappraisal: different appraisal shifts underlie reappraisal effects on valence and activation. Cognition and Emotion, pages 1–8, 2025

work page 2025

[27] [27]

The dynamic architecture of emotion: Evidence for the component process model.Cognition and emotion, 23(7):1307–1351, 2009

Klaus R Scherer. The dynamic architecture of emotion: Evidence for the component process model.Cognition and emotion, 23(7):1307–1351, 2009

work page 2009

[28] [28]

Evaluating subjective cognitive appraisals of emotions from large language models

Hongli Zhan, Desmond Ong, and Junyi Jessy Li. Evaluating subjective cognitive appraisals of emotions from large language models. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 14418–14446, 2023. 11

work page 2023

[29] [29]

The PEACE-reviews dataset: Modeling cognitive appraisals in emotion text analysis

Gerard Yeo and Kokil Jaidka. The PEACE-reviews dataset: Modeling cognitive appraisals in emotion text analysis. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2822–2840, Singapore, December 2023. Association for Computational Linguistics

work page 2023

[30] [30]

Beyond context to cognitive appraisal: Emotion reasoning as a theory of mind benchmark for large language models

Gerard Christopher Yeo and Kokil Jaidka. Beyond context to cognitive appraisal: Emotion reasoning as a theory of mind benchmark for large language models. InFindings of the Association for Computational Linguistics: ACL 2025, pages 26517–26525, 2025

work page 2025

[31] [31]

Assessing the reliability and validity of gpt-4 in annotating emotion appraisal ratings

Deniss Ruder, Andero Uusberg, and Kairit Sirts. Assessing the reliability and validity of gpt-4 in annotating emotion appraisal ratings. InProceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025), pages 1–11, 2025

work page 2025

[32] [32]

Gpt-4 emulates average-human emotional cognition from a third-person perspective

Ala N Tak and Jonathan Gratch. Gpt-4 emulates average-human emotional cognition from a third-person perspective. In2024 12th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 337–345. IEEE, 2024

work page 2024

[33] [33]

Human-like affective cognition in foundation models.arXiv preprint arXiv:2409.11733, 2024

Kanishk Gandhi, Zoe Lynch, Jan-Philipp Fränken, Kayla Patterson, Sharon Wambu, Tobias Gerstenberg, Desmond C Ong, and Noah D Goodman. Human-like affective cognition in foundation models.arXiv preprint arXiv:2409.11733, 2024

work page arXiv 2024

[34] [34]

When the event happened, how much did it matter to you? Why?

Nutchanon Yongsatianchot, Parisa Ghanad Torshizi, and Stacy Marsella. Investigating large language models’ perception of emotion using appraisal theory. In2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pages 1–8. IEEE, 2023. A Technical Appendices A.1 Related Work Appraisal Theory and Emo...

work page 2023

[35] [35]

If you did not feel any positive / negative emotions, respond only with “None”

work page

[36] [36]

If you experienced any of the above, list all applicable groups exactly as they are written, separated by a semicolon (;). 17

work page

[37] [37]

Do not provide any introductory text, explanation, or punctuation outside of the list. My Answer: A.5 Supplementary Results A.5.1 Supplementary Results for RQ1 Human Evaluation of Appraisal ReasoningTo complement automatic metrics for appraisal reasoning evaluation, we conducted a human evaluation on a sample of 100 scenarios. For each core appraisal dime...

work page arXiv 1910