arxiv: 2509.10746 · v3 · submitted 2025-09-12 · 💻 cs.CL

RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems

Pith reviewed 2026-05-18 16:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords emotion alignment · medical dialogue systems · cognitive appraisal theory · inference-time prompting · large language models · transparent reasoning · clinical trust

The pith

RECAP decomposes medical patient inputs into explicit cognitive appraisal stages to align AI responses with human emotional judgments at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RECAP, a sequence of prompting steps that breaks patient messages into stages based on cognitive appraisal theory. The resulting responses better match how humans rate emotional support and appropriateness in medical conversations. The gains are largest for smaller models, and the intermediate outputs expose a consistent pattern in which models downplay relational elements such as social support. When oncology fellows reviewed outputs without knowing their source, RECAP versions achieved 76-88 percent win rates over standard ones, while the step-by-step reasoning stayed visible for review.

Core claim

RECAP applies a Reflect-Extract-Calibrate-Align-Produce pipeline grounded in cognitive appraisal theory to decompose patient input into auditable stages without retraining, yielding higher alignment with human emotional judgments across model sizes from 8B to 120B parameters, with larger relative gains for smaller models, and securing 76-88 percent win rates in blinded expert evaluations by oncology fellows.

What carries the argument

The RECAP pipeline that sequences prompting stages drawn from cognitive appraisal theory to render emotional reasoning explicit and inspectable before response generation.
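
As a rough illustration of what such a staged pipeline could look like in code, the sketch below chains the five named stages and keeps every intermediate output in an inspectable trace. Only the stage names come from the paper. The one-line stage instructions, the `run_recap` function, the `generate` callable, and the `echo_model` stand-in are assumptions made for this sketch, not the authors' prompt templates or API.

```python
from typing import Callable, Dict

# Stage names as given in the paper's acronym.
STAGES = ["Reflect", "Extract", "Calibrate", "Align", "Produce"]

# Hypothetical instructions: placeholders only, NOT the paper's prompts.
STAGE_INSTRUCTIONS: Dict[str, str] = {
    "Reflect": "Restate what the patient appears to be going through.",
    "Extract": "List the appraisal factors relevant to this message.",
    "Calibrate": "Rate each factor on a Likert scale.",
    "Align": "Decide which emotional stance the reply should take.",
    "Produce": "Write the final reply to the patient.",
}


def run_recap(patient_message: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run the stages in order and return every stage output as a trace."""
    trace: Dict[str, str] = {}
    for stage in STAGES:
        # Earlier stage outputs are fed forward so each step can build on them.
        history = "\n".join(f"[{name}] {text}" for name, text in trace.items())
        prompt = (
            f"Patient message:\n{patient_message}\n\n"
            f"Previous stages:\n{history or '(none)'}\n\n"
            f"Stage: {stage}\n"
            f"Instruction: {STAGE_INSTRUCTIONS[stage]}"
        )
        trace[stage] = generate(prompt)
    return trace


def echo_model(prompt: str) -> str:
    """Stand-in for a language model call: returns a tag naming the stage."""
    stage = prompt.rsplit("Stage: ", 1)[1].splitlines()[0]
    return f"<{stage} output>"


if __name__ == "__main__":
    trace = run_recap("I'm scared about my scan results.", echo_model)
    for name, text in trace.items():
        print(f"{name}: {text}")
```

The final reply is `trace["Produce"]`; the other four entries are the auditable intermediates. Passing the model as a plain callable keeps the sketch model-agnostic, in line with the inference-time framing.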

If this is right

  • Smaller language models can reach emotional alignment levels closer to those of much larger models in medical dialogue tasks.
  • Clinicians gain an auditable trace of how the model reached its emotional stance before using the output.
  • Gaps in model attention to relational factors such as social support become visible and correctable at inference time.
  • Medical dialogue systems can be upgraded for greater clinical trust without additional training or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar staged prompting might raise emotional appropriateness in other high-stakes dialogue domains such as customer support or legal intake.
  • The documented underweighting of social support points to a broader tendency in current models to favor individual over relational context that could be measured across tasks.
  • Extending RECAP to non-English or culturally varied patient populations would test whether appraisal stages transfer or require adaptation.
  • Pairing the framework with other inference-time methods could produce additive improvements in transparency and alignment.

Load-bearing premise

That breaking patient input into appraisal-theoretic stages via prompting produces genuinely improved emotional alignment without adding new inaccuracies or biases that standard prompting would not introduce.

What would settle it

A blinded evaluation with more oncology fellows in which RECAP responses receive equal or lower ratings than baseline outputs, or where the intermediate appraisal stages show no correlation with final response quality.
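
The second check could be run as a rank correlation between a per-scenario score for the intermediate appraisal stages and the rating of the final response. The sketch below implements Spearman's rho in plain Python under that assumption. The function names and the idea of a single per-scenario intermediate score are hypothetical, and the paper does not state that it uses this statistic.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input has zero variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

A rho near zero across scenarios would be evidence for the "no correlation" outcome described above; a significance test or confidence interval would still be needed before drawing conclusions.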

Figures

Figures reproduced from arXiv: 2509.10746 by Adarsh Srinivasan, Ben Zhou, Irbaz B. Riaz, Jacob Dineen, Muhammad Umar Afzal, Muhammad Uzair Sarfraz.

Figure 1: Patient input (left) is transformed into appraisal-theoretic intermediates with per-dimension Likert ratings (center), ...
Figure 2: RECAP Pipeline for Emotional Alignment. Model-agnostic inference-time prompting that externalizes emotional ...
Figure 3: Representative synthetic patient scenarios. (a) Single-turn evaluation assesses individual response quality. (b) Multi ...
Figure 4: Human evaluation results. (a,b) Mean ratings with standard error bars (1–5 scale). (c) Scenario-level win rates showing ...
Figure 5: Qualitative analysis. (a) RECAP achieved 44% high-rated scenarios with zero low-rated, vs. 8–12% high and 4–12% low ...
Figure 6: LLM-as-Judge ratings vs. human annotators. All ...
Figure 7: Single-turn annotation interface
Figure 8: Multi-turn annotation interface

Original abstract

Large language models in healthcare often produce emotionally flat or opaque responses, failing to provide the transparent reasoning required for clinical trust. We present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time framework grounded in cognitive appraisal theory that decomposes patient input into auditable, appraisal-theoretic stages without retraining. Across multiple benchmarks and models from 8B to 120B parameters, RECAP improves alignment with human judgments, with gains inversely proportional to model scale. Intermediate outputs further reveal that models systematically underweight relational factors such as social support. In blinded evaluations, oncology fellows rated RECAP responses significantly higher than baselines with 76-88% win rates, demonstrating that principled prompting can enhance medical AI's emotional intelligence while maintaining the transparency required for clinical deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RECAP, an inference-time framework for transparent emotion alignment in medical dialogue systems. Grounded in cognitive appraisal theory, it decomposes patient input into five explicit stages (Reflect-Extract-Calibrate-Align-Produce) via prompting, without model retraining. Across models from 8B to 120B parameters, the authors report improved alignment with human judgments (with gains inversely proportional to model scale), that models systematically underweight relational factors such as social support, and that oncology fellows in blinded evaluations prefer RECAP outputs over baselines with 76-88% win rates.

Significance. If the empirical results hold under rigorous controls, the work offers a practical, auditable method to improve emotional intelligence and clinical trust in medical LLMs. The inverse-scaling pattern and the identification of underweighted relational factors provide potentially useful diagnostic insights into current model limitations. The emphasis on transparency without retraining is well-aligned with deployment constraints in healthcare.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The abstract reports quantitative gains and win rates but supplies no details on benchmark construction, statistical tests, inter-rater reliability, or controls for prompt sensitivity; without these, it is not possible to confirm that the data support the stated improvements.
  2. [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The central claim requires that the specific Reflect-Extract-Calibrate-Align-Produce stages derived from cognitive appraisal theory produce measurably better human alignment than standard prompting. No ablations are presented against control prompts that use equivalent multi-stage structure without invoking appraisal constructs; if gains persist under such controls, the theory-specific decomposition is not load-bearing for the reported improvements or the revelation of underweighted relational factors.
minor comments (2)
  1. [§3] The manuscript would benefit from a concise table summarizing the exact prompt templates used for each RECAP stage to support reproducibility.
  2. [§4] Ensure all reported win rates are accompanied by exact sample sizes, number of evaluators, and any statistical significance tests in the main text or appendix.
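
The reporting requested in minor comment 2 could be sketched as follows: a win rate with its sample size, a Wilson score interval, and an exact two-sided sign test against a 50 percent null. The function names are hypothetical and the counts in the usage lines are illustrative placeholders, not figures from the paper. The sketch also assumes ties are excluded before counting and that comparisons are independent, which may not hold when the same evaluator rates several scenarios.

```python
import math
from typing import Tuple


def win_rate(wins: int, n: int) -> float:
    """Fraction of pairwise comparisons won; n excludes ties."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("require n > 0 and 0 <= wins <= n")
    return wins / n


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = win_rate(wins, n)
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def sign_test_p_value(wins: int, n: int) -> float:
    """Exact two-sided binomial test of H0: win probability = 0.5."""
    win_rate(wins, n)  # reuse the argument validation
    total = 2 ** n
    lower_tail = sum(math.comb(n, k) for k in range(0, wins + 1)) / total
    upper_tail = sum(math.comb(n, k) for k in range(wins, n + 1)) / total
    return min(1.0, 2 * min(lower_tail, upper_tail))


if __name__ == "__main__":
    wins, n = 30, 40  # illustrative counts only, not taken from the paper
    low, high = wilson_interval(wins, n)
    print(f"win rate {win_rate(wins, n):.2f} (n={n}), "
          f"95% CI [{low:.2f}, {high:.2f}], p={sign_test_p_value(wins, n):.4f}")
```

Reporting the interval alongside the point estimate makes it clear how much a win rate from a small expert panel can move with a few more comparisons.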

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the emphasis on methodological transparency and the specificity of our theoretical contribution. We address each major comment below and commit to revisions that directly strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The abstract reports quantitative gains and win rates but supplies no details on benchmark construction, statistical tests, inter-rater reliability, or controls for prompt sensitivity; without these, it is not possible to confirm that the data support the stated improvements.

    Authors: We agree that the abstract, as a concise summary, omits these methodological specifics. Benchmark construction (including the medical dialogue datasets and human judgment protocols), statistical significance testing, inter-rater reliability calculations, and prompt-variation controls are detailed in §4. To improve accessibility, we will revise the abstract to mention these elements briefly (e.g., reporting inter-rater agreement and the use of multiple prompt templates for sensitivity analysis) while staying within its length limit. This change will make the reported gains and win rates easier to evaluate.

    Revision: yes

  2. Referee: [§3 and §4.2] §3 (Method) and §4.2 (Ablations): The central claim requires that the specific Reflect-Extract-Calibrate-Align-Produce stages derived from cognitive appraisal theory produce measurably better human alignment than standard prompting. No ablations are presented against control prompts that use equivalent multi-stage structure without invoking appraisal constructs; if gains persist under such controls, the theory-specific decomposition is not load-bearing for the reported improvements or the revelation of underweighted relational factors.

    Authors: This observation correctly identifies a gap in isolating the contribution of the appraisal-theoretic framing. Our existing ablations in §4.2 isolate the effect of individual stages within the RECAP pipeline and show that each contributes to alignment and to surfacing underweighted relational factors. However, we did not include a control condition consisting of a structurally identical multi-stage prompt that omits cognitive appraisal constructs. In the revision we will add this control ablation, which directly tests whether the theory-derived stage definitions are necessary for the observed gains and diagnostic insights. We expect the results to support the load-bearing role of the appraisal grounding, but we will report the outcomes transparently regardless.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity: RECAP stages drawn from external theory with independent human/expert evaluations

Full rationale

The paper grounds RECAP in cognitive appraisal theory as an external source and evaluates outputs via separate human judgment alignment benchmarks plus blinded oncology fellow ratings (76-88% win rates). Nothing shown in the paper (no equation, fitted parameter, or self-citation) reduces the claimed gains or the intermediate-stage findings to quantities defined by the evaluation itself. The multi-stage prompting is presented as a transparent decomposition rather than a fit or renaming of results, and the inverse scaling observation is reported as an empirical finding rather than a constructed prediction. The argument therefore rests on external benchmarks rather than on its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of cognitive appraisal theory to LLM prompting for medical emotion alignment; no free parameters or invented entities are visible in the abstract.

axioms (1)
  • Domain assumption: Cognitive appraisal theory supplies a valid and sufficient decomposition of patient emotional input for improving LLM response alignment in medical contexts.
    Basis: the framework is explicitly grounded in this theory per the abstract.

pith-pipeline@v0.9.0 · 5676 in / 1293 out tokens · 44293 ms · 2026-05-18T16:48:06.845110+00:00 · methodology

9 Appendix

9.1 RECAP Likert-Scale Probability Mapping

  Likert value     Probability
  very-unlikely    0.05
  unlikely         0.25
  neutral          0.50
  likely           0.75
  very-likely      0.95

Table 4: Likert-scale t...

Prompt excerpt (factor listing format)

  Factor name: Description (value1/value2)
  Do not include any explanations after the factors. Do not start with END_OF_FACTORS.
  After you list the factors, output a single line exactly: END_OF_FACTORS
  Never output END_OF_FACTORS before the list. Only place it after the final factor line.
  Example:
  Self-efficacy: Person’s belief in their ability to handle challenges (low/high)
  Social support: Availability of emotional support from others (absent/present)
  Stress level: Amount of psychological pressure experienced (low/high)

Factor Value Selection Prompt

  Task: Analyze this situation and determine the specific value for each psychological factor:
  SITUATION: {situation}
  PSYCHOLOGICAL FACTORS: {factors_text}
  For each factor, choose the most appropriate value based on what you can observe in the situation. Provid...

Other extracted fragments

  • Sarah found out that her younger brother is being bullied at school but he begged her not to tell their parents
  • choice_letter
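
The Likert mapping and the factor-list format quoted above are concrete enough to sketch a parser. The example below is a minimal sketch, assuming a model reply follows the "Factor name: Description (value1/value2)" line format and ends with the END_OF_FACTORS sentinel. The five probability values are the ones listed in the mapping; the function names, the regular expression, and the label normalization are assumptions for illustration and are not taken from the authors' code.

```python
import re
from typing import Dict, List

# Values as listed in the Likert-scale probability mapping above.
LIKERT_TO_PROB: Dict[str, float] = {
    "very-unlikely": 0.05,
    "unlikely": 0.25,
    "neutral": 0.50,
    "likely": 0.75,
    "very-likely": 0.95,
}

# Matches "Factor name: Description (value1/value2)".
FACTOR_LINE = re.compile(
    r"^(?P<name>[^:]+):\s*(?P<desc>.+?)\s*\((?P<a>[^/()]+)/(?P<b>[^/()]+)\)$"
)


def likert_probability(label: str) -> float:
    """Map a Likert label to its probability, tolerating case and spaces."""
    key = label.strip().lower().replace(" ", "-")
    if key not in LIKERT_TO_PROB:
        raise ValueError(f"unknown Likert label: {label!r}")
    return LIKERT_TO_PROB[key]


def parse_factors(raw: str) -> List[Dict[str, object]]:
    """Parse factor lines that appear before the END_OF_FACTORS sentinel."""
    factors: List[Dict[str, object]] = []
    saw_sentinel = False
    for line in raw.splitlines():
        stripped = line.strip()
        if stripped == "END_OF_FACTORS":
            saw_sentinel = True
            break  # anything after the sentinel is ignored
        match = FACTOR_LINE.match(stripped)
        if match:
            factors.append({
                "name": match.group("name").strip(),
                "description": match.group("desc"),
                "values": (match.group("a").strip(), match.group("b").strip()),
            })
    # Reject replies that break the format: sentinel missing or placed first.
    if not saw_sentinel or not factors:
        raise ValueError("expected at least one factor followed by END_OF_FACTORS")
    return factors
```

Failing loudly when the sentinel is missing or comes before any factor mirrors the prompt's own constraints and makes truncated replies easy to spot.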