RECAP: Transparent Inference-Time Emotion Alignment for Medical Dialogue Systems

Adarsh Srinivasan; Ben Zhou; Irbaz B. Riaz; Jacob Dineen; Muhammad Umar Afzal; Muhammad Uzair Sarfraz

arxiv: 2509.10746 · v3 · submitted 2025-09-12 · 💻 cs.CL

Adarsh Srinivasan, Jacob Dineen, Muhammad Umar Afzal, Muhammad Uzair Sarfraz, Irbaz B. Riaz, Ben Zhou

Pith reviewed 2026-05-18 16:48 UTC · model grok-4.3

classification 💻 cs.CL

keywords emotion alignment · medical dialogue systems · cognitive appraisal theory · inference-time prompting · large language models · transparent reasoning · clinical trust

The pith

RECAP decomposes medical patient inputs into explicit cognitive appraisal stages to align AI responses with human emotional judgments at inference time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RECAP, a sequence of prompting steps that breaks patient messages into stages based on cognitive appraisal theory. This produces responses that better match how humans would rate emotional support and appropriateness in medical conversations. The gains appear largest for smaller models and expose a consistent pattern where models downplay relational elements such as social support. When oncology fellows reviewed outputs without knowing their source, they chose RECAP versions over standard ones in the large majority of cases while the step-by-step reasoning stayed visible for review.

Core claim

RECAP applies a Reflect-Extract-Calibrate-Align-Produce pipeline grounded in cognitive appraisal theory to decompose patient input into auditable stages without retraining, yielding higher alignment with human emotional judgments across model sizes from 8B to 120B parameters, with larger relative gains for smaller models, and securing 76-88 percent win rates in blinded expert evaluations by oncology fellows.

What carries the argument

The RECAP pipeline that sequences prompting stages drawn from cognitive appraisal theory to render emotional reasoning explicit and inspectable before response generation.
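The staged structure described above can be illustrated with a minimal sketch. Only the five stage names come from the paper; the instruction wording, the `run_recap` helper, and the `generate` callable (a stand-in for any text-generation backend) are hypothetical and should not be read as the authors' prompts.

```python
from typing import Callable, Dict, List, Tuple

# Stage names follow the paper's acronym. The instruction text is an
# illustrative assumption, not the authors' actual prompt templates.
STAGES: List[Tuple[str, str]] = [
    ("reflect", "Restate the patient's situation in neutral, abstract terms."),
    ("extract", "List the psychological and relational factors at play."),
    ("calibrate", "Propose candidate emotions and rate each on a Likert scale."),
    ("align", "Decide what emotional stance the reply should take."),
    ("produce", "Write the final reply to the patient."),
]


def run_recap(patient_input: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Run each stage in order, feeding earlier outputs into later prompts.

    Every intermediate output is returned so the trace can be audited.
    """
    trace: Dict[str, str] = {}
    for name, instruction in STAGES:
        # Earlier stage outputs become context for the current stage.
        context = "\n".join(f"[{k}] {v}" for k, v in trace.items())
        prompt = f"Patient: {patient_input}\n{context}\nTask ({name}): {instruction}"
        trace[name] = generate(prompt)
    return trace


if __name__ == "__main__":
    # Stub backend so the sketch runs without a model; it echoes the task line.
    def echo(prompt: str) -> str:
        return prompt.splitlines()[-1]

    for stage, output in run_recap("I'm scared about my scan results.", echo).items():
        print(stage, "->", output)
```

Returning the whole trace, rather than only the final reply, is what would let a clinician inspect each intermediate step before using the output.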

If this is right

Smaller language models can reach emotional alignment levels closer to those of much larger models in medical dialogue tasks.
Clinicians gain an auditable trace of how the model reached its emotional stance before using the output.
Gaps in model attention to relational factors such as social support become visible and correctable at inference time.
Medical dialogue systems can be upgraded for greater clinical trust without additional training or fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar staged prompting might raise emotional appropriateness in other high-stakes dialogue domains such as customer support or legal intake.
The documented underweighting of social support points to a broader tendency in current models to favor individual over relational context that could be measured across tasks.
Extending RECAP to non-English or culturally varied patient populations would test whether appraisal stages transfer or require adaptation.
Pairing the framework with other inference-time methods could produce additive improvements in transparency and alignment.

Load-bearing premise

That breaking patient input into appraisal-theoretic stages via prompting produces genuinely improved emotional alignment without adding new inaccuracies or biases that standard prompting would not introduce.

What would settle it

A blinded evaluation with more oncology fellows in which RECAP responses receive equal or lower ratings than baseline outputs, or where the intermediate appraisal stages show no correlation with final response quality.

Figures

Figures reproduced from arXiv: 2509.10746 by Adarsh Srinivasan, Ben Zhou, Irbaz B. Riaz, Jacob Dineen, Muhammad Umar Afzal, Muhammad Uzair Sarfraz.

**Figure 1.** Patient input (left) is transformed into appraisal-theoretic intermediates with per-dimension Likert ratings (center), ...

**Figure 2.** RECAP Pipeline for Emotional Alignment. Model-agnostic inference-time prompting that externalizes emotional ...

**Figure 3.** Representative synthetic patient scenarios. (a) Single-turn evaluation assesses individual response quality. (b) Multi ...

**Figure 4.** Human evaluation results. (a,b) Mean ratings with standard error bars (1–5 scale). (c) Scenario-level win rates showing ...

**Figure 5.** Qualitative analysis. (a) RECAP achieved 44% high-rated scenarios with zero low-rated, vs. 8–12% high and 4–12% low ...

**Figure 6.** LLM-as-Judge ratings vs. human annotators. All ...

**Figure 7.** Single-turn annotation interface

**Figure 8.** Multi-turn annotation interface

Original abstract

Large language models in healthcare often produce emotionally flat or opaque responses, failing to provide the transparent reasoning required for clinical trust. We present RECAP (Reflect-Extract-Calibrate-Align-Produce), an inference-time framework grounded in cognitive appraisal theory that decomposes patient input into auditable, appraisal-theoretic stages without retraining. Across multiple benchmarks and models from 8B to 120B parameters, RECAP improves alignment with human judgments, with gains inversely proportional to model scale. Intermediate outputs further reveal that models systematically underweight relational factors such as social support. In blinded evaluations, oncology fellows rated RECAP responses significantly higher than baselines with 76-88% win rates, demonstrating that principled prompting can enhance medical AI's emotional intelligence while maintaining the transparency required for clinical deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

RECAP gives a structured inference-time prompting method for emotional alignment in medical LLMs, with reported gains that are larger on smaller models and strong preference from oncology fellows, but the specific role of cognitive appraisal theory still needs checking against generic structured prompts.

RECAP breaks patient inputs into five stages—Reflect, Extract, Calibrate, Align, Produce—drawn from cognitive appraisal theory. The goal is transparent emotional reasoning in medical dialogues without any retraining. The paper reports that this approach improves alignment with human judgments, with the biggest lifts on the smaller models in their 8B-to-120B range. Intermediate outputs also flag that models tend to downplay relational factors such as social support. Blinded ratings by oncology fellows show 76-88% win rates over baselines, which is the most concrete signal in the abstract.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RECAP, an inference-time framework for transparent emotion alignment in medical dialogue systems. Grounded in cognitive appraisal theory, it decomposes patient input into five explicit stages (Reflect-Extract-Calibrate-Align-Produce) via prompting, without model retraining. Across models from 8B to 120B parameters, the authors report three findings: alignment with human judgments improves, with gains inversely proportional to model scale; models systematically underweight relational factors such as social support; and oncology fellows in blinded evaluations prefer RECAP outputs over baselines with 76-88% win rates.

Significance. If the empirical results hold under rigorous controls, the work offers a practical, auditable method to improve emotional intelligence and clinical trust in medical LLMs. The inverse-scaling pattern and the identification of underweighted relational factors provide potentially useful diagnostic insights into current model limitations. The emphasis on transparency without retraining is well-aligned with deployment constraints in healthcare.

major comments (2)

[Abstract and §4 (Experiments)] The abstract reports quantitative gains and win rates but supplies no details on benchmark construction, statistical tests, inter-rater reliability, or controls for prompt sensitivity; without these, it is not possible to confirm that the data support the stated improvements.
[§3 (Method) and §4.2 (Ablations)] The central claim requires that the specific Reflect-Extract-Calibrate-Align-Produce stages derived from cognitive appraisal theory produce measurably better human alignment than standard prompting. No ablations are presented against control prompts that use equivalent multi-stage structure without invoking appraisal constructs; if gains persist under such controls, the theory-specific decomposition is not load-bearing for the reported improvements or the revelation of underweighted relational factors.
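The inter-rater reliability requested in the first comment is often summarized with Cohen's kappa. The sketch below is one common choice, not necessarily the statistic the paper uses; the `cohens_kappa` helper and the ratings in the usage note are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty item set")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa([5, 4, 4, 3, 5, 2], [5, 4, 3, 3, 5, 2])` compares two made-up sets of 1–5 ratings. Plain kappa treats the scale as nominal; a weighted variant would give partial credit for near-misses on an ordinal scale.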

minor comments (2)

[§3] The manuscript would benefit from a concise table summarizing the exact prompt templates used for each RECAP stage to support reproducibility.
[§4] Ensure all reported win rates are accompanied by exact sample sizes, number of evaluators, and any statistical significance tests in the main text or appendix.
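One way to attach a significance test to a win rate is an exact two-sided binomial test against a 50% null. This is a generic sketch, not the paper's analysis; the `binomial_two_sided_p` helper is hypothetical, and any counts passed to it here are illustrative rather than taken from the paper.

```python
from math import comb


def binomial_two_sided_p(wins: int, trials: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial p-value under a null win rate p0.

    Sums the probabilities of all outcomes that are no more likely
    than the observed number of wins.
    """
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")
    pmf = [
        comb(trials, k) * p0**k * (1 - p0) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[wins]
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: 20 preferred responses out of 25 comparisons.
    print(binomial_two_sided_p(20, 25))
```

Reporting the raw counts alongside the p-value lets readers recompute the test, which is the point of the minor comment above.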

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the emphasis on methodological transparency and the specificity of our theoretical contribution. We address each major comment below and commit to revisions that directly strengthen the manuscript without altering its core claims.

Referee: [Abstract and §4 (Experiments)] The abstract reports quantitative gains and win rates but supplies no details on benchmark construction, statistical tests, inter-rater reliability, or controls for prompt sensitivity; without these, it is not possible to confirm that the data support the stated improvements.

Authors: We agree that the abstract, as a concise summary, does not include these methodological specifics. The full details of benchmark construction (including the medical dialogue datasets and human judgment protocols), statistical significance testing, inter-rater reliability calculations, and prompt-variation controls are provided in §4. To improve accessibility, we will revise the abstract to incorporate brief mentions of these elements (e.g., reporting inter-rater agreement and the use of multiple prompt templates for sensitivity analysis) while preserving its length constraints. This change will make the reported gains and win rates more readily evaluable.

revision: yes

Referee: [§3 (Method) and §4.2 (Ablations)] The central claim requires that the specific Reflect-Extract-Calibrate-Align-Produce stages derived from cognitive appraisal theory produce measurably better human alignment than standard prompting. No ablations are presented against control prompts that use equivalent multi-stage structure without invoking appraisal constructs; if gains persist under such controls, the theory-specific decomposition is not load-bearing for the reported improvements or the revelation of underweighted relational factors.

Authors: This observation correctly identifies a gap in isolating the contribution of the appraisal-theoretic framing. Our existing ablations in §4.2 isolate the effect of individual stages within the RECAP pipeline and show that each contributes to alignment and to surfacing underweighted relational factors. However, we did not include a direct control condition consisting of a structurally identical multi-stage prompt that omits cognitive appraisal constructs. In the revision we will add this control ablation, allowing a direct comparison that tests whether the theory-derived stage definitions are necessary for the observed gains and diagnostic insights. We expect the results to support the load-bearing role of the appraisal grounding, but we will report the outcomes transparently regardless.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: RECAP stages drawn from external theory with independent human/expert evaluations

The paper grounds RECAP in cognitive appraisal theory as an external source and evaluates outputs via separate human judgment alignment benchmarks plus blinded oncology fellow ratings (76-88% win rates). No equations, fitted parameters, or self-citations reduce the claimed gains or the intermediate-stage findings to quantities defined by the evaluation itself. The multi-stage prompting is presented as a transparent decomposition rather than a fit or renaming of results, and the inverse scaling observation is reported as an empirical finding rather than a constructed prediction. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the applicability of cognitive appraisal theory to LLM prompting for medical emotion alignment; no free parameters or invented entities are visible in the abstract.

axioms (1)

domain assumption: Cognitive appraisal theory supplies a valid and sufficient decomposition of patient emotional input for improving LLM response alignment in medical contexts.
The framework is explicitly grounded in this theory per the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

RECAP guides the model through a structured reasoning process: first abstracting the situation, then identifying psychological factors... generating candidate emotions, quantifying their likelihood using Likert scales, and finally producing an emotion-aligned response.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

RECAP decomposes patient input into appraisal-theoretic stages... Likert-based emotion likelihoods

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[59]

Sarah found out that her younger brother is being bullied at school but he begged her not to tell their parents

Ben Zhou, Hongming Zhang, Sihao Chen, Dian Yu, Hongwei Wang, Baolin Peng, Dan Roth, and Dong Yu. 2024. Conceptual and Unbiased Reasoning in Language Models. arXiv:2404.00205 [cs.CL] 9 Appendix 9.1RECAPLikert-Scale Probability Mapping Likert value Probability very-unlikely 0.05 unlikely 0.25 neutral 0.50 likely 0.75 very-likely 0.95 Table 4: Likert-scale t...

work page arXiv 2024
[61]

Factor name: Description (value1/value2)

work page
[62]

Do not start with END_OF_FACTORS

Factor name: Description (value1/value2) Do not include any explanations after the factors. Do not start with END_OF_FACTORS. After you list the factors, output a single line exactly: END_OF_FACTORS Never output END_OF_FACTORS before the list. Only place it after the final factor line. Example:

work page
[63]

Self-efficacy: Person’s belief in their ability to handle challenges (low/high)

work page
[64]

Social support: Availability of emotional support from others (absent/present)

work page
[65]

choice_letter

Stress level: Amount of psychological pressure experienced (low/high) Factor Value Selection Prompt Task:Analyze this situation and determine the specific value for each psychological factor: SITUATION:{situation} PSYCHOLOGICAL FACTORS: {factors_text} For each factor, choose the most appropriate value based on what you can observe in the situation. Provid...

work page

[1] Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, Biao Zhang, Ankesh Anand, Zaheer Abbas, Azade Nova, John D. Co-Reyes, Eric Chu, Feryal Behbahani, Aleksandra Faust, and Hugo Larochelle. 2024. Many-Shot In-Context Learning. arXiv:2404.11018 [cs.LG] https://arxiv.org/abs/2404.11018

[2] Saleema Amershi, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Daniel Weld, Mihaela Vorvoreanu, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. Proceedings of the ACM CHI Conference on Human Factors in Computing Systems 2019 (2019), 1–13. doi:10.1145/3290605.3300233

[3] Jonas Armbruster, Florian Bussmann, Catharina Rothhaas, Nadine Titze, Paul A. Grützner, and Holger Freischmidt. 2024. "Doctor ChatGPT, Can You Help Me?" The Patient’s Perspective: Cross-Sectional Study. Journal of Medical Internet Research 26 (2024), e58831. doi:10.2196/58831

[4] Rahul K Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, et al. 2025. Healthbench: Evaluating large language models towards improved human health

[5] John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zidian Zhu, Jon-Patrick Allem Kelley, Zoe Chu, and David M. Smith. 2023. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine 183, 6 (2023), 589–596. doi:10.1001/jamainternmed.2023.1838

[6] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox, Paisan Ruamviboonsuk, and Laura M. Vardoulakis. 2020. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20 (2020), Paper 589, 1–14. doi:...

[7] Anna Bodonhelyi, Christian Stegemann-Philipps, Alessandra Sonanini, Lea Herschbach, Marton Szep, Anne Herrmann-Werner, Teresa Festl-Wietek, Enkelejda Kasneci, and Friederike Holderried. 2025. Modeling Challenging Patient Interactions: LLMs for Medical Communication Training. arXiv:2503.22250 [cs.HC] https://arxiv.org/abs/2503.22250

[8] Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making. Proceedings of the ACM on Human–Computer Interaction 5, CSCW1 (2021), 188:1–188:21. doi:10.1145/3449287

[9] Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human–AI Collaborative Decision-Making. Proceedings of the ACM on Human–Computer Interaction 3, CSCW (2019), Article 104, 1–24. doi:10.1145/3359206

[11] David Chen, Kabir Chauhan, Rod Parsa, Zhihui Amy Liu, Fei-Fei Liu, Ernie Mak, Lawson Eng, Breffni Louise Hannon, Jennifer Croke, Andrew Hope, Nazanin Fallah-Rad, Phillip Wong, Srinivas Raman, et al. 2025. Patient perceptions of empathy in physician and artificial intelligence chatbot responses to patient questions about cancer. npj Digital Medicine 8 (May 2...

[12] David Chen, Rod Parsa, Andrew Hope, Breffni Hannon, Ernie Mak, Lawson Eng, Fei-Fei Liu, Nazanin Fallah-Rad, Ann M. Heesters, and Srinivas Raman. 2024. Physician and Artificial Intelligence Chatbot Responses to Cancer Questions From Social Media. JAMA Oncology 10, 7 (07 2024), 956–960. doi:10.1001/jamaoncol.2024.0836

[13] Yirong Chen, Xiaofen Xing, Jingkai Lin, Huimin Zheng, Zhenyu Wang, Qi Liu, and Xiangmin Xu. 2023. SoulChat: Improving LLMs’ empathy, listening, and comfort abilities through fine-tuning with multi-turn empathy conversations

[14] Maximilian Croissant, Madeleine Frister, Guy Schofield, and Cade McCall. 2024. An Appraisal-based Chain-of-Emotion Architecture for Affective Language Model Game Agents. PLOS ONE 19, 5 (2024), e0301033. doi:10.1371/journal.pone.0301033

[15] Jacob Dineen, Don Kridel, Daniel Dolk, and David Castillo. 2024. Unified Explanations in Machine Learning Models: A Perturbation Approach. arXiv:2405.20200 [cs.LG]

[16] Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, et al. 2025. QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA. arXiv:2506.08123 [cs.CL]

[17] Daniel Dolk, Donald Kridel, Jacob Dineen, and David Castillo. 2020. Model Interpretation and Explainability towards Creating Transparency in Prediction Models. In Proceedings of the 53rd Hawaii International Conference on System Sciences (HICSS). Hawaii International Conference on System Sciences, Maui, HI, 956–965. doi:10.24251/hicss.2020.120

[18] Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth. 2024. Bird: A trustworthy bayesian inference framework for large language models

[19] Yu Feng, Ben Zhou, Haoyu Wang, Helen Jin, and Dan Roth. 2023. Generic Temporal Reasoning with Differential Analysis and Explanation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 12013–12029

[20] Laura Francis and Noelle Robertson. 2023. Healthcare practitioners’ experiences of breaking bad news: A critical interpretative meta synthesis. Patient Education and Counseling 107 (2023), 107574. doi:10.1016/j.pec.2022.107574

[21] Ludwig Franke Føyen, Emma Zapel, Mats Lekander, Erik Hedman-Lagerlöf, and Elin Lindsäter. 2025. Artificial intelligence vs. human expert: Licensed mental health clinicians’ blinded evaluation of AI-generated and expert psychological advice on quality, empathy, and perceived authorship. Internet Interventions 41 (2025), 100841. doi:10.1016/j.invent.2025.100841

[22] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models

[23] Alexa Hepburn and Jonathan Potter. 2023. Understanding mixed emotions in organized helping through emotionography. Frontiers in Psychology 14 (Oct. 2023), 1236148. doi:10.3389/fpsyg.2023.1236148

[24] Tiancheng Hu and Nigel Collier. 2024. Quantifying the Persona Effect in LLM Simulations. arXiv:2402.10811 [cs.CL]

[25] Jen-tse Huang, Man Ho Lam, Eric John Li, Shujie Ren, Wenxuan Wang, Wenxiang Jiao, Zhaopeng Tu, and Michael R. Lyu. 2024. Apathetic or Empathetic? Evaluating LLMs’ Emotional Alignments with Humans. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran...

[26] Clayton J. Hutto and Eric Gilbert. 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International Conference on Weblogs and Social Media (ICWSM ’14). AAAI Press, Ann Arbor, MI, 216–225

[27] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Nicholas Schiefer, Eric Clark, Guy Amir, Kamal Ndousse, Tom B. Brown, Steven Larson, Roger Grosse, Jared Kaplan, Natasha McAleese, David Hernandez, Micah Carroll, Deep Ganguli, Jan Leike, Catherine Olsson, David Krueger, Evan Hubinger, Collin Burns, Samuel Bowman, Jacob Hilton, Long Ouyang, Yu...

[28] Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664 [cs.CL]

[29] Richard S. Lazarus. 1991. Emotion and Adaptation. Oxford University Press, New York, NY

[30] Keyeun Lee, Seolhee Lee, Esther Hehsun Kim, Yena Ko, Jinsu Eun, Dahee Kim, Hyewon Cho, Haiyi Zhu, Robert E. Kraut, Eunyoung Suh, Eun mee Kim, and Hajin Lim. 2025. Adaptive-VP: A Framework for LLM-Based Virtual Patients that Adapts to Trainees’ Dialogue to Facilitate Nurse Communication Training. arXiv:2506.00386 [cs.CL] https://arxiv.org/abs/2506.00386

[31] Yoon Kyung Lee, Inju Lee, Minjung Shin, Seoyeon Bae, and Sowon Hahn. 2024. Chain of Empathy: Enhancing Empathetic Response of Large Language Models Based on Psychotherapy Models. Korean Journal of Cognitive Science 35, 1 (2024), 23–48. Also available as arXiv:2311.04915

[32] Man Luo, Christopher J. Warren, Lu Cheng, Haidar M. Abdul-Muhsin, and Imon Banerjee. 2024. Assessing Empathy in Large Language Models with Real-World Physician-Patient Interactions. In IEEE International Conference on Big Data (BigData 2024). IEEE, Washington, DC, USA, 6510–6519. doi:10.1109/BigData62323.2024.10825307

[33] Pedro Henrique Luz de Araujo and Benjamin Roth. 2025. Helpful assistant or fruitful facilitator? Investigating how personas affect language model behavior. PLOS ONE 20, 6 (2025), e0325664. doi:10.1371/journal.pone.0325664

[34] OpenAI. 2025. gpt-oss-120b and gpt-oss-20b Model Card. arXiv:2508.10925 [cs.CL] https://arxiv.org/abs/2508.10925

[35] Andrew Ortony, Gerald L. Clore, and Allan Collins. 1988. The Cognitive Structure of Emotions. Cambridge University Press, Cambridge, United Kingdom

[36] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Fee...

[37] Samuel J. Paech. 2023. EQ-Bench: An Emotional Intelligence Benchmark for Large Language Models. arXiv:2312.06281 [cs.CL]

[38] Rafael Rafailov, Pratyusha Ramesh, Avnish Narayan Das, Sishir Syed, Sam Basu, Yao Li, James Zou, Qi Yang, Yuntao Bai, Pieter Abbeel, and Kiril Rafailov. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems. Curran Associates, Inc., New Orleans, LA, USA, 53416–53432

[39] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards Empathetic Open-domain Conversation Models: A New Benchmark and Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 5370–5381. doi:10.18653/v1/P19-1534

[40] Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, and Ben Zhou. 2025. ThinkTuning: Instilling Cognitive Reflections without Distillation. arXiv:2508.07616 [cs.CL]

[41] Sahand Sabour, Siyang Liu, Zheyuan Zhang, June Liu, Jinfeng Zhou, Alvionna Sunaryo, Tatia Lee, Rada Mihalcea, and Minlie Huang. 2024. EmoBench: Evaluating the Emotional Intelligence of Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and V...

[42] Klaus R. Scherer and Agnes Moors. 2019. The Emotion Process: Event Appraisal and Component Differentiation. Annual Review of Psychology 70 (2019), 719–745. doi:10.1146/annurev-psych-122216-011854

[43] Ming Shen, Zhikun Xu, Xiao Ye, Jacob Dineen, and Ben Zhou. 2025. BOW: Bottlenecked Next Word Exploration. arXiv:2506.13502 [cs.CL]

[44] Vlad Sorin, Benjamin Sheffer, Nada Meirow, et al. 2024. Large Language Models and Empathy: Systematic Review. Journal of Medical Internet Research 26 (2024), e55610. doi:10.2196/55610

[45] John Sweller. 2023. The Development of Cognitive Load Theory: Replication Crises and Incorporation of Other Theories Can Lead to Theory Expansion. Educational Psychology Review 35 (Sept. 2023), 95. doi:10.1007/s10648-023-09817-2

[46] Oyvind Tafjord, Bhavana Dalvi Mishra, and Peter Clark. 2022. Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2078–2093

[47] Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Jiayin Zhi, Shaun M. Eack, Travis Labrum, Samuel M. Murphy, Nev Jones, Kate Hardy, Hong Shen, Fei Fang, and Zhiyu Zoey Chen. 2024. PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals. arXiv:2405.19660 [cs.CL] https://arxiv.org/abs/2405.19660

[48] Xuena Wang et al. 2023. Emotional Intelligence of Large Language Models. Includes SECEU scenario-based assessments

[49] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

[50] Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant. 2020. Break it down: A question understanding benchmark. Transactions of the Association for Computational Linguistics 8 (2020), 183–198

[51] Zhikun Xu, Ming Shen, Jacob Dineen, Zhaonan Li, Xiao Ye, Shijie Lu, Aswin RRV, Chitta Baral, and Ben Zhou. 2024. Tow: Thoughts of Words Improve Reasoning in Large Language Models. arXiv:2410.16235 [cs.CL]

[52] Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable AI: Fitting Intelligent Decision Support into Critical, Clinical Decision-Making Processes. doi:10.48550/arXiv.1904.09612. Published in the proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI ’19), doi:10.1145/3290605.3300468

[53] Xiao Ye, Jacob Dineen, Zhaonan Li, Zhikun Xu, Weiyu Chen, Shijie Lu, Yuxi Huang, Ming Shen, Phu Tran, Ji-Eun Irene Yum, Muhammad Ali Khan, Muhammad Umar Afzal, Irbaz Bin Riaz, and Ben Zhou. 2025. Evaluating Medical LLMs by Levels of Autonomy: A Survey Moving from Benchmarks to Applications. arXiv:2510.17764 [cs.CL] https://arxiv.org/abs/2510.17764

[54] Xiao Ye, Shaswat Shrivastava, Zhaonan Li, Jacob Dineen, Shijie Lu, Avneet Ahuja, Ming Shen, Zhikun Xu, and Ben Zhou. 2025. CC-LEARN: Cohort-based Consistency Learning. arXiv:2506.15662 [cs.CL]

[55] Jianwen Zeng, Wenhao Qi, Shiying Shen, Xin Liu, Sixie Li, Bing Wang, Chaoqun Dong, Xiaohong Zhu, Yankai Shi, Xiajing Lou, Bingsheng Wang, Jiani Yao, Guowei Jiang, Qiong Zhang, and Shihua Cao. 2025. Embracing the Future of Medical Education With Large Language Model–Based Vi...

[56] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing Dialogue Agents: I Have a Dog, Do You Have Pets Too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Melbourne, Australia, 2204–2213. doi:...

[57] Mingqian Zheng, Jiaxin Pei, Lajanugen Logeswaran, Moontae Lee, and David Jurgens. 2024. When “a helpful assistant” is not really helpful: Personas in system prompts do not improve performances of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024. Association for Computational Linguistics, Miami, Florida, USA, 1...

[58] Ben Zhou, Kyle Richardson, Xiaodong Yu, and Dan Roth. 2022. Learning to Decompose: Hypothetical Question Decomposition Based on Comparable Texts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2223–2235

[59] Ben Zhou, Hongming Zhang, Sihao Chen, Dian Yu, Hongwei Wang, Baolin Peng, Dan Roth, and Dong Yu. 2024. Conceptual and Unbiased Reasoning in Language Models. arXiv:2404.00205 [cs.CL]

9 Appendix

9.1 RECAP Likert-Scale Probability Mapping

Likert value     Probability
very-unlikely    0.05
unlikely         0.25
neutral          0.50
likely           0.75
very-likely      0.95

Table 4: Likert-scale t...
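The Likert-to-probability mapping listed in Table 4 can be written as a small lookup. This is only an illustrative sketch: the probabilities are the ones given in the table, while the names `LIKERT_TO_PROB` and `likert_to_probability` and the label normalization (case, spaces, underscores) are assumptions made here and do not come from the paper.

```python
# Probabilities copied from Table 4; the identifiers are hypothetical.
LIKERT_TO_PROB = {
    "very-unlikely": 0.05,
    "unlikely": 0.25,
    "neutral": 0.50,
    "likely": 0.75,
    "very-likely": 0.95,
}


def likert_to_probability(label: str) -> float:
    """Map a Likert label to its probability.

    Normalizing case, spaces, and underscores is an assumption of this
    sketch, added so that model outputs such as "Very Likely" still resolve.
    """
    key = label.strip().lower().replace(" ", "-").replace("_", "-")
    if key not in LIKERT_TO_PROB:
        raise ValueError(f"unknown Likert label: {label!r}")
    return LIKERT_TO_PROB[key]
```

Raising on an unknown label, rather than falling back to the neutral value, keeps a malformed model output from being silently treated as 0.50.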

Factor name: Description (value1/value2)
Factor name: Description (value1/value2)

Do not include any explanations after the factors. Do not start with END_OF_FACTORS. After you list the factors, output a single line exactly: END_OF_FACTORS
Never output END_OF_FACTORS before the list. Only place it after the final factor line.

Example:
Self-efficacy: Person’s belief in their ability to handle challenges (low/high)
Social support: Availability of emotional support from others (absent/present)
Stress level: Amount of psychological pressure experienced (low/high)

Factor Value Selection Prompt

Task: Analyze this situation and determine the specific value for each psychological factor:

SITUATION: {situation}

PSYCHOLOGICAL FACTORS: {factors_text}

For each factor, choose the most appropriate value based on what you can observe in the situation. Provid...
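The factor-listing format requested above (one `Factor name: Description (value1/value2)` line per factor, closed by a single `END_OF_FACTORS` line) lends itself to a strict parser. The sketch below is an assumption about how such output could be consumed and is not the authors' implementation: `Factor`, `parse_factors`, and the regular expression are hypothetical names and choices. It tolerates an optional leading list number and rejects output where the sentinel is missing or comes before any factor.

```python
import re
from typing import List, NamedTuple, Tuple


class Factor(NamedTuple):
    name: str
    description: str
    values: Tuple[str, ...]


SENTINEL = "END_OF_FACTORS"

# Hypothetical pattern: optional "1." or "1)" prefix, then
# "name: description (value1/value2)".
_FACTOR_RE = re.compile(
    r"^(?:\d+[.)]\s*)?(?P<name>[^:]+):\s*(?P<desc>.+?)\s*"
    r"\((?P<vals>[^()]*/[^()]*)\)\s*$"
)


def parse_factors(output: str) -> List[Factor]:
    """Collect factor lines up to the END_OF_FACTORS sentinel."""
    factors: List[Factor] = []
    for line in output.splitlines():
        stripped = line.strip()
        if stripped == SENTINEL:
            if not factors:
                raise ValueError("END_OF_FACTORS appeared before any factor")
            return factors  # anything after the sentinel is ignored
        match = _FACTOR_RE.match(stripped)
        if match:
            values = tuple(v.strip() for v in match.group("vals").split("/"))
            factors.append(
                Factor(match.group("name").strip(), match.group("desc").strip(), values)
            )
    raise ValueError("missing END_OF_FACTORS sentinel")
```

Lines that do not match the pattern are skipped rather than treated as errors, which is a design choice of this sketch; a stricter variant could raise on them instead.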