SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation

Andrea Hartzler; Feng Chen; Manas Satish Bedmutha; Nadir Weibel; Trevor Cohen

arxiv: 2505.04152 · v2 · pith:6XATNZWUnew · submitted 2025-05-07 · 💻 cs.CL · cs.CY· cs.HC

SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation

Manas Satish Bedmutha , Feng Chen , Andrea Hartzler , Trevor Cohen , Nadir Weibel This is my paper

Pith reviewed 2026-05-22 16:57 UTC · model grok-4.3

classification 💻 cs.CL cs.CYcs.HC

keywords large language modelssocial signal processingpatient-provider communicationclinical transcriptsensemble methodsbias in AIhealthcare communicationsocial behaviors

0 comments

The pith

Large language models can detect social signals in clinical transcripts, and an agreement-weighted ensemble using cross-model patterns improves accuracy and stability despite variations by race and visit segment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can identify twenty social behaviors in patient-provider transcripts without any fine-tuning or training data. It shows that detection succeeds across different model families and prompting styles, yet accuracy shifts depending on the patient's race and which part of the visit is being discussed. To handle this under simple API access, the authors build an ensemble that weights each model's output according to how often models agree at the group level. This ensemble raises both overall accuracy and consistency beyond the strongest single model. The result points to a practical route for measuring communication quality at the scale of entire health systems rather than small samples.

Core claim

Across three model families and multiple prompting strategies, LLMs reliably detect social signals from clinical transcripts without fine-tuning, though performance varies by patient race and visit segment. An agreement-weighted ensemble that draws on group-level agreement patterns among the models improves both accuracy and stability over the best individual model while remaining compatible with query-only API constraints.

What carries the argument

Agreement-weighted ensemble that aggregates LLM outputs by weighting each model according to observed group-level agreement patterns across transcripts.

If this is right

Communication quality in clinical encounters can be tracked continuously across large numbers of visits using only existing LLM APIs.
Detection becomes less sensitive to demographic differences in patients or changes across stages of a visit.
No custom training data or model fine-tuning is required, lowering the barrier to deployment in health-care settings.
Stability of social-signal measurements increases, supporting more trustworthy downstream uses such as quality monitoring or training feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agreement-weighting idea could be tested on conversational data outside medicine, such as customer-service calls or classroom discussions, to see whether demographic or contextual variability persists.
Combining the ensemble with a small number of human checks on high-disagreement cases might further tighten performance without losing scalability.
If agreement patterns turn out to be stable across institutions, the method could serve as a lightweight calibration layer for other LLM applications that process dialogue.

Load-bearing premise

That group-level agreement patterns observed across multiple LLMs under query-only API constraints provide a reliable and generalizable way to correct for performance variability tied to patient race and visit segment.

What would settle it

Apply the same ensemble procedure to a fresh set of clinical transcripts stratified by patient race and visit segment and check whether accuracy and stability gains disappear or reverse compared with the best single model.

Figures

Figures reproduced from arXiv: 2505.04152 by Andrea Hartzler, Feng Chen, Manas Satish Bedmutha, Nadir Weibel, Trevor Cohen.

**Figure 1.** Figure 1: Percentage of high social signal labels distribution across 3-minute segments across the entire sample [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗

**Figure 2.** Figure 2: Histogram distribution of number of correct predictions per sample. We see that the correct predictions are normally [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

**Figure 3.** Figure 3: Demographic Parity Ratio between white (n=74) and non-white (n=17) patients. We see that most configurations [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗

read the original abstract

Effective patient-provider communication is difficult to assess at scale. We examine whether large language models (LLMs) can track 20 social behaviors from clinical transcripts without fine-tuning. Across three model families and multiple prompting strategies, LLMs reliably detect social signals, though performance varies by patient race and visit segment. To address this variability under query-only API constraints, we introduce an agreement-weighted ensemble using group-level agreement patterns. This approach improves both accuracy and stability over the best individual model, demonstrating a practical pathway for scalable social signal tracking in clinical conversations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs pick up social behaviors in clinical transcripts with zero-shot prompting and an agreement ensemble that improves stability, though race and segment gaps persist and shared model biases could limit the fix.

read the letter

The main point here is that LLMs can track 20 social behaviors in patient-provider transcripts without fine-tuning, and the authors' agreement-weighted ensemble, built from group-level patterns across models, lifts both accuracy and stability over the best single model. They ran this across three model families and several prompting setups, noting clear differences in performance tied to patient race and visit segment. The ensemble is a direct response to query-only API limits where you only get final labels, not internals. That setup makes the work practical for scaling communication quality checks in healthcare without manual coding or retraining. What is new is the specific empirical mapping of these 20 behaviors plus the ensemble rule derived from observed agreement rather than from probabilities or hidden states. The paper does a solid job keeping the focus on real transcripts and delivering measurable gains from the aggregation step. It treats the variability as a feature to manage rather than ignore. The soft spots are mostly around the race-linked gaps and whether the ensemble truly corrects them. If the three model families share correlated biases on clinical language, high agreement on a segment could just mean they are all making the same mistake, especially on certain demographic groups. The abstract flags the variability but does not show subgroup error breakdowns or tests that would rule this out. The stress-test concern lands here: agreement on final labels alone is a weak proxy for correctness when biases align. Full methods would need to include how the group patterns were computed and whether the gains hold after stratifying by race. This is aimed at clinical NLP researchers and healthcare quality teams who want lightweight tools for large-scale transcript analysis. A reader already working with LLMs on dialogue data would pick up the ensemble idea quickly and see where it fits their own constraints. It deserves peer review because the experiments are concrete, the problem is real, and the limitations are stated plainly enough for referees to tighten the validation.

Referee Report

2 major / 2 minor

Summary. The paper examines whether LLMs can detect 20 social behaviors in clinical transcripts without fine-tuning. Across three model families and prompting strategies, detection is reported as reliable but with performance varying by patient race and visit segment. To handle variability under query-only API constraints, the authors introduce an agreement-weighted ensemble derived from group-level agreement patterns, claiming this improves both accuracy and stability over the best single model.

Significance. If validated, the work provides a practical, no-fine-tuning method for large-scale social signal processing in healthcare conversations. The ensemble approach under API constraints is a useful engineering contribution for reproducibility in clinical NLP. Credit is due for the multi-model evaluation and explicit handling of demographic variability in results.

major comments (2)

[§4.2] §4.2 (Ensemble Construction): The agreement-weighted ensemble is defined using observed cross-model label agreement as a proxy for reliability, but the manuscript provides no analysis showing that high agreement correlates with ground-truth accuracy rather than shared demographic biases across the three model families. This is load-bearing for the central claim that the ensemble corrects race- and segment-linked variability.
[Results section] Results section, performance tables: Improvements from the ensemble over the best individual model are reported without statistical significance tests, confidence intervals, or ablation on agreement thresholds; given the noted variability by race, it is unclear whether the stability gains are robust or merely reflect correlated errors.

minor comments (2)

[§3.1] The prompting strategy descriptions in §3.1 use inconsistent terminology for 'query-only' vs. 'contextual' variants; standardize notation for reproducibility.
[Figure 2] Figure 2 (agreement heatmaps) lacks axis labels for visit segments; add explicit segment identifiers to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major concerns regarding the ensemble construction and the statistical analysis of results below. We have incorporated revisions to strengthen the manuscript.

read point-by-point responses

Referee: [§4.2] §4.2 (Ensemble Construction): The agreement-weighted ensemble is defined using observed cross-model label agreement as a proxy for reliability, but the manuscript provides no analysis showing that high agreement correlates with ground-truth accuracy rather than shared demographic biases across the three model families. This is load-bearing for the central claim that the ensemble corrects race- and segment-linked variability.

Authors: We appreciate this observation. The original manuscript did not include an explicit correlation analysis between agreement levels and ground-truth accuracy. To address this, we have added an analysis in the revised §4.2 that computes the correlation between agreement scores and accuracy across demographic subgroups. The results show a positive correlation, suggesting that agreement serves as a reasonable proxy for reliability rather than solely reflecting shared biases. We also include a discussion of potential demographic biases in the models and how the ensemble approach helps mitigate variability observed by race and segment. revision: yes
Referee: [Results section] Results section, performance tables: Improvements from the ensemble over the best individual model are reported without statistical significance tests, confidence intervals, or ablation on agreement thresholds; given the noted variability by race, it is unclear whether the stability gains are robust or merely reflect correlated errors.

Authors: We agree that the presentation of results would benefit from statistical rigor. In the revised manuscript, we have added statistical significance tests (using McNemar's test for paired comparisons) between the ensemble and individual models, along with 95% confidence intervals for all reported metrics in the performance tables. Furthermore, we conducted an ablation study varying the agreement threshold and report the impact on performance and stability in a new supplementary figure. These additions confirm that the observed improvements are statistically significant and robust across different thresholds, rather than arising from correlated errors. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical LLM prompting and agreement-based ensemble

full rationale

The paper reports experimental results from prompting three LLM families on clinical transcripts to detect 20 social signals, notes observed performance variation by patient race and visit segment, and constructs an agreement-weighted ensemble from cross-model label agreement patterns. No equations, derivations, or predictions are present that reduce to inputs by construction. The ensemble is a post-hoc aggregation rule computed from observed data rather than a fitted parameter or self-referential definition. No load-bearing self-citations or uniqueness theorems are invoked for the core claims. The work is self-contained empirical evaluation against direct accuracy and stability metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical testing of LLM prompting capabilities and the effectiveness of agreement-based aggregation rather than on new theoretical derivations or postulates.

axioms (1)

domain assumption LLMs can classify social behaviors in clinical text from instructions alone without domain-specific fine-tuning.
The study tests this directly through prompting experiments across model families.

pith-pipeline@v0.9.0 · 5632 in / 1274 out tokens · 75611 ms · 2026-05-22T16:57:10.910051+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce Social-LM, our pipeline for modeling and evaluating SSP from clinical transcripts... agreement-weighted ensemble using group-level agreement patterns
IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Table 2... ensemble model... 0.606 balanced accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters
cs.CL 2026-03 unverdicted novelty 4.0

Zero-shot GPT-OSS detects depression from 1,108 primary care encounter transcripts with AUPRC 0.51 and AUROC 0.77, with meaningful signals in the first 128 patient tokens and added value from dyadic mirroring.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

AHRQ. 2006. Effects of Establishing Focus in the Medical Interview (R01HS 013172 PI Lynne Robins). https://www.ahrq.gov/sites/ default/files/2024-07/robins-report.pdf Accessed October 9, 2024

work page 2006
[2]

Turki M Alanzi. 2023. Impact of ChatGPT on teleconsultants in healthcare: perceptions of healthcare experts in Saudi Arabia. Journal of multidisciplinary healthcare (2023), 2309–2321

work page 2023
[3]

John W Ayers, Adam Poliak, Mark Dredze, Eric C Leas, Zechariah Zhu, Jessica B Kelley, Dennis J Faix, Aaron M Goodman, Christopher A Longhurst, Michael Hogarth, et al. 2023. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA internal medicine 183, 6 (2023), 589–596

work page 2023
[4]

Emily Bascom, Reggie Casanova-Perez, Kelly Tobar, Manas Satish Bedmutha, Harshini Ramaswamy, Wanda Pratt, Janice Sabin, Brian Wood, Nadir Weibel, and Andrea Hartzler. 2024. Designing Communication Feedback Systems To Reduce Healthcare Providers’ Implicit Biases In Patient Encounters. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems

work page 2024
[5]

Manas Satish Bedmutha, Emily Bascom, Kimberly R Sladek, Kelly Tobar, Reggie Casanova-Perez, Alexandra Andreiu, Amrit Bhat, Sabrina Mangal, Brian R Wood, Janice Sabin, et al. 2024. Artificial intelligence-generated feedback on social signals in patient–provider communication: technical performance, feedback usability, and impact. JAMIA open 7, 4 (2024), ooae106

work page 2024
[6]

Manas Satish Bedmutha, Poorva Satish Bedmutha, and Nadir Weibel. 2023. Privacy-Aware Respiratory Symptom Detection in- the-wild. In Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing . Association for Computing Machinery, New York, NY, USA, 4...

work page doi:10.1145/3594739.3610733 2023
[7]

Manas Satish Bedmutha, Amrit Bhat, Sabrina Mangal, Emily Bascom, Wanda Pratt, Brian Wood, Janice Sabin, Nadir Weibel, and Andrea Hartzler. 2023. Towards inferring implicit bias in clinical interactions using social signals. AMIA Annual Symposium. AI Showcase Stage III (2023)

work page 2023
[8]

Manas Satish Bedmutha, Anuujin Tsedenbal, Kelly Tobar, Sarah Borsotto, Kimberly R Sladek, Deepansha Singh, Reggie Casanova-Perez, Emily Bascom, Brian Wood, Janice Sabin, et al. 2024. ConverSense: An Automated Approach to Assess Patient-Provider Interactions using Social Signals. In Proceedings of the CHI Conference on Human Factors in Computing Systems . 1–22

work page 2024
[9]

Sudershan Boovaraghavan, Haozhe Zhou, Mayank Goel, and Yuvraj Agarwal. 2024. Kirigami: Lightweight speech filtering for privacy- preserving activity recognition using audio. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–28

work page 2024
[10]

Ryan L Boyd, Ashwini Ashokkumar, Sarah Seraj, and James W Pennebaker. 2022. The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin 10 (2022), 1–47

work page 2022
[11]

Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, and Marie-Philippe Gill. 2020. Pyannote. audio: neural building blocks for speaker diarization. In ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP) . IEEE, 7124–7128

work page 2020
[12]

Feng Chen, Manas Satish Bedmutha, Ray-Yuan Chung, Janice Sabin, Wanda Pratt, Brian R Wood, Nadir Weibel, Andrea L Hartzler, and Trevor Cohen. 2024. Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations. arXiv preprint arXiv:2407.17477 (2024)

work page arXiv 2024
[13]

Wenqiang Chen, Jiaxuan Cheng, Leyao Wang, Wei Zhao, and Wojciech Matusik. 2024. Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 4 (2024), 1–26

work page 2024
[14]

Zhuang Chen, Jiawen Deng, Jinfeng Zhou, Jincenzi Wu, Tieyun Qian, and Minlie Huang. 2024. Depression detection in clinical interviews with LLM-empowered structural element graph. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) . 8181–8194

work page 2024
[15]

Bhawana Chhaglani, Camellia Zakaria, Adam Lechowicz, Jeremy Gummeson, and Prashant Shenoy. 2022. Flowsense: Monitoring airflow in building ventilation systems using audio sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1 (2022), 1–26

work page 2022
[16]

Georgios Chochlakis, Niyantha Maruthu Pandiyan, Kristina Lerman, and Shrikanth Narayanan. 2025. Larger language models don’t care how you think: Why chain-of-thought prompting fails in subjective tasks. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 1–5

work page 2025
[17]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[18]

Lisa A Cooper, Debra L Roter, Kathryn A Carson, Mary Catherine Beach, Janice A Sabin, Anthony G Greenwald, and Thomas S Inui

work page
[19]

The associations of clinicians’ implicit attitudes about race with medical visit communication and patient ratings of interpersonal , Vol. 1, No. 1, Article . Publication date: May 2025. LLMs and Social Behavior in Clinical Conversations • 29 care. American journal of public health 102, 5 (2012), 979–987

work page 2025
[20]

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547 (2020)

work page arXiv 2020
[21]

Zachary Englhardt, Chengqian Ma, Margaret E Morris, Chun-Cheng Chang, Xuhai" Orson" Xu, Lianhui Qin, Daniel McDuff, Xin Liu, Shwetak Patel, and Vikram Iyer. 2024. From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models. Proceedings of the ACM on Interactive, Mobile, Weara...

work page 2024
[22]

Kyle M Fargen, Timothy O’Connor, Steven Raymond, Justin M Sporrer, and William A Friedman. 2012. An observational study of hospital paging practices and workflow interruption among on-call junior neurological surgery residents. Journal of graduate medical education 4, 4 (2012), 467–471

work page 2012
[23]

Heather A Faucett, Matthew L Lee, and Scott Carter. 2017. I should listen more: real-time sensing and feedback of non-verbal communication in video telehealth. Proceedings of the ACM on Human-Computer Interaction 1, CSCW (2017), 1–19

work page 2017
[24]

Shutong Feng, Guangzhi Sun, Nurul Lubis, Wen Wu, Chao Zhang, and Milica Gašić. 2023. Affect recognition in conversations using large language models. arXiv preprint arXiv:2309.12881 (2023)

work page arXiv 2023
[25]

Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, and Marzyeh Ghassemi. 2024. Can AI relate: Testing large language model response for mental health support. arXiv preprint arXiv:2405.12021 (2024)

work page arXiv 2024
[26]

Declan Grabb. 2024. pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models. In NeurIPS 2024 Workshop on Behavioral Machine Learning . https://openreview.net/forum?id=BODZDzpXUF

work page 2024
[27]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al

work page
[29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Nao Hagiwara, Jennifer Elston Lafata, Briana Mezuk, Scott R Vrana, and Michael D Fetters. 2019. Detecting implicit racial bias in provider communication behaviors to reduce disparities in healthcare: challenges, solutions, and future directions for provider communication training. Patient education and counseling 102, 9 (2019), 1738–1743

work page 2019
[31]

AL Hartzler, RA Patel, M Czerwinski, W Pratt, A Roseway, N Chandrasekaran, and A Back. 2014. Real-time feedback on nonverbal clinical communication. Methods of information in medicine 53, 05 (2014), 389–405

work page 2014
[32]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al . 2022. Lora: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3

work page 2022
[33]

Hannah La, Ziming Li, Ha-Kyung Kong, and Roshan L Peiris. 2025. Exploring the Efficacy of a Chatbot Training Application in Alleviating Graduate Students’ Public-Speaking Anxiety During Q&A. (2025)

work page 2025
[34]

Henry A Landsberger. 1958. Hawthorne Revisited: Management and the Worker, Its Critics, and Developments in Human Relations in Industry. (1958)

work page 1958
[35]

Virginia LeBaron, Tabor Flickinger, David Ling, Hansung Lee, James Edwards, Anant Tewari, Zhiyuan Wang, and Laura E Barnes. 2023. Feasibility and acceptability testing of CommSense: A novel communication technology to enhance health equity in clinician–patient interactions. Digital Health 9 (2023), 20552076231184991

work page 2023
[36]

Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, Runqi Qiao, and Sirui Wang. 2023. InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models. arXiv preprint arXiv:2309.11911 (2023)

work page arXiv 2023
[37]

Chunfeng Liu, Karen M Scott, Renee L Lim, Silas Taylor, and Rafael A Calvo. 2016. EQClinic: a platform for learning communication skills in clinical consultations. Medical education online 21, 1 (2016), 31801

work page 2016
[38]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM computing surveys 55, 9 (2023), 1–35

work page 2023
[39]

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. 2025. Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey. arXiv preprint arXiv:2503.15850 (2025)

work page arXiv 2025
[40]

Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. 2023. Large language models are few-shot health learners. arXiv preprint arXiv:2305.15525 (2023)

work page arXiv 2023
[41]

Antoine Lizée, Pierre-Auguste Beaucoté, James Whitbeck, Marion Doumeingts, Anaël Beaugnon, and Isabelle Feldhaus. 2024. Conversa- tional Medical AI: Ready for Practice. arXiv preprint arXiv:2411.12808 (2024)

work page arXiv 2024
[42]

Man Luo, Christopher J Warren, Lu Cheng, Haidar M Abdul-Muhsin, and Imon Banerjee. 2024. Assessing empathy in large language models with real-world physician-patient interactions. In 2024 IEEE International Conference on Big Data (BigData) . IEEE, 6510–6519

work page 2024
[43]

Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L Holt, and Fernando De la Torre. 2024. Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation. arXiv preprint arXiv:2409.09135 (2024)

work page arXiv 2024
[44]

Abdulqadir J Nashwan, Ahmad A Abujaber, and Hassan Choudry. 2023. Embracing the future of physician-patient communication: GPT-4 in gastroenterology. Gastroenterology & Endoscopy 1, 3 (2023), 132–135. , Vol. 1, No. 1, Article . Publication date: May 2025. 30 • Manas Satish Bedmutha, Feng Chen, Andrea Hartzler, Trevor Cohen, and Nadir Weibel

work page 2023
[45]

Junghwan Park, Meelim Kim, Mohamed El Mistiri, Rachael Kha, Sarasij Banerjee, Lisa Gotzian, Guillaume Chevance, Daniel E Rivera, Predrag Klasnja, Eric Hekler, et al . 2023. Advancing understanding of just-in-time states for supporting physical activity (Project JustWalk JITAI): protocol for a System ID study of just-in-time adaptive interventions. JMIR Re...

work page 2023
[46]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI (2022). https://openai.com/research/whisper

work page 2022
[47]

Jeffrey D Robinson. 2003. An interactional structure of medical activities during acute visits and its implications for patients’ participation. Health communication 15, 1 (2003), 27–59

work page 2003
[48]

Jeffrey D Robinson and John Heritage. 2006. Physicians’ opening questions and patients’ satisfaction.Patient education and counseling 60, 3 (2006), 279–285

work page 2006
[49]

Debra Roter and Susan Larson. 2002. The Roter interaction analysis system (RIAS): utility and flexibility for analysis of medical interactions. Patient education and counseling 46, 4 (2002), 243–251

work page 2002
[50]

Debra L Roter, Judith A Hall, Danielle Blanch-Hartigan, Susan Larson, and Richard M Frankel. 2011. Slicing it thin: new methods for brief sampling analysis using RIAS-coded medical dialogue. Patient education and counseling 82, 3 (2011), 410–419

work page 2011
[51]

Philip Sedgwick and Nan Greenwood. 2015. Understanding the Hawthorne effect. Bmj 351 (2015)

work page 2015
[52]

Raj Sanjay Shah, Faye Holt, Shirley Anugrah Hayati, Aastha Agarwal, Yi-Chia Wang, Robert E Kraut, and Diyi Yang. 2022. Modeling motivational interviewing strategies on an online peer-to-peer counseling platform. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–24

work page 2022
[53]

Vera Sorin, Dana Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, and Eyal Klang. 2024. Large Language Models and Empathy: Systematic Review. Journal of Medical Internet Research 26 (2024), e52597

work page 2024
[54]

Ian Steenstra, Farnaz Nouraei, and Timothy W Bickmore. 2025. Scaffolding Empathy: Training Counselors with Simulated Patients and Utterance-level Performance Visualizations. arXiv preprint arXiv:2502.18673 (2025)

work page arXiv 2025
[55]

Richard L Street Jr, Howard Gordon, and Paul Haidet. 2007. Physicians’ communication and perceptions of patients: is it how they look, how they talk, or is it just the doctor? Social science & medicine 65, 3 (2007), 586–598

work page 2007
[56]

G. Swain. 2024. Patients may suffer from hallucinations of AI Medical Transcription Tools. CIO (2024). https://www.cio.com/article/ 3593403/patients-may-suffer-from-hallucinations-of-ai-medical-transcription-tools.html

work page 2024
[57]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

Aaron A Tierney, Gregg Gayre, Brian Hoberman, Britt Mattern, Manuel Ballesca, Patricia Kipnis, Vincent Liu, and Kristine Lee. 2024. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catalyst Innovations in Care Delivery 5, 3 (2024), CAT–23

work page 2024
[59]

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. 2025. Towards conversational diagnostic artificial intelligence. Nature (2025), 1–9

work page 2025
[60]

Equal Employment Opportunity Commission

U.S. Equal Employment Opportunity Commission. 1978. Uniform Guidelines on Employee Selection Procedures. https://www.eeoc.gov/laws/guidance/uniform-guidelines-employment-selection-procedures. Federal Register, Volume 43, Number 138, July 20, 1978

work page 1978
[61]

Alexandria Vail, Jeffrey Girard, Lauren Bylsma, Jeffrey Cohn, Jay Fournier, Holly Swartz, and Louis-Philippe Morency. 2022. Toward causal understanding of therapist-client relationships: A study of language modality and social entrainment. In Proceedings of the 2022 International Conference on Multimodal Interaction . 487–494

work page 2022
[62]

Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and vision computing 27, 12 (2009), 1743–1759

work page 2009
[63]

Aditya B Vishwanath, Vijay Kumar Srinivasalu, and Narayana Subramaniam. 2024. Role of large language models in improving provider–patient experience and interaction efficiency: A scoping review. Artificial Intelligence in Health (2024), 4808

work page 2024
[64]

Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, and Hank Liao. 2024. Diarizationlm: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506 (2024)

work page arXiv 2024
[65]

Zhiyuan Wang, Nusayer Hassan, Virginia LeBaron, Tabor Flickinger, David Ling, James Edwards, Congyu Wu, Mehdi Boukhechba, and Laura E Barnes. 2024. CommSense: A Wearable Sensing Computational Framework for Evaluating Patient-Clinician Interactions. Proceedings of the ACM on Human-Computer Interaction 8, CSCW2 (2024), 1–31

work page 2024
[66]

Jocelyn White, Wendy Levinson, and Debra Roter. 1994. Oh, by the way. . . The closing moments of the medical visit. Journal of General Internal Medicine 9 (1994), 24–28

work page 1994
[67]

Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. 2024. Beyond silent letters: Amplifying llms in emotion recognition with vocal nuances. arXiv preprint arXiv:2407.21315 (2024)

work page arXiv 2024
[68]

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K Dey, and Dakuo Wang

work page
[69]

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–32

Mental-llm: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–32. , Vol. 1, No. 1, Article . Publication date: May 2025. LLMs and Social Behavior in Clinical Conversations • 31

work page 2024
[70]

Haoning Xue, Wang Liao, and Jingwen Zhang. 2024. Interaction dynamics of social support expressions predict future support-seeking behaviors in online support groups. Computers in Human Behavior 156 (2024), 108224

work page 2024
[71]

Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, and Sophia Ananiadou. 2023. MentalLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models. arXiv preprint arXiv:2309.13567 (2023)

work page arXiv 2023
[72]

Ziqi Yang, Xuhai Xu, Bingsheng Yao, Ethan Rogers, Shao Zhang, Stephen Intille, Nawar Shara, Guodong Gordon Gao, and Dakuo Wang

work page
[73]

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1–35

Talk2care: An llm-based voice assistant for communication between healthcare providers and older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1–35

work page 2024
[74]

Travis Zack, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, David W Bates, Raja-Elie E Abdulnour, et al. 2023. Coding inequity: assessing GPT-4’s potential for perpetuating racial and gender biases in healthcare. medRxiv (2023), 2023–07

work page 2023
[75]

Did you see any presence of signal_name in this slice?

Maxime Zanella and Ismail Ben Ayed. 2024. On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 23783–23793. , Vol. 1, No. 1, Article . Publication date: May 2025. 32 • Manas Satish Bedmutha, Feng Chen, Andrea Hartzler,...

work page 2024

[1] [1]

AHRQ. 2006. Effects of Establishing Focus in the Medical Interview (R01HS 013172 PI Lynne Robins). https://www.ahrq.gov/sites/ default/files/2024-07/robins-report.pdf Accessed October 9, 2024

work page 2006

[2] [2]

Turki M Alanzi. 2023. Impact of ChatGPT on teleconsultants in healthcare: perceptions of healthcare experts in Saudi Arabia. Journal of multidisciplinary healthcare (2023), 2309–2321

work page 2023

[3] [3]

John W Ayers, Adam Poliak, Mark Dredze, Eric C Leas, Zechariah Zhu, Jessica B Kelley, Dennis J Faix, Aaron M Goodman, Christopher A Longhurst, Michael Hogarth, et al. 2023. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA internal medicine 183, 6 (2023), 589–596

work page 2023

[4] [4]

Emily Bascom, Reggie Casanova-Perez, Kelly Tobar, Manas Satish Bedmutha, Harshini Ramaswamy, Wanda Pratt, Janice Sabin, Brian Wood, Nadir Weibel, and Andrea Hartzler. 2024. Designing Communication Feedback Systems To Reduce Healthcare Providers’ Implicit Biases In Patient Encounters. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems

work page 2024

[5] [5]

Manas Satish Bedmutha, Emily Bascom, Kimberly R Sladek, Kelly Tobar, Reggie Casanova-Perez, Alexandra Andreiu, Amrit Bhat, Sabrina Mangal, Brian R Wood, Janice Sabin, et al. 2024. Artificial intelligence-generated feedback on social signals in patient–provider communication: technical performance, feedback usability, and impact. JAMIA open 7, 4 (2024), ooae106

work page 2024

[6] [6]

Manas Satish Bedmutha, Poorva Satish Bedmutha, and Nadir Weibel. 2023. Privacy-Aware Respiratory Symptom Detection in- the-wild. In Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing . Association for Computing Machinery, New York, NY, USA, 4...

work page doi:10.1145/3594739.3610733 2023

[7] [7]

Manas Satish Bedmutha, Amrit Bhat, Sabrina Mangal, Emily Bascom, Wanda Pratt, Brian Wood, Janice Sabin, Nadir Weibel, and Andrea Hartzler. 2023. Towards inferring implicit bias in clinical interactions using social signals. AMIA Annual Symposium. AI Showcase Stage III (2023)

work page 2023

[8] [8]

Manas Satish Bedmutha, Anuujin Tsedenbal, Kelly Tobar, Sarah Borsotto, Kimberly R Sladek, Deepansha Singh, Reggie Casanova-Perez, Emily Bascom, Brian Wood, Janice Sabin, et al. 2024. ConverSense: An Automated Approach to Assess Patient-Provider Interactions using Social Signals. In Proceedings of the CHI Conference on Human Factors in Computing Systems . 1–22

work page 2024

[9] [9]

Sudershan Boovaraghavan, Haozhe Zhou, Mayank Goel, and Yuvraj Agarwal. 2024. Kirigami: Lightweight speech filtering for privacy- preserving activity recognition using audio. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–28

work page 2024

[10] [10]

Ryan L Boyd, Ashwini Ashokkumar, Sarah Seraj, and James W Pennebaker. 2022. The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin 10 (2022), 1–47

work page 2022

[11] [11]

Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, and Marie-Philippe Gill. 2020. Pyannote. audio: neural building blocks for speaker diarization. In ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP) . IEEE, 7124–7128

work page 2020

[12] [12]

Feng Chen, Manas Satish Bedmutha, Ray-Yuan Chung, Janice Sabin, Wanda Pratt, Brian R Wood, Nadir Weibel, Andrea L Hartzler, and Trevor Cohen. 2024. Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations. arXiv preprint arXiv:2407.17477 (2024)

work page arXiv 2024

[13] [13]

Wenqiang Chen, Jiaxuan Cheng, Leyao Wang, Wei Zhao, and Wojciech Matusik. 2024. Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 4 (2024), 1–26

work page 2024

[14] [14]

Zhuang Chen, Jiawen Deng, Jinfeng Zhou, Jincenzi Wu, Tieyun Qian, and Minlie Huang. 2024. Depression detection in clinical interviews with LLM-empowered structural element graph. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) . 8181–8194

work page 2024

[15] [15]

Bhawana Chhaglani, Camellia Zakaria, Adam Lechowicz, Jeremy Gummeson, and Prashant Shenoy. 2022. Flowsense: Monitoring airflow in building ventilation systems using audio sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1 (2022), 1–26

work page 2022

[16] [16]

Georgios Chochlakis, Niyantha Maruthu Pandiyan, Kristina Lerman, and Shrikanth Narayanan. 2025. Larger language models don’t care how you think: Why chain-of-thought prompting fails in subjective tasks. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 1–5

work page 2025

[17] [17]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[18] [18]

Lisa A Cooper, Debra L Roter, Kathryn A Carson, Mary Catherine Beach, Janice A Sabin, Anthony G Greenwald, and Thomas S Inui

work page

[19] [19]

The associations of clinicians’ implicit attitudes about race with medical visit communication and patient ratings of interpersonal , Vol. 1, No. 1, Article . Publication date: May 2025. LLMs and Social Behavior in Clinical Conversations • 29 care. American journal of public health 102, 5 (2012), 979–987

work page 2025

[20] [20]

Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547 (2020)

work page arXiv 2020

[21] [21]

Zachary Englhardt, Chengqian Ma, Margaret E Morris, Chun-Cheng Chang, Xuhai" Orson" Xu, Lianhui Qin, Daniel McDuff, Xin Liu, Shwetak Patel, and Vikram Iyer. 2024. From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models. Proceedings of the ACM on Interactive, Mobile, Weara...

work page 2024

[22] [22]

Kyle M Fargen, Timothy O’Connor, Steven Raymond, Justin M Sporrer, and William A Friedman. 2012. An observational study of hospital paging practices and workflow interruption among on-call junior neurological surgery residents. Journal of graduate medical education 4, 4 (2012), 467–471

work page 2012

[23] [23]

Heather A Faucett, Matthew L Lee, and Scott Carter. 2017. I should listen more: real-time sensing and feedback of non-verbal communication in video telehealth. Proceedings of the ACM on Human-Computer Interaction 1, CSCW (2017), 1–19

work page 2017

[24] [24]

Shutong Feng, Guangzhi Sun, Nurul Lubis, Wen Wu, Chao Zhang, and Milica Gašić. 2023. Affect recognition in conversations using large language models. arXiv preprint arXiv:2309.12881 (2023)

work page arXiv 2023

[25] [25]

Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, and Marzyeh Ghassemi. 2024. Can AI relate: Testing large language model response for mental health support. arXiv preprint arXiv:2405.12021 (2024)

work page arXiv 2024

[26] [26]

Declan Grabb. 2024. pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models. In NeurIPS 2024 Workshop on Behavioral Machine Learning . https://openreview.net/forum?id=BODZDzpXUF

work page 2024

[27] [27]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al

work page

[29] [29]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Nao Hagiwara, Jennifer Elston Lafata, Briana Mezuk, Scott R Vrana, and Michael D Fetters. 2019. Detecting implicit racial bias in provider communication behaviors to reduce disparities in healthcare: challenges, solutions, and future directions for provider communication training. Patient education and counseling 102, 9 (2019), 1738–1743

work page 2019

[31] [31]

AL Hartzler, RA Patel, M Czerwinski, W Pratt, A Roseway, N Chandrasekaran, and A Back. 2014. Real-time feedback on nonverbal clinical communication. Methods of information in medicine 53, 05 (2014), 389–405

work page 2014

[32] [32]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al . 2022. Lora: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3

work page 2022

[33] [33]

Hannah La, Ziming Li, Ha-Kyung Kong, and Roshan L Peiris. 2025. Exploring the Efficacy of a Chatbot Training Application in Alleviating Graduate Students’ Public-Speaking Anxiety During Q&A. (2025)

work page 2025

[34] [34]

Henry A Landsberger. 1958. Hawthorne Revisited: Management and the Worker, Its Critics, and Developments in Human Relations in Industry. (1958)

work page 1958

[35] [35]

Virginia LeBaron, Tabor Flickinger, David Ling, Hansung Lee, James Edwards, Anant Tewari, Zhiyuan Wang, and Laura E Barnes. 2023. Feasibility and acceptability testing of CommSense: A novel communication technology to enhance health equity in clinician–patient interactions. Digital Health 9 (2023), 20552076231184991

work page 2023

[36] [36]

Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, Runqi Qiao, and Sirui Wang. 2023. InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models. arXiv preprint arXiv:2309.11911 (2023)

work page arXiv 2023

[37] [37]

Chunfeng Liu, Karen M Scott, Renee L Lim, Silas Taylor, and Rafael A Calvo. 2016. EQClinic: a platform for learning communication skills in clinical consultations. Medical education online 21, 1 (2016), 31801

work page 2016

[38] [38]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM computing surveys 55, 9 (2023), 1–35

work page 2023

[39] [39]

Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. 2025. Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey. arXiv preprint arXiv:2503.15850 (2025)

work page arXiv 2025

[40] [40]

Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. 2023. Large language models are few-shot health learners. arXiv preprint arXiv:2305.15525 (2023)

work page arXiv 2023

[41] [41]

Antoine Lizée, Pierre-Auguste Beaucoté, James Whitbeck, Marion Doumeingts, Anaël Beaugnon, and Isabelle Feldhaus. 2024. Conversa- tional Medical AI: Ready for Practice. arXiv preprint arXiv:2411.12808 (2024)

work page arXiv 2024

[42] [42]

Man Luo, Christopher J Warren, Lu Cheng, Haidar M Abdul-Muhsin, and Imon Banerjee. 2024. Assessing empathy in large language models with real-world physician-patient interactions. In 2024 IEEE International Conference on Big Data (BigData) . IEEE, 6510–6519

work page 2024

[43] [43]

Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L Holt, and Fernando De la Torre. 2024. Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation. arXiv preprint arXiv:2409.09135 (2024)

work page arXiv 2024

[44] [44]

Abdulqadir J Nashwan, Ahmad A Abujaber, and Hassan Choudry. 2023. Embracing the future of physician-patient communication: GPT-4 in gastroenterology. Gastroenterology & Endoscopy 1, 3 (2023), 132–135. , Vol. 1, No. 1, Article . Publication date: May 2025. 30 • Manas Satish Bedmutha, Feng Chen, Andrea Hartzler, Trevor Cohen, and Nadir Weibel

work page 2023

[45] [45]

Junghwan Park, Meelim Kim, Mohamed El Mistiri, Rachael Kha, Sarasij Banerjee, Lisa Gotzian, Guillaume Chevance, Daniel E Rivera, Predrag Klasnja, Eric Hekler, et al . 2023. Advancing understanding of just-in-time states for supporting physical activity (Project JustWalk JITAI): protocol for a System ID study of just-in-time adaptive interventions. JMIR Re...

work page 2023

[46] [46]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI (2022). https://openai.com/research/whisper

work page 2022

[47] [47]

Jeffrey D Robinson. 2003. An interactional structure of medical activities during acute visits and its implications for patients’ participation. Health communication 15, 1 (2003), 27–59

work page 2003

[48] [48]

Jeffrey D Robinson and John Heritage. 2006. Physicians’ opening questions and patients’ satisfaction.Patient education and counseling 60, 3 (2006), 279–285

work page 2006

[49] [49]

Debra Roter and Susan Larson. 2002. The Roter interaction analysis system (RIAS): utility and flexibility for analysis of medical interactions. Patient education and counseling 46, 4 (2002), 243–251

work page 2002

[50] [50]

Debra L Roter, Judith A Hall, Danielle Blanch-Hartigan, Susan Larson, and Richard M Frankel. 2011. Slicing it thin: new methods for brief sampling analysis using RIAS-coded medical dialogue. Patient education and counseling 82, 3 (2011), 410–419

work page 2011

[51] [51]

Philip Sedgwick and Nan Greenwood. 2015. Understanding the Hawthorne effect. Bmj 351 (2015)

work page 2015

[52] [52]

Raj Sanjay Shah, Faye Holt, Shirley Anugrah Hayati, Aastha Agarwal, Yi-Chia Wang, Robert E Kraut, and Diyi Yang. 2022. Modeling motivational interviewing strategies on an online peer-to-peer counseling platform. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–24

work page 2022

[53] [53]

Vera Sorin, Dana Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, and Eyal Klang. 2024. Large Language Models and Empathy: Systematic Review. Journal of Medical Internet Research 26 (2024), e52597

work page 2024

[54] [54]

Ian Steenstra, Farnaz Nouraei, and Timothy W Bickmore. 2025. Scaffolding Empathy: Training Counselors with Simulated Patients and Utterance-level Performance Visualizations. arXiv preprint arXiv:2502.18673 (2025)

work page arXiv 2025

[55] [55]

Richard L Street Jr, Howard Gordon, and Paul Haidet. 2007. Physicians’ communication and perceptions of patients: is it how they look, how they talk, or is it just the doctor? Social science & medicine 65, 3 (2007), 586–598

work page 2007

[56] [56]

G. Swain. 2024. Patients may suffer from hallucinations of AI Medical Transcription Tools. CIO (2024). https://www.cio.com/article/ 3593403/patients-may-suffer-from-hallucinations-of-ai-medical-transcription-tools.html

work page 2024

[57] [57]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

Aaron A Tierney, Gregg Gayre, Brian Hoberman, Britt Mattern, Manuel Ballesca, Patricia Kipnis, Vincent Liu, and Kristine Lee. 2024. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catalyst Innovations in Care Delivery 5, 3 (2024), CAT–23

work page 2024

[59] [59]

Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. 2025. Towards conversational diagnostic artificial intelligence. Nature (2025), 1–9

work page 2025

[60] [60]

Equal Employment Opportunity Commission

U.S. Equal Employment Opportunity Commission. 1978. Uniform Guidelines on Employee Selection Procedures. https://www.eeoc.gov/laws/guidance/uniform-guidelines-employment-selection-procedures. Federal Register, Volume 43, Number 138, July 20, 1978

work page 1978

[61] [61]

Alexandria Vail, Jeffrey Girard, Lauren Bylsma, Jeffrey Cohn, Jay Fournier, Holly Swartz, and Louis-Philippe Morency. 2022. Toward causal understanding of therapist-client relationships: A study of language modality and social entrainment. In Proceedings of the 2022 International Conference on Multimodal Interaction . 487–494

work page 2022

[62] [62]

Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and vision computing 27, 12 (2009), 1743–1759

work page 2009

[63] [63]

Aditya B Vishwanath, Vijay Kumar Srinivasalu, and Narayana Subramaniam. 2024. Role of large language models in improving provider–patient experience and interaction efficiency: A scoping review. Artificial Intelligence in Health (2024), 4808

work page 2024

[64] [64]

Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, and Hank Liao. 2024. Diarizationlm: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506 (2024)

work page arXiv 2024

[65] [65]

Zhiyuan Wang, Nusayer Hassan, Virginia LeBaron, Tabor Flickinger, David Ling, James Edwards, Congyu Wu, Mehdi Boukhechba, and Laura E Barnes. 2024. CommSense: A Wearable Sensing Computational Framework for Evaluating Patient-Clinician Interactions. Proceedings of the ACM on Human-Computer Interaction 8, CSCW2 (2024), 1–31

work page 2024

[66] [66]

Jocelyn White, Wendy Levinson, and Debra Roter. 1994. Oh, by the way. . . The closing moments of the medical visit. Journal of General Internal Medicine 9 (1994), 24–28

work page 1994

[67] [67]

Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. 2024. Beyond silent letters: Amplifying llms in emotion recognition with vocal nuances. arXiv preprint arXiv:2407.21315 (2024)

work page arXiv 2024

[68] [68]

Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K Dey, and Dakuo Wang

work page

[69] [69]

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–32

Mental-llm: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–32. , Vol. 1, No. 1, Article . Publication date: May 2025. LLMs and Social Behavior in Clinical Conversations • 31

work page 2024

[70] [70]

Haoning Xue, Wang Liao, and Jingwen Zhang. 2024. Interaction dynamics of social support expressions predict future support-seeking behaviors in online support groups. Computers in Human Behavior 156 (2024), 108224

work page 2024

[71] [71]

Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, and Sophia Ananiadou. 2023. MentalLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models. arXiv preprint arXiv:2309.13567 (2023)

work page arXiv 2023

[72] [72]

Ziqi Yang, Xuhai Xu, Bingsheng Yao, Ethan Rogers, Shao Zhang, Stephen Intille, Nawar Shara, Guodong Gordon Gao, and Dakuo Wang

work page

[73] [73]

Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1–35

Talk2care: An llm-based voice assistant for communication between healthcare providers and older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1–35

work page 2024

[74] [74]

Travis Zack, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, David W Bates, Raja-Elie E Abdulnour, et al. 2023. Coding inequity: assessing GPT-4’s potential for perpetuating racial and gender biases in healthcare. medRxiv (2023), 2023–07

work page 2023

[75] [75]

Did you see any presence of signal_name in this slice?

Maxime Zanella and Ismail Ben Ayed. 2024. On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 23783–23793. , Vol. 1, No. 1, Article . Publication date: May 2025. 32 • Manas Satish Bedmutha, Feng Chen, Andrea Hartzler,...

work page 2024