pith. sign in

arxiv: 2505.04152 · v2 · pith:6XATNZWUnew · submitted 2025-05-07 · 💻 cs.CL · cs.CY· cs.HC

SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation

Pith reviewed 2026-05-22 16:57 UTC · model grok-4.3

classification 💻 cs.CL cs.CYcs.HC
keywords large language modelssocial signal processingpatient-provider communicationclinical transcriptsensemble methodsbias in AIhealthcare communicationsocial behaviors
0
0 comments X

The pith

Large language models can detect social signals in clinical transcripts, and an agreement-weighted ensemble using cross-model patterns improves accuracy and stability despite variations by race and visit segment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models can identify twenty social behaviors in patient-provider transcripts without any fine-tuning or training data. It shows that detection succeeds across different model families and prompting styles, yet accuracy shifts depending on the patient's race and which part of the visit is being discussed. To handle this under simple API access, the authors build an ensemble that weights each model's output according to how often models agree at the group level. This ensemble raises both overall accuracy and consistency beyond the strongest single model. The result points to a practical route for measuring communication quality at the scale of entire health systems rather than small samples.

Core claim

Across three model families and multiple prompting strategies, LLMs reliably detect social signals from clinical transcripts without fine-tuning, though performance varies by patient race and visit segment. An agreement-weighted ensemble that draws on group-level agreement patterns among the models improves both accuracy and stability over the best individual model while remaining compatible with query-only API constraints.

What carries the argument

Agreement-weighted ensemble that aggregates LLM outputs by weighting each model according to observed group-level agreement patterns across transcripts.

If this is right

  • Communication quality in clinical encounters can be tracked continuously across large numbers of visits using only existing LLM APIs.
  • Detection becomes less sensitive to demographic differences in patients or changes across stages of a visit.
  • No custom training data or model fine-tuning is required, lowering the barrier to deployment in health-care settings.
  • Stability of social-signal measurements increases, supporting more trustworthy downstream uses such as quality monitoring or training feedback.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agreement-weighting idea could be tested on conversational data outside medicine, such as customer-service calls or classroom discussions, to see whether demographic or contextual variability persists.
  • Combining the ensemble with a small number of human checks on high-disagreement cases might further tighten performance without losing scalability.
  • If agreement patterns turn out to be stable across institutions, the method could serve as a lightweight calibration layer for other LLM applications that process dialogue.

Load-bearing premise

That group-level agreement patterns observed across multiple LLMs under query-only API constraints provide a reliable and generalizable way to correct for performance variability tied to patient race and visit segment.

What would settle it

Apply the same ensemble procedure to a fresh set of clinical transcripts stratified by patient race and visit segment and check whether accuracy and stability gains disappear or reverse compared with the best single model.

Figures

Figures reproduced from arXiv: 2505.04152 by Andrea Hartzler, Feng Chen, Manas Satish Bedmutha, Nadir Weibel, Trevor Cohen.

Figure 1
Figure 1. Figure 1: Percentage of high social signal labels distribution across 3-minute segments across the entire sample [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Histogram distribution of number of correct predictions per sample. We see that the correct predictions are normally [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Demographic Parity Ratio between white (n=74) and non-white (n=17) patients. We see that most configurations [PITH_FULL_IMAGE:figures/full_fig_p020_3.png] view at source ↗
read the original abstract

Effective patient-provider communication is difficult to assess at scale. We examine whether large language models (LLMs) can track 20 social behaviors from clinical transcripts without fine-tuning. Across three model families and multiple prompting strategies, LLMs reliably detect social signals, though performance varies by patient race and visit segment. To address this variability under query-only API constraints, we introduce an agreement-weighted ensemble using group-level agreement patterns. This approach improves both accuracy and stability over the best individual model, demonstrating a practical pathway for scalable social signal tracking in clinical conversations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines whether LLMs can detect 20 social behaviors in clinical transcripts without fine-tuning. Across three model families and prompting strategies, detection is reported as reliable but with performance varying by patient race and visit segment. To handle variability under query-only API constraints, the authors introduce an agreement-weighted ensemble derived from group-level agreement patterns, claiming this improves both accuracy and stability over the best single model.

Significance. If validated, the work provides a practical, no-fine-tuning method for large-scale social signal processing in healthcare conversations. The ensemble approach under API constraints is a useful engineering contribution for reproducibility in clinical NLP. Credit is due for the multi-model evaluation and explicit handling of demographic variability in results.

major comments (2)
  1. [§4.2] §4.2 (Ensemble Construction): The agreement-weighted ensemble is defined using observed cross-model label agreement as a proxy for reliability, but the manuscript provides no analysis showing that high agreement correlates with ground-truth accuracy rather than shared demographic biases across the three model families. This is load-bearing for the central claim that the ensemble corrects race- and segment-linked variability.
  2. [Results section] Results section, performance tables: Improvements from the ensemble over the best individual model are reported without statistical significance tests, confidence intervals, or ablation on agreement thresholds; given the noted variability by race, it is unclear whether the stability gains are robust or merely reflect correlated errors.
minor comments (2)
  1. [§3.1] The prompting strategy descriptions in §3.1 use inconsistent terminology for 'query-only' vs. 'contextual' variants; standardize notation for reproducibility.
  2. [Figure 2] Figure 2 (agreement heatmaps) lacks axis labels for visit segments; add explicit segment identifiers to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address the major concerns regarding the ensemble construction and the statistical analysis of results below. We have incorporated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.2] §4.2 (Ensemble Construction): The agreement-weighted ensemble is defined using observed cross-model label agreement as a proxy for reliability, but the manuscript provides no analysis showing that high agreement correlates with ground-truth accuracy rather than shared demographic biases across the three model families. This is load-bearing for the central claim that the ensemble corrects race- and segment-linked variability.

    Authors: We appreciate this observation. The original manuscript did not include an explicit correlation analysis between agreement levels and ground-truth accuracy. To address this, we have added an analysis in the revised §4.2 that computes the correlation between agreement scores and accuracy across demographic subgroups. The results show a positive correlation, suggesting that agreement serves as a reasonable proxy for reliability rather than solely reflecting shared biases. We also include a discussion of potential demographic biases in the models and how the ensemble approach helps mitigate variability observed by race and segment. revision: yes

  2. Referee: [Results section] Results section, performance tables: Improvements from the ensemble over the best individual model are reported without statistical significance tests, confidence intervals, or ablation on agreement thresholds; given the noted variability by race, it is unclear whether the stability gains are robust or merely reflect correlated errors.

    Authors: We agree that the presentation of results would benefit from statistical rigor. In the revised manuscript, we have added statistical significance tests (using McNemar's test for paired comparisons) between the ensemble and individual models, along with 95% confidence intervals for all reported metrics in the performance tables. Furthermore, we conducted an ablation study varying the agreement threshold and report the impact on performance and stability in a new supplementary figure. These additions confirm that the observed improvements are statistically significant and robust across different thresholds, rather than arising from correlated errors. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical LLM prompting and agreement-based ensemble

full rationale

The paper reports experimental results from prompting three LLM families on clinical transcripts to detect 20 social signals, notes observed performance variation by patient race and visit segment, and constructs an agreement-weighted ensemble from cross-model label agreement patterns. No equations, derivations, or predictions are present that reduce to inputs by construction. The ensemble is a post-hoc aggregation rule computed from observed data rather than a fitted parameter or self-referential definition. No load-bearing self-citations or uniqueness theorems are invoked for the core claims. The work is self-contained empirical evaluation against direct accuracy and stability metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on empirical testing of LLM prompting capabilities and the effectiveness of agreement-based aggregation rather than on new theoretical derivations or postulates.

axioms (1)
  • domain assumption LLMs can classify social behaviors in clinical text from instructions alone without domain-specific fine-tuning.
    The study tests this directly through prompting experiments across model families.

pith-pipeline@v0.9.0 · 5632 in / 1274 out tokens · 75611 ms · 2026-05-22T16:57:10.910051+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

    cs.CL 2026-03 unverdicted novelty 4.0

    Zero-shot GPT-OSS detects depression from 1,108 primary care encounter transcripts with AUPRC 0.51 and AUROC 0.77, with meaningful signals in the first 128 patient tokens and added value from dyadic mirroring.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    AHRQ. 2006. Effects of Establishing Focus in the Medical Interview (R01HS 013172 PI Lynne Robins). https://www.ahrq.gov/sites/ default/files/2024-07/robins-report.pdf Accessed October 9, 2024

  2. [2]

    Turki M Alanzi. 2023. Impact of ChatGPT on teleconsultants in healthcare: perceptions of healthcare experts in Saudi Arabia. Journal of multidisciplinary healthcare (2023), 2309–2321

  3. [3]

    John W Ayers, Adam Poliak, Mark Dredze, Eric C Leas, Zechariah Zhu, Jessica B Kelley, Dennis J Faix, Aaron M Goodman, Christopher A Longhurst, Michael Hogarth, et al. 2023. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA internal medicine 183, 6 (2023), 589–596

  4. [4]

    Emily Bascom, Reggie Casanova-Perez, Kelly Tobar, Manas Satish Bedmutha, Harshini Ramaswamy, Wanda Pratt, Janice Sabin, Brian Wood, Nadir Weibel, and Andrea Hartzler. 2024. Designing Communication Feedback Systems To Reduce Healthcare Providers’ Implicit Biases In Patient Encounters. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems

  5. [5]

    Manas Satish Bedmutha, Emily Bascom, Kimberly R Sladek, Kelly Tobar, Reggie Casanova-Perez, Alexandra Andreiu, Amrit Bhat, Sabrina Mangal, Brian R Wood, Janice Sabin, et al. 2024. Artificial intelligence-generated feedback on social signals in patient–provider communication: technical performance, feedback usability, and impact. JAMIA open 7, 4 (2024), ooae106

  6. [6]

    Manas Satish Bedmutha, Poorva Satish Bedmutha, and Nadir Weibel. 2023. Privacy-Aware Respiratory Symptom Detection in- the-wild. In Adjunct Proceedings of the 2023 ACM International Joint Conference on Pervasive and Ubiquitous Computing & the 2023 ACM International Symposium on Wearable Computing . Association for Computing Machinery, New York, NY, USA, 4...

  7. [7]

    Manas Satish Bedmutha, Amrit Bhat, Sabrina Mangal, Emily Bascom, Wanda Pratt, Brian Wood, Janice Sabin, Nadir Weibel, and Andrea Hartzler. 2023. Towards inferring implicit bias in clinical interactions using social signals. AMIA Annual Symposium. AI Showcase Stage III (2023)

  8. [8]

    Manas Satish Bedmutha, Anuujin Tsedenbal, Kelly Tobar, Sarah Borsotto, Kimberly R Sladek, Deepansha Singh, Reggie Casanova-Perez, Emily Bascom, Brian Wood, Janice Sabin, et al. 2024. ConverSense: An Automated Approach to Assess Patient-Provider Interactions using Social Signals. In Proceedings of the CHI Conference on Human Factors in Computing Systems . 1–22

  9. [9]

    Sudershan Boovaraghavan, Haozhe Zhou, Mayank Goel, and Yuvraj Agarwal. 2024. Kirigami: Lightweight speech filtering for privacy- preserving activity recognition using audio. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–28

  10. [10]

    Ryan L Boyd, Ashwini Ashokkumar, Sarah Seraj, and James W Pennebaker. 2022. The development and psychometric properties of LIWC-22. Austin, TX: University of Texas at Austin 10 (2022), 1–47

  11. [11]

    Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, Gregory Gelly, Pavel Korshunov, Marvin Lavechin, Diego Fustes, Hadrien Titeux, Wassim Bouaziz, and Marie-Philippe Gill. 2020. Pyannote. audio: neural building blocks for speaker diarization. In ICASSP 2020-2020 IEEE International conference on acoustics, speech and signal processing (ICASSP) . IEEE, 7124–7128

  12. [12]

    Feng Chen, Manas Satish Bedmutha, Ray-Yuan Chung, Janice Sabin, Wanda Pratt, Brian R Wood, Nadir Weibel, Andrea L Hartzler, and Trevor Cohen. 2024. Toward Automated Detection of Biased Social Signals from the Content of Clinical Conversations. arXiv preprint arXiv:2407.17477 (2024)

  13. [13]

    Wenqiang Chen, Jiaxuan Cheng, Leyao Wang, Wei Zhao, and Wojciech Matusik. 2024. Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 4 (2024), 1–26

  14. [14]

    Zhuang Chen, Jiawen Deng, Jinfeng Zhou, Jincenzi Wu, Tieyun Qian, and Minlie Huang. 2024. Depression detection in clinical interviews with LLM-empowered structural element graph. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) . 8181–8194

  15. [15]

    Bhawana Chhaglani, Camellia Zakaria, Adam Lechowicz, Jeremy Gummeson, and Prashant Shenoy. 2022. Flowsense: Monitoring airflow in building ventilation systems using audio sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, 1 (2022), 1–26

  16. [16]

    Georgios Chochlakis, Niyantha Maruthu Pandiyan, Kristina Lerman, and Shrikanth Narayanan. 2025. Larger language models don’t care how you think: Why chain-of-thought prompting fails in subjective tasks. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . IEEE, 1–5

  17. [17]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean,...

  18. [18]

    Lisa A Cooper, Debra L Roter, Kathryn A Carson, Mary Catherine Beach, Janice A Sabin, Anthony G Greenwald, and Thomas S Inui

  19. [19]

    The associations of clinicians’ implicit attitudes about race with medical visit communication and patient ratings of interpersonal , Vol. 1, No. 1, Article . Publication date: May 2025. LLMs and Social Behavior in Clinical Conversations • 29 care. American journal of public health 102, 5 (2012), 979–987

  20. [20]

    Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. arXiv preprint arXiv:2005.00547 (2020)

  21. [21]

    Zachary Englhardt, Chengqian Ma, Margaret E Morris, Chun-Cheng Chang, Xuhai" Orson" Xu, Lianhui Qin, Daniel McDuff, Xin Liu, Shwetak Patel, and Vikram Iyer. 2024. From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models. Proceedings of the ACM on Interactive, Mobile, Weara...

  22. [22]

    Kyle M Fargen, Timothy O’Connor, Steven Raymond, Justin M Sporrer, and William A Friedman. 2012. An observational study of hospital paging practices and workflow interruption among on-call junior neurological surgery residents. Journal of graduate medical education 4, 4 (2012), 467–471

  23. [23]

    Heather A Faucett, Matthew L Lee, and Scott Carter. 2017. I should listen more: real-time sensing and feedback of non-verbal communication in video telehealth. Proceedings of the ACM on Human-Computer Interaction 1, CSCW (2017), 1–19

  24. [24]

    Shutong Feng, Guangzhi Sun, Nurul Lubis, Wen Wu, Chao Zhang, and Milica Gašić. 2023. Affect recognition in conversations using large language models. arXiv preprint arXiv:2309.12881 (2023)

  25. [25]

    Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, and Marzyeh Ghassemi. 2024. Can AI relate: Testing large language model response for mental health support. arXiv preprint arXiv:2405.12021 (2024)

  26. [26]

    Declan Grabb. 2024. pSAE-chiatry: Utilizing Sparse Autoencoders to Uncover Mental-Health-Related Features in Language Models. In NeurIPS 2024 Workshop on Behavioral Machine Learning . https://openreview.net/forum?id=BODZDzpXUF

  27. [27]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  28. [28]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al

  29. [29]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  30. [30]

    Nao Hagiwara, Jennifer Elston Lafata, Briana Mezuk, Scott R Vrana, and Michael D Fetters. 2019. Detecting implicit racial bias in provider communication behaviors to reduce disparities in healthcare: challenges, solutions, and future directions for provider communication training. Patient education and counseling 102, 9 (2019), 1738–1743

  31. [31]

    AL Hartzler, RA Patel, M Czerwinski, W Pratt, A Roseway, N Chandrasekaran, and A Back. 2014. Real-time feedback on nonverbal clinical communication. Methods of information in medicine 53, 05 (2014), 389–405

  32. [32]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al . 2022. Lora: Low-rank adaptation of large language models. ICLR 1, 2 (2022), 3

  33. [33]

    Hannah La, Ziming Li, Ha-Kyung Kong, and Roshan L Peiris. 2025. Exploring the Efficacy of a Chatbot Training Application in Alleviating Graduate Students’ Public-Speaking Anxiety During Q&A. (2025)

  34. [34]

    Henry A Landsberger. 1958. Hawthorne Revisited: Management and the Worker, Its Critics, and Developments in Human Relations in Industry. (1958)

  35. [35]

    Virginia LeBaron, Tabor Flickinger, David Ling, Hansung Lee, James Edwards, Anant Tewari, Zhiyuan Wang, and Laura E Barnes. 2023. Feasibility and acceptability testing of CommSense: A novel communication technology to enhance health equity in clinician–patient interactions. Digital Health 9 (2023), 20552076231184991

  36. [36]

    Shanglin Lei, Guanting Dong, Xiaoping Wang, Keheng Wang, Runqi Qiao, and Sirui Wang. 2023. InstructERC: Reforming Emotion Recognition in Conversation with Multi-task Retrieval-Augmented Large Language Models. arXiv preprint arXiv:2309.11911 (2023)

  37. [37]

    Chunfeng Liu, Karen M Scott, Renee L Lim, Silas Taylor, and Rafael A Calvo. 2016. EQClinic: a platform for learning communication skills in clinical consultations. Medical education online 21, 1 (2016), 31801

  38. [38]

    Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM computing surveys 55, 9 (2023), 1–35

  39. [39]

    Xiaoou Liu, Tiejin Chen, Longchao Da, Chacha Chen, Zhen Lin, and Hua Wei. 2025. Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey. arXiv preprint arXiv:2503.15850 (2025)

  40. [40]

    Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and Shwetak Patel. 2023. Large language models are few-shot health learners. arXiv preprint arXiv:2305.15525 (2023)

  41. [41]

    Antoine Lizée, Pierre-Auguste Beaucoté, James Whitbeck, Marion Doumeingts, Anaël Beaugnon, and Isabelle Feldhaus. 2024. Conversa- tional Medical AI: Ready for Practice. arXiv preprint arXiv:2411.12808 (2024)

  42. [42]

    Man Luo, Christopher J Warren, Lu Cheng, Haidar M Abdul-Muhsin, and Imon Banerjee. 2024. Assessing empathy in large language models with real-world physician-patient interactions. In 2024 IEEE International Conference on Big Data (BigData) . IEEE, 6510–6519

  43. [43]

    Cheng Charles Ma, Kevin Hyekang Joo, Alexandria K Vail, Sunreeta Bhattacharya, Álvaro Fernández García, Kailana Baker-Matsuoka, Sheryl Mathew, Lori L Holt, and Fernando De la Torre. 2024. Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation. arXiv preprint arXiv:2409.09135 (2024)

  44. [44]

    Abdulqadir J Nashwan, Ahmad A Abujaber, and Hassan Choudry. 2023. Embracing the future of physician-patient communication: GPT-4 in gastroenterology. Gastroenterology & Endoscopy 1, 3 (2023), 132–135. , Vol. 1, No. 1, Article . Publication date: May 2025. 30 • Manas Satish Bedmutha, Feng Chen, Andrea Hartzler, Trevor Cohen, and Nadir Weibel

  45. [45]

    Junghwan Park, Meelim Kim, Mohamed El Mistiri, Rachael Kha, Sarasij Banerjee, Lisa Gotzian, Guillaume Chevance, Daniel E Rivera, Predrag Klasnja, Eric Hekler, et al . 2023. Advancing understanding of just-in-time states for supporting physical activity (Project JustWalk JITAI): protocol for a System ID study of just-in-time adaptive interventions. JMIR Re...

  46. [46]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI (2022). https://openai.com/research/whisper

  47. [47]

    Jeffrey D Robinson. 2003. An interactional structure of medical activities during acute visits and its implications for patients’ participation. Health communication 15, 1 (2003), 27–59

  48. [48]

    Jeffrey D Robinson and John Heritage. 2006. Physicians’ opening questions and patients’ satisfaction.Patient education and counseling 60, 3 (2006), 279–285

  49. [49]

    Debra Roter and Susan Larson. 2002. The Roter interaction analysis system (RIAS): utility and flexibility for analysis of medical interactions. Patient education and counseling 46, 4 (2002), 243–251

  50. [50]

    Debra L Roter, Judith A Hall, Danielle Blanch-Hartigan, Susan Larson, and Richard M Frankel. 2011. Slicing it thin: new methods for brief sampling analysis using RIAS-coded medical dialogue. Patient education and counseling 82, 3 (2011), 410–419

  51. [51]

    Philip Sedgwick and Nan Greenwood. 2015. Understanding the Hawthorne effect. Bmj 351 (2015)

  52. [52]

    Raj Sanjay Shah, Faye Holt, Shirley Anugrah Hayati, Aastha Agarwal, Yi-Chia Wang, Robert E Kraut, and Diyi Yang. 2022. Modeling motivational interviewing strategies on an online peer-to-peer counseling platform. Proceedings of the ACM on Human-Computer Interaction 6, CSCW2 (2022), 1–24

  53. [53]

    Vera Sorin, Dana Brin, Yiftach Barash, Eli Konen, Alexander Charney, Girish Nadkarni, and Eyal Klang. 2024. Large Language Models and Empathy: Systematic Review. Journal of Medical Internet Research 26 (2024), e52597

  54. [54]

    Ian Steenstra, Farnaz Nouraei, and Timothy W Bickmore. 2025. Scaffolding Empathy: Training Counselors with Simulated Patients and Utterance-level Performance Visualizations. arXiv preprint arXiv:2502.18673 (2025)

  55. [55]

    Richard L Street Jr, Howard Gordon, and Paul Haidet. 2007. Physicians’ communication and perceptions of patients: is it how they look, how they talk, or is it just the doctor? Social science & medicine 65, 3 (2007), 586–598

  56. [56]

    G. Swain. 2024. Patients may suffer from hallucinations of AI Medical Transcription Tools. CIO (2024). https://www.cio.com/article/ 3593403/patients-may-suffer-from-hallucinations-of-ai-medical-transcription-tools.html

  57. [57]

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118 (2024)

  58. [58]

    Aaron A Tierney, Gregg Gayre, Brian Hoberman, Britt Mattern, Manuel Ballesca, Patricia Kipnis, Vincent Liu, and Kristine Lee. 2024. Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catalyst Innovations in Care Delivery 5, 3 (2024), CAT–23

  59. [59]

    Tao Tu, Mike Schaekermann, Anil Palepu, Khaled Saab, Jan Freyberg, Ryutaro Tanno, Amy Wang, Brenna Li, Mohamed Amin, Yong Cheng, et al. 2025. Towards conversational diagnostic artificial intelligence. Nature (2025), 1–9

  60. [60]

    Equal Employment Opportunity Commission

    U.S. Equal Employment Opportunity Commission. 1978. Uniform Guidelines on Employee Selection Procedures. https://www.eeoc.gov/laws/guidance/uniform-guidelines-employment-selection-procedures. Federal Register, Volume 43, Number 138, July 20, 1978

  61. [61]

    Alexandria Vail, Jeffrey Girard, Lauren Bylsma, Jeffrey Cohn, Jay Fournier, Holly Swartz, and Louis-Philippe Morency. 2022. Toward causal understanding of therapist-client relationships: A study of language modality and social entrainment. In Proceedings of the 2022 International Conference on Multimodal Interaction . 487–494

  62. [62]

    Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and vision computing 27, 12 (2009), 1743–1759

  63. [63]

    Aditya B Vishwanath, Vijay Kumar Srinivasalu, and Narayana Subramaniam. 2024. Role of large language models in improving provider–patient experience and interaction efficiency: A scoping review. Artificial Intelligence in Health (2024), 4808

  64. [64]

    Quan Wang, Yiling Huang, Guanlong Zhao, Evan Clark, Wei Xia, and Hank Liao. 2024. Diarizationlm: Speaker diarization post-processing with large language models. arXiv preprint arXiv:2401.03506 (2024)

  65. [65]

    Zhiyuan Wang, Nusayer Hassan, Virginia LeBaron, Tabor Flickinger, David Ling, James Edwards, Congyu Wu, Mehdi Boukhechba, and Laura E Barnes. 2024. CommSense: A Wearable Sensing Computational Framework for Evaluating Patient-Clinician Interactions. Proceedings of the ACM on Human-Computer Interaction 8, CSCW2 (2024), 1–31

  66. [66]

    Jocelyn White, Wendy Levinson, and Debra Roter. 1994. Oh, by the way. . . The closing moments of the medical visit. Journal of General Internal Medicine 9 (1994), 24–28

  67. [67]

    Zehui Wu, Ziwei Gong, Lin Ai, Pengyuan Shi, Kaan Donbekci, and Julia Hirschberg. 2024. Beyond silent letters: Amplifying llms in emotion recognition with vocal nuances. arXiv preprint arXiv:2407.21315 (2024)

  68. [68]

    Xuhai Xu, Bingsheng Yao, Yuanzhe Dong, Saadia Gabriel, Hong Yu, James Hendler, Marzyeh Ghassemi, Anind K Dey, and Dakuo Wang

  69. [69]

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–32

    Mental-llm: Leveraging large language models for mental health prediction via online text data. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 1 (2024), 1–32. , Vol. 1, No. 1, Article . Publication date: May 2025. LLMs and Social Behavior in Clinical Conversations • 31

  70. [70]

    Haoning Xue, Wang Liao, and Jingwen Zhang. 2024. Interaction dynamics of social support expressions predict future support-seeking behaviors in online support groups. Computers in Human Behavior 156 (2024), 108224

  71. [71]

    Kailai Yang, Tianlin Zhang, Ziyan Kuang, Qianqian Xie, and Sophia Ananiadou. 2023. MentalLLaMA: Interpretable Mental Health Analysis on Social Media with Large Language Models. arXiv preprint arXiv:2309.13567 (2023)

  72. [72]

    Ziqi Yang, Xuhai Xu, Bingsheng Yao, Ethan Rogers, Shao Zhang, Stephen Intille, Nawar Shara, Guodong Gordon Gao, and Dakuo Wang

  73. [73]

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1–35

    Talk2care: An llm-based voice assistant for communication between healthcare providers and older adults. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1–35

  74. [74]

    Travis Zack, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, David W Bates, Raja-Elie E Abdulnour, et al. 2023. Coding inequity: assessing GPT-4’s potential for perpetuating racial and gender biases in healthcare. medRxiv (2023), 2023–07

  75. [75]

    Did you see any presence of signal_name in this slice?

    Maxime Zanella and Ismail Ben Ayed. 2024. On the test-time zero-shot generalization of vision-language models: Do we really need prompt learning?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 23783–23793. , Vol. 1, No. 1, Article . Publication date: May 2025. 32 • Manas Satish Bedmutha, Feng Chen, Andrea Hartzler,...