Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs
Pith reviewed 2026-05-10 00:48 UTC · model grok-4.3
The pith
Large language models achieve closest alignment with physicians when their outputs are collaboratively rewritten rather than generated directly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The study finds that baseline LLMs amplify affective polarity relative to physicians and, in larger architectures, produce higher linguistic complexity (FKGL up to 17.60 versus 11.47-12.50 for physician-authored responses). Empathy-oriented prompting reduces extreme negativity and grade-level complexity but does not significantly increase semantic fidelity. Collaborative rewriting, especially in rephrase configurations, achieves the highest semantic similarity to physician answers (mean up to 0.93) while improving readability and reducing affective extremity. Dual-stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, but patients prefer rewritten variants for clarity and emotional tone. LLMs therefore function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
What carries the argument
A multidimensional evaluation that measures semantic fidelity, readability (Flesch-Kincaid Grade Level, FKGL), and affective resonance, with collaborative rewriting emerging as the top-performing alignment strategy.
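Because the readability comparison rests on FKGL, it may help to see how that score is computed. The sketch below uses the standard Flesch-Kincaid Grade Level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The syllable counter is a rough vowel-group heuristic and the two sample sentences are invented for illustration; neither comes from the paper, whose exact tooling is not stated on this page.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption): count runs of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("take"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level of a text, using the standard coefficients."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


# Invented examples: a plain instruction and a jargon-heavy one.
plain = "Take one pill each day. Call us if you feel sick."
dense = (
    "Administration of the prescribed anticoagulant necessitates periodic "
    "laboratory monitoring to mitigate hemorrhagic complications."
)

print(f"plain: {fkgl(plain):.2f}")
print(f"dense: {fkgl(dense):.2f}")
```

Short sentences with short words score low, while a single long sentence of polysyllabic terms scores much higher, which is the kind of gap the reported FKGL figures describe.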
If this is right
- LLMs can serve as tools to refine medical explanations and improve patient comprehension.
- Rewritten outputs may lower risks of miscommunication in clinical interactions.
- Larger models stand to gain the most from rephrasing, through reduced linguistic complexity and affective extremity.
- Physicians remain central for epistemic accuracy, pointing to hybrid workflows.
- Patient preference for rewritten variants may improve satisfaction and treatment adherence.
Where Pith is reading between the lines
- Systems could embed automatic rewriting steps benchmarked against physician data to optimize communication.
- The results point toward hybrid human-AI setups where AI handles clarity and tone while humans ensure facts.
- Live clinical trials tracking health outcomes would test whether metric gains produce real-world benefits.
Load-bearing premise
Semantic similarity scores, FKGL readability, and affective polarity metrics sufficiently capture clinical alignment, and the sampled physician responses form an unbiased representative gold standard.
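To illustrate why this premise is load-bearing, here is a minimal sketch of a similarity score. It assumes cosine similarity over vectors, which is a common choice but is not confirmed on this page as the paper's method, and it substitutes a toy bag-of-words vector for a real sentence encoder. The three sentences are invented. Real embedding models will differ in degree, but the example shows how a high similarity score can coexist with reversed clinical meaning.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Toy stand-in for an embedding (an assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sentences: a reference answer, a harmless rewrite, and a negated one.
physician = "Take the antibiotic with food twice a day."
rewrite = "Take the antibiotic twice a day with food."
unsafe = "Do not take the antibiotic with food twice a day."

sim_rewrite = cosine_similarity(bow_vector(physician), bow_vector(rewrite))
sim_unsafe = cosine_similarity(bow_vector(physician), bow_vector(unsafe))

print(f"rewrite vs physician: {sim_rewrite:.3f}")
print(f"negated vs physician: {sim_unsafe:.3f}")
```

The negated instruction shares almost all of its words with the reference, so it still scores close to the harmless rewrite. A similarity score alone therefore cannot stand in for a check of clinical correctness.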
What would settle it
A study directly measuring patient comprehension, adherence, or satisfaction after receiving physician-written, raw LLM, or rewritten LLM explanations; if rewritten versions fail to match or exceed physician results in these outcomes, the alignment advantage would not hold.
Original abstract
Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates general-purpose and domain-specialized LLMs on semantic fidelity, readability (FKGL), and affective resonance in structured medical explanations and real-world physician-patient interactions. Baseline models show amplified affective polarity (43.14-45.10% Very Negative vs. 37.25% for physicians) and higher linguistic complexity (FKGL up to 17.60 vs. 11.47-12.50); empathy-oriented prompting reduces extremity and complexity but not semantic fidelity; collaborative rewriting (rephrase configurations) achieves the highest semantic similarity to physician answers (mean up to 0.93), improves readability, reduces affective extremity, and is preferred by patients for clarity and tone, while no model surpasses physicians on epistemic criteria. The paper concludes that LLMs are most effective as collaborative communication enhancers rather than replacements.
Significance. If the results hold after addressing methodological gaps, the work provides a useful empirical benchmark for LLM deployment in clinical communication, with the dual-stakeholder (physician and patient) evaluation adding practical value and the finding that rephrasing improves alignment offering a concrete, actionable insight for human-AI collaboration in healthcare. The multidimensional design (semantic, readability, affective) is a strength relative to single-metric studies.
major comments (3)
- [Abstract] Abstract: the headline quantitative claims (semantic similarity mean = 0.93 for rephrase configs, FKGL reductions up to -6.87, affective polarity comparisons) are reported without sample sizes, statistical tests, confidence intervals, or power analysis, making it impossible to assess whether the differences are reliable or whether the conclusion that 'no model surpasses physicians on epistemic criteria' is supported.
- [Abstract] Abstract: semantic similarity, FKGL, and affective polarity are used as primary proxies for clinical alignment and empathy without reported validation against clinical accuracy, factual correctness, patient comprehension, or expert judgment of appropriate empathy; these metrics are insensitive to omissions, unsafe advice, or performative vs. genuine tone, which directly underpins the central claim that collaborative rewriting yields the strongest alignment.
- [Abstract] Abstract: the physician responses are treated as an unbiased gold standard for semantic similarity and epistemic evaluation, yet no details are provided on sampling frame, inter-physician variance, or controls for selection bias, which is load-bearing for the claim that patients prefer rewritten variants while physicians do not rate any model higher.
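The first comment asks for confidence intervals around the headline means. One standard way to provide them, sketched here under stated assumptions, is a percentile bootstrap over per-item scores. The function name `bootstrap_mean_ci` and the score values are hypothetical placeholders, not data from the paper, and the paper may use a different procedure.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-item similarity scores (placeholders, not from the paper).
scores = [0.81, 0.88, 0.92, 0.79, 0.95, 0.90, 0.86, 0.93, 0.84, 0.89]

low, high = bootstrap_mean_ci(scores)
print(f"mean={statistics.fmean(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Reporting an interval like this next to each mean, together with the number of items, would let readers judge whether the differences between configurations are larger than the sampling noise.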
Simulated Author's Rebuttal
We are grateful to the referee for their insightful comments on our manuscript. These have prompted us to clarify several aspects of our methodology and strengthen the presentation of our results. We address each major comment in turn below.
- Referee: [Abstract] Abstract: the headline quantitative claims (semantic similarity mean = 0.93 for rephrase configs, FKGL reductions up to -6.87, affective polarity comparisons) are reported without sample sizes, statistical tests, confidence intervals, or power analysis, making it impossible to assess whether the differences are reliable or whether the conclusion that 'no model surpasses physicians on epistemic criteria' is supported.
  Authors: We agree that the abstract would be improved by including these details. The study is based on a collection of real-world physician-patient interactions, with quantitative results supported by statistical comparisons in the main text. In the revised manuscript, we will update the abstract to specify the sample size, indicate that the reported differences were evaluated with appropriate statistical tests, and include confidence intervals for the primary metrics. A formal power analysis was not performed, as the analysis was conducted on an existing dataset rather than a prospective experiment; we will note this in the revision.
  revision: yes
- Referee: [Abstract] Abstract: semantic similarity, FKGL, and affective polarity are used as primary proxies for clinical alignment and empathy without reported validation against clinical accuracy, factual correctness, patient comprehension, or expert judgment of appropriate empathy; these metrics are insensitive to omissions, unsafe advice, or performative vs. genuine tone, which directly underpins the central claim that collaborative rewriting yields the strongest alignment.
  Authors: This is a valid concern. Our work specifically targets the communicative qualities of responses (semantic alignment with physician language, readability, and affective tone) rather than clinical decision support or factual medical accuracy. We chose these metrics because they are quantifiable and have precedent in NLP for text evaluation. However, we recognize their limitations in detecting unsafe content or distinguishing performative empathy. We will revise the abstract to better contextualize the scope of our claims and expand the Limitations section to discuss these proxy limitations and the importance of human oversight for safety-critical applications.
  revision: yes
- Referee: [Abstract] Abstract: the physician responses are treated as an unbiased gold standard for semantic similarity and epistemic evaluation, yet no details are provided on sampling frame, inter-physician variance, or controls for selection bias, which is load-bearing for the claim that patients prefer rewritten variants while physicians do not rate any model higher.
  Authors: We appreciate the referee drawing attention to this methodological detail. The physician responses were obtained from a dataset of authentic clinical interactions, and the Methods section describes the data collection process. To address the concern, we will add more explicit information on the sampling frame (e.g., number of unique physicians and cases), any available measures of inter-physician variability, and our approach to mitigating selection bias through case diversity. This will provide better context for interpreting the gold standard comparisons.
  revision: yes
Circularity Check
No significant circularity: pure empirical comparison to external physician baselines
The paper conducts a multidimensional empirical evaluation of LLMs against physician-authored responses using semantic similarity, FKGL readability, affective polarity counts, and dual-stakeholder ratings. No derivations, equations, fitted parameters, or first-principles claims appear; all reported results (e.g., rephrase configurations reaching mean semantic similarity 0.93) are direct measurements on held-out data. No self-citations are invoked to establish uniqueness theorems or to smuggle ansatzes, and no quantity is redefined as a prediction of itself. The analysis is therefore self-contained against external benchmarks with no load-bearing reductions to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Physician-authored responses represent the gold standard for clinical communication alignment