Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization
Pith reviewed 2026-05-18 11:29 UTC · model grok-4.3
The pith
Anchored to clinician-authored reference answers, automated metrics rank AI responses to patient hospitalization questions in close agreement with human experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 100 patient cases, responses from 28 AI systems were assessed along three dimensions: whether a system response answers the question, appropriately uses clinical note evidence, and uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings.
What carries the argument
Clinician-authored reference answers that anchor automated evaluation metrics, allowing them to align with human judgments on the three dimensions of response quality.
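To make reference-anchored ranking concrete, here is a minimal sketch. It assumes a simple token-overlap F1 as the metric, chosen purely for illustration and not necessarily among the metrics the paper evaluated. The function names `token_f1` and `rank_systems` and the toy data are hypothetical.

```python
from collections import Counter
from statistics import mean


def token_f1(response: str, reference: str) -> float:
    """Token-overlap F1 between a system response and a reference answer."""
    resp_tokens = response.lower().split()
    ref_tokens = reference.lower().split()
    if not resp_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection: a shared token counts only as often as it
    # appears in both texts.
    overlap = sum((Counter(resp_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(resp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def rank_systems(responses: dict[str, list[str]], references: list[str]) -> list[str]:
    """Order systems from best to worst by mean per-case score."""
    scores = {
        name: mean(token_f1(out, ref) for out, ref in zip(outputs, references))
        for name, outputs in responses.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): two patient cases, two systems.
references = [
    "the fever was caused by an infection",
    "you were given antibiotics",
]
responses = {
    "system_a": ["the fever was caused by an infection", "you were given antibiotics"],
    "system_b": ["unclear", "you were given fluids"],
}
ranking = rank_systems(responses, references)
```

The ranking produced this way is what would then be compared with the ordering implied by human ratings. A real study would likely use stronger metrics and score each of the three dimensions separately.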
If this is right
- Automated evaluation can scale comparative testing of many AI systems for patient questions without requiring large amounts of clinician time.
- Developers can more quickly identify which AI tools produce responses that align with expert standards on answering, evidence use, and medical knowledge.
- Improved AI responses could support clearer patient-clinician communication in hospitalization settings.
Where Pith is reading between the lines
- The same reference-anchored approach could be tested on outpatient or chronic-care questions to see if it generalizes beyond hospitalization.
- If the match holds, regulators or hospitals might use these automated rankings as an initial filter before human review of AI tools for patient use.
- Future checks could examine whether high-ranking systems under these metrics actually improve patient understanding or reduce follow-up questions.
Load-bearing premise
That the three chosen dimensions and the 100 patient cases are enough to show whether automated evaluation can reliably tell good AI responses from bad ones for real patient safety and communication.
What would settle it
A fresh collection of patient cases or a different set of evaluation dimensions where rankings produced by reference-anchored automated metrics no longer match the order given by human clinicians.
Original abstract
Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses--human expert review--is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an empirical study of automated evaluation for AI-generated responses to patient questions about hospitalization. Across 100 patient cases, responses were collected from 28 AI systems (2800 total) and scored along three dimensions—answering the question, using clinical note evidence, and using general medical knowledge—using clinician-authored reference answers to anchor the metrics. The central finding is that the resulting automated rankings closely match independent human ratings, supporting the feasibility of scaling comparative assessment of AI systems without sole reliance on labor-intensive expert review.
Significance. If the alignment holds under reliable human anchors, the work provides a practical path to scalable, reproducible evaluation of patient-facing AI responses, which could accelerate development and deployment of systems that support clinician-patient communication. The large scale (2800 responses) and explicit multi-dimensional design are empirical strengths that distinguish this from smaller or single-metric studies.
major comments (2)
- [Human evaluation procedure (Methods)] The manuscript provides no inter-rater agreement statistics (e.g., Cohen’s kappa, Fleiss’ kappa, or intraclass correlation) for the human ratings that serve as the validation anchor. Because the claim that automated rankings “closely matched human ratings” rests on the stability of those ratings, the absence of reliability metrics leaves open the possibility that observed alignment reflects shared noise rather than true validity.
- [Results] The abstract and results state that automated rankings “closely matched” human ratings but supply no quantitative correlation values (Spearman rho, Kendall tau, or rank correlation threshold) or statistical tests for the match. Without these numbers it is impossible to assess whether the alignment is strong enough to support the conclusion that automated evaluation can reliably distinguish good from bad responses.
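The rank correlations requested in the second major comment can be sketched as follows. This is an illustrative pure-Python version, not the paper's analysis code: it computes Kendall's tau-a and Spearman's rho under the assumption that there are no tied scores, and the per-system scores below are invented toy values.

```python
from itertools import combinations


def kendall_tau(x: list[float], y: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def ranks(values: list[float]) -> list[int]:
    """1-based ascending ranks; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rho via the rank-difference formula (valid only without ties)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))


# Toy per-system scores (hypothetical): mean human rating and mean metric score.
human = [4.5, 3.9, 3.1, 2.0, 1.2]
metric = [0.81, 0.62, 0.70, 0.40, 0.35]
tau = kendall_tau(human, metric)
rho = spearman_rho(human, metric)
```

With 28 systems and averaged scores, ties are plausible, so a tie-corrected variant (tau-b, or rho computed on average ranks) from an established statistics library would be the safer choice in practice.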
minor comments (1)
- [Abstract] The abstract mentions “exact metric implementations, statistical tests, potential data exclusions, or error analysis” only in passing; a brief summary of these choices in the abstract would improve readability for readers deciding whether to examine the full methods.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of methodological transparency that will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
- Referee: [Human evaluation procedure (Methods)] The manuscript provides no inter-rater agreement statistics (e.g., Cohen’s kappa, Fleiss’ kappa, or intraclass correlation) for the human ratings that serve as the validation anchor. Because the claim that automated rankings “closely matched human ratings” rests on the stability of those ratings, the absence of reliability metrics leaves open the possibility that observed alignment reflects shared noise rather than true validity.
  Authors: We agree that inter-rater reliability metrics are essential for validating the human ratings that anchor our comparisons. Although the current manuscript does not report these statistics, we will compute and include Fleiss’ kappa (or intraclass correlation coefficients where appropriate) for the three evaluation dimensions in the revised Methods and Results sections. This addition will allow readers to directly assess the stability of the human judgments.
  Revision: yes
- Referee: [Results] The abstract and results state that automated rankings “closely matched” human ratings but supply no quantitative correlation values (Spearman rho, Kendall tau, or rank correlation threshold) or statistical tests for the match. Without these numbers it is impossible to assess whether the alignment is strong enough to support the conclusion that automated evaluation can reliably distinguish good from bad responses.
  Authors: We acknowledge that the current description of alignment is qualitative. In the revised manuscript we will report quantitative rank correlations (Spearman rho and Kendall tau) between the automated metric rankings and the human ratings for each of the three dimensions, along with associated statistical tests and confidence intervals. The abstract will be updated to include these specific values so that the strength of the match can be evaluated objectively.
  Revision: yes
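The Fleiss' kappa statistic that the rebuttal commits to reporting can be sketched as below. This is a minimal pure-Python version of the standard formula, assuming every item is rated by the same number of raters. The rating table is invented toy data, and the function name `fleiss_kappa` is a hypothetical helper rather than anything from the paper.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same
    number of raters (at least 2)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of raters (>= 2)")
    n_categories = len(counts[0])

    # Share of all assignments that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    # Note: undefined (division by zero) if all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)


# Toy table (hypothetical): 4 responses, 3 raters, 2 categories (good, bad).
ratings = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(ratings)
```

In the paper's setting, one such table would presumably be built per evaluation dimension. If the human ratings are ordinal rather than categorical, an intraclass correlation or a weighted agreement statistic may be more appropriate, as the authors note.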
Circularity Check
Empirical head-to-head comparison with independent human ratings; no derivation chain present
The paper reports an empirical study: 2800 responses from 28 AI systems on 100 patient cases were scored by humans and by automated metrics anchored to clinician-authored reference answers along three explicit dimensions. Automated rankings are then compared directly to those human ratings. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked as load-bearing steps in any derivation. The central claim rests on an external benchmark (human expert ratings) that is independent of the automated metrics being evaluated, satisfying the criteria for a self-contained empirical result.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Clinician-authored reference answers are accurate and comprehensive anchors for evaluating response quality.
- domain assumption: The three evaluation dimensions capture the essential aspects of response quality for patient questions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: Figure 1 shows Kendall’s τ correlations between system rankings induced by automated metrics and by human judgments.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.