Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

Dina Demner-Fushman; Sarvesh Soni

arxiv: 2510.00436 · v2 · submitted 2025-10-01 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-18 11:29 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords: automated evaluation, AI responses, patient questions, hospitalization, reference answers, human ratings, clinical notes, medical knowledge

The pith

When anchored to clinician-written reference answers, automated metrics rank AI systems that answer patient hospitalization questions in close agreement with human expert ratings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether machines can evaluate free-text AI answers to real patient questions about hospital stays without needing a doctor to read every response. It collected answers from 28 different AI systems on 100 patient cases and scored them on three points: does the answer address the question, does it correctly use information from the patient's clinical notes, and does it draw on general medical knowledge. By anchoring the automated scores to reference answers written by clinicians, the machine rankings lined up closely with human expert judgments. A sympathetic reader would care because manual review by clinicians is too slow to test many AI tools, so reliable automation could let developers quickly find which systems are safe and helpful for patients talking with doctors.

Core claim

Across 100 patient cases, responses from 28 AI systems were assessed along three dimensions: whether a system response answers the question, appropriately uses clinical note evidence, and uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings.
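To make "automated rankings closely matched human ratings" concrete, the sketch below shows one way to compare two system-level rankings with Kendall's τ, the statistic named in the Figure 1 caption. The system names, the scores, and the plain per-system mean are illustrative assumptions. The paper aggregates human judgments with the Pyramid and MACE schemes, which this sketch does not implement.

```python
from itertools import combinations
from statistics import mean


def system_means(scores_by_system):
    """Collapse per-case scores into one mean score per system."""
    return {name: mean(scores) for name, scores in scores_by_system.items()}


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / all pairs; ties count as neither."""
    systems = sorted(x)
    concordant = discordant = 0
    for a, b in combinations(systems, 2):
        product = (x[a] - x[b]) * (y[a] - y[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-case scores for four made-up systems (not data from the paper).
human = {"sys_a": [0.9, 0.8, 0.7], "sys_b": [0.6, 0.7, 0.5],
         "sys_c": [0.4, 0.5, 0.3], "sys_d": [0.2, 0.1, 0.3]}
metric = {"sys_a": [0.71, 0.65, 0.60], "sys_b": [0.40, 0.45, 0.38],
          "sys_c": [0.50, 0.48, 0.47], "sys_d": [0.10, 0.15, 0.12]}

tau = kendall_tau_a(system_means(human), system_means(metric))
print(f"Kendall's tau between metric and human rankings: {tau:.2f}")
```

In this toy data the metric swaps only `sys_b` and `sys_c`, so five of the six system pairs are ordered the same way under both rankings.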

What carries the argument

Clinician-authored reference answers that anchor automated evaluation metrics, allowing them to align with human judgments on the three dimensions of response quality.
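One simple way to picture a reference-anchored metric is token-overlap F1 between a system response and a clinician-written reference. The sketch below is a stand-in for illustration only: the crude tokenizer, the function name `unigram_f1`, and the example sentences are assumptions, and the paper's actual metrics are not reproduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(response, reference):
    """Harmonic mean of unigram precision and recall against one reference answer."""
    resp, ref = Counter(tokenize(response)), Counter(tokenize(reference))
    overlap = sum((resp & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(resp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Made-up texts, not taken from the paper's dataset.
reference = "The patient received antibiotics for pneumonia and improved."
good = "Antibiotics were given for pneumonia and the patient improved."
bad = "Please contact the billing office."

print(unigram_f1(good, reference), unigram_f1(bad, reference))
```

A surface-overlap score like this rewards shared wording rather than clinical correctness, which is one reason agreement with human ratings has to be checked empirically rather than assumed.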

If this is right

Automated evaluation can scale comparative testing of many AI systems for patient questions without requiring large amounts of clinician time.
Developers can more quickly identify which AI tools produce responses that align with expert standards on answering, evidence use, and medical knowledge.
Improved AI responses could support clearer patient-clinician communication in hospitalization settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reference-anchored approach could be tested on outpatient or chronic-care questions to see if it generalizes beyond hospitalization.
If the match holds, regulators or hospitals might use these automated rankings as an initial filter before human review of AI tools for patient use.
Future checks could examine whether high-ranking systems under these metrics actually improve patient understanding or reduce follow-up questions.

Load-bearing premise

That the three chosen dimensions and the 100 patient cases are enough to show whether automated evaluation can reliably tell good AI responses from bad ones for real patient safety and communication.

What would settle it

A fresh collection of patient cases or a different set of evaluation dimensions where rankings produced by reference-anchored automated metrics no longer match the order given by human clinicians.

Figures

Figures reproduced from arXiv: 2510.00436 by Dina Demner-Fushman, Sarvesh Soni.

**Figure 1.** Kendall’s τ correlations between system rankings induced by automated metrics (rows) and human-judgment rankings under two rank-aggregation schemes (Pyramid and MACE) across the three evaluation dimensions. Cells encode τ (-1 to +1; blue → red). Rows are grouped and color-coded by metric type (legend) and are ordered by τ for answers-question under the Pyramid scheme. Metrics tagged (human) use clinician-a…

**Figure 2.** Agreement between Pyramid and MACE rankings. Each panel (rows: …

**Figure 3.** Case-level performance across systems using the Pyramid scheme. Top row: histograms of across-system mean scores for each case on the three dimensions (answers-question, uses-evidence, uses-knowledge); vertical dashed lines mark the first (Q1) and third (Q3) quartiles. Bottom row: per-case mean score (higher = easier) versus across-system standard deviation, with points colored by difficulty level (scatter…

Original abstract

Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating the free-text artificial intelligence (AI) responses--human expert review--is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Automated metrics anchored on clinician references produce rankings that match human ratings for AI answers to hospitalization questions, but the human ratings lack reported reliability checks.

The main takeaway is that this paper demonstrates automated evaluation can scale for comparing AI responses to patient questions about hospitalization. Using clinician-authored references to ground the metrics, the automated rankings aligned closely with human ratings across 2800 responses from 28 systems on 100 cases, scored on three dimensions: answering the question, using clinical note evidence, and drawing on general medical knowledge. This is a solid empirical head-to-head in a targeted clinical area that directly tackles the bottleneck of slow expert review.

Referee Report

2 major / 1 minor

Summary. The paper reports an empirical study of automated evaluation for AI-generated responses to patient questions about hospitalization. Across 100 patient cases, responses were collected from 28 AI systems (2800 total) and scored along three dimensions—answering the question, using clinical note evidence, and using general medical knowledge—using clinician-authored reference answers to anchor the metrics. The central finding is that the resulting automated rankings closely match independent human ratings, supporting the feasibility of scaling comparative assessment of AI systems without sole reliance on labor-intensive expert review.

Significance. If the alignment holds under reliable human anchors, the work provides a practical path to scalable, reproducible evaluation of patient-facing AI responses, which could accelerate development and deployment of systems that support clinician-patient communication. The large scale (2800 responses) and explicit multi-dimensional design are empirical strengths that distinguish this from smaller or single-metric studies.

major comments (2)

[Human evaluation procedure (Methods)] The manuscript provides no inter-rater agreement statistics (e.g., Cohen’s kappa, Fleiss’ kappa, or intraclass correlation) for the human ratings that serve as the validation anchor. Because the claim that automated rankings “closely matched human ratings” rests on the stability of those ratings, the absence of reliability metrics leaves open the possibility that observed alignment reflects shared noise rather than true validity.
[Results] The abstract and results state that automated rankings “closely matched” human ratings but supply no quantitative correlation values (Spearman rho, Kendall tau, or rank correlation threshold) or statistical tests for the match. Without these numbers it is impossible to assess whether the alignment is strong enough to support the conclusion that automated evaluation can reliably distinguish good from bad responses.
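The reliability check requested in the first comment could be computed along the following lines. This is a minimal pure-Python sketch of Fleiss' kappa, assuming every response is rated by the same number of raters on a categorical scale. The toy count table is invented; the paper's actual rating scale and number of raters are not given on this page.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every row must sum to the same n.
    Undefined (division by zero) if all ratings fall in a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Observed agreement: mean over items of the share of agreeing rater pairs.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)

    return (p_obs - p_exp) / (1 - p_exp)


# Invented example: 4 responses, 3 raters, categories (no, partial, yes).
ratings = [
    [3, 0, 0],
    [0, 3, 0],
    [0, 0, 3],
    [1, 1, 1],
]
print(round(fleiss_kappa(ratings), 3))
```

Three of the four toy responses have unanimous ratings and one has complete disagreement, which gives observed agreement of 0.75 against chance agreement of 1/3.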

minor comments (1)

[Abstract] The abstract mentions “exact metric implementations, statistical tests, potential data exclusions, or error analysis” only in passing; a brief summary of these choices in the abstract would improve readability for readers deciding whether to examine the full methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of methodological transparency that will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Referee: [Human evaluation procedure (Methods)] The manuscript provides no inter-rater agreement statistics (e.g., Cohen’s kappa, Fleiss’ kappa, or intraclass correlation) for the human ratings that serve as the validation anchor. Because the claim that automated rankings “closely matched human ratings” rests on the stability of those ratings, the absence of reliability metrics leaves open the possibility that observed alignment reflects shared noise rather than true validity.

Authors: We agree that inter-rater reliability metrics are essential for validating the human ratings that anchor our comparisons. Although the current manuscript does not report these statistics, we will compute and include Fleiss’ kappa (or intraclass correlation coefficients where appropriate) for the three evaluation dimensions in the revised Methods and Results sections. This addition will allow readers to directly assess the stability of the human judgments.
Revision: yes
Referee: [Results] The abstract and results state that automated rankings “closely matched” human ratings but supply no quantitative correlation values (Spearman rho, Kendall tau, or rank correlation threshold) or statistical tests for the match. Without these numbers it is impossible to assess whether the alignment is strong enough to support the conclusion that automated evaluation can reliably distinguish good from bad responses.

Authors: We acknowledge that the current description of alignment is qualitative. In the revised manuscript we will report quantitative rank correlations (Spearman rho and Kendall tau) between the automated metric rankings and the human ratings for each of the three dimensions, along with associated statistical tests and confidence intervals. The abstract will be updated to include these specific values so that the strength of the match can be evaluated objectively.
Revision: yes
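The confidence intervals mentioned in the second response could be obtained with a percentile bootstrap over patient cases. The sketch below assumes per-case scores are available for every system under both the human and the automated rating. The data, the seed, and the resample count are made up, and this is one reasonable procedure rather than the one the authors commit to.

```python
import random
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a over paired lists; tied pairs count as neither."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / (len(x) * (len(x) - 1) / 2)


def system_means(scores, case_ids):
    """Mean score per system over the chosen cases; scores[s][c] is system s on case c."""
    return [sum(row[c] for c in case_ids) / len(case_ids) for row in scores]


def bootstrap_tau_ci(human, metric, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for tau, resampling cases with replacement."""
    rng = random.Random(seed)
    n_cases = len(human[0])
    taus = []
    for _ in range(n_resamples):
        sample = [rng.randrange(n_cases) for _ in range(n_cases)]
        taus.append(kendall_tau_a(system_means(human, sample),
                                  system_means(metric, sample)))
    taus.sort()
    lower = taus[int((alpha / 2) * n_resamples)]
    upper = taus[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic scores: 5 hypothetical systems x 20 cases; metric is a noisy copy of human.
rng = random.Random(1)
human = [[quality + rng.gauss(0, 0.1) for _ in range(20)]
         for quality in (0.2, 0.35, 0.5, 0.65, 0.8)]
metric = [[h + rng.gauss(0, 0.1) for h in row] for row in human]

all_cases = range(20)
point = kendall_tau_a(system_means(human, all_cases), system_means(metric, all_cases))
low, high = bootstrap_tau_ci(human, metric)
print(f"tau = {point:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Resampling cases rather than systems reflects uncertainty about which patient questions were drawn; with a fixed set of 28 systems, τ can still be unstable when several systems have near-identical mean scores.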

Circularity Check

0 steps flagged

Empirical head-to-head comparison with independent human ratings; no derivation chain present

The paper reports an empirical study: 2800 responses from 28 AI systems on 100 patient cases were scored by humans and by automated metrics anchored to clinician-authored reference answers along three explicit dimensions. Automated rankings are then compared directly to those human ratings. No equations, fitted parameters, uniqueness theorems, or self-citations are invoked as load-bearing steps in any derivation. The central claim rests on an external benchmark (human expert ratings) that is independent of the automated metrics being evaluated, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, so the ledger is limited to the most visible assumptions in the summary.

axioms (2)

domain assumption: Clinician-authored reference answers are accurate and comprehensive anchors for evaluating response quality.
The paper explicitly uses these references to anchor the automated metrics.
domain assumption: The three evaluation dimensions capture the essential aspects of response quality for patient questions.
The study assesses responses along these three dimensions as the basis for both human and automated ratings.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

Figure 1 shows Kendall’s τ correlations between system rankings induced by automated metrics and by human judgments.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

User Intentions to Use ChatGPT for Self-Diagnosis and Health- Related Purposes: Cross-sectional Survey Study.JMIR Human Factors, 10(1):e47564, May 2023

Yeganeh Shahsavar and Avishek Choudhury. User Intentions to Use ChatGPT for Self-Diagnosis and Health- Related Purposes: Cross-sectional Survey Study.JMIR Human Factors, 10(1):e47564, May 2023

work page 2023

[2] [2]

Makowski, Luca Saba, Martin Hadamitzky, Jakob Nikolas Kather, Daniel Truhn, Renato Cuocolo, 8 Evaluating Responses to Patient Questions Lisa C

Felix Busch, Lena Hoffmann, Christopher Rueger, Elon HC van Dijk, Rawen Kader, Esteban Ortiz-Prado, Marcus R. Makowski, Luca Saba, Martin Hadamitzky, Jakob Nikolas Kather, Daniel Truhn, Renato Cuocolo, 8 Evaluating Responses to Patient Questions Lisa C. Adams, and Keno K. Bressem. Current applications and challenges in large language models for patient ca...

work page 2025

[3] [3]

Overview of the ArchEHR-QA 2025 Shared Task on Grounded Question Answering from Electronic Health Records

Sarvesh Soni, Soumya Gayen, and Dina Demner-Fushman. Overview of the ArchEHR-QA 2025 Shared Task on Grounded Question Answering from Electronic Health Records. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors,Proceedings of the 24th Workshop on Biomedical Language Processing, pages 396–405, Viena, Austria, August 2025. As...

work page 2025

[4] [4]

Bashkin Osnat. Patient perspectives on artificial intelligence in healthcare: A global scoping review of benefits, ethical concerns, and implementation strategies.International Journal of Medical Informatics, 203:106007, November 2025

work page 2025

[5] [5]

Suhana Bedi, Yutong Liu, Lucy Orr-Ewing, Dev Dash, Sanmi Koyejo, Alison Callahan, Jason A. Fries, Michael Wornow, Akshay Swaminathan, Lisa Soleymani Lehmann, Hyo Jung Hong, Mehr Kashyap, Akash R. Chaurasia, Nirav R. Shah, Karandeep Singh, Troy Tazbaz, Arnold Milstein, Michael A. Pfeffer, and Nigam H. Shah. Testing and Evaluation of Health Care Application...

[6] [6]

Tiffany H. Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, and Victor Tseng. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2):e0000198, February 2023

[7] [7]

Nishant Balepur, Abhilasha Ravichander, and Rachel Rudinger. Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10308–10330, Bangkok, Thail...

[8] [8]

Maxime Griot, Jean Vanderdonckt, Demet Yuksel, and Coralie Hemptinne. Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Lo...

[9] [10]

Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, and Noa Garcia. Can Multiple-choice Questions Really Be Useful in Detecting the Abilities of LLMs? In Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Lan...

[10] [11]

John W. Ayers, Adam Poliak, Mark Dredze, Eric C. Leas, Zechariah Zhu, Jessica B. Kelley, Dennis J. Faix, Aaron M. Goodman, Christopher A. Longhurst, Michael Hogarth, and Davey M. Smith. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Internal Medicine, 183(6):589–596, June 2023

[11] [12]

Isaac A. Bernstein, Youchen (Victor) Zhang, Devendra Govil, Iyad Majid, Robert T. Chang, Yang Sun, Ann Shue, Jonathan C. Chou, Emily Schehlein, Karen L. Christopher, Sylvia L. Groth, Cassie Ludwig, and Sophia Y. Wang. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Network Open, 6(8):e233...

[12] [13]

Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol., 16(5):106:1–106:72, August 2025

[13] [14]

Fenglin Liu, Hongjian Zhou, Boyang Gu, Xinyu Zou, Jinfa Huang, Jinge Wu, Yiru Li, Sam S. Chen, Yining Hua, Peilin Zhou, Junling Liu, Chengfeng Mao, Chenyu You, Xian Wu, Yefeng Zheng, Lei Clifton, Zheng Li, Jiebo Luo, and David A. Clifton. Application of large language models in medicine. Nature Reviews Bioengineering, 3(6):445–464, June 2025

[14] [15]

Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, and Frank Emmert-Streib. Evaluation of Question Answering Systems: Complexity of Judging a Natural Language. ACM Comput. Surv., 58(1):1:1–1:43, August 2025

[15] [16]

Ananya B. Sai, Akash Kumar Mohankumar, and Mitesh M. Khapra. A Survey of Evaluation Metrics Used for NLG Systems. ACM Comput. Surv., 55(2):26:1–26:39, January 2022

[16] [17]

Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, and Aleksandar Savkov. Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics...

[17] [18]

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yaochen Wang, Yinuo Liu, Huichi Zhou, Qihui Zhang, Yao Wan, Pan Zhou, and Lichao Sun. MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark. In Proceedings of the 41st International Conference on Machine Learning, pages 6562–6595. PMLR, July 2024

[18] [19]

Gregory Marton and Alexey Radul. Nuggeteer: Automatic Nugget-Based Evaluation using Descriptions and Judgements. In Robert C. Moore, Jeff Bilmes, Jennifer Chu-Carroll, and Mark Sanderson, editors, Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 375–382, New York City, USA, June 2006. Association for Computationa...

[19] [20]

Davis Bartels, Deepak Gupta, and Dina Demner-Fushman. Can Large Language Models Accurately Generate Answer Keys for Health-related Questions? In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 354–3...

[20] [21]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations, September 2019

[21] [22]

Michael Hanna and Ondřej Bojar. A Fine-Grained Analysis of BERTScore. In Loic Barrault, Ondrej Bojar, Fethi Bougares, Rajen Chatterjee, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Alexander Fraser, Markus Freitag, Yvette Graham, Roman Grundkiewicz, Paco Guzman, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Tom Kocmi, Andre...

[22] [23]

Bang Nguyen, Mengxia Yu, Yun Huang, and Meng Jiang. Reference-based Metrics Disprove Themselves in Question Generation. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13651–13666, Miami, Florida, USA, November 2024. Association for Computational Linguistics

[24] [25]

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. Why We Need New Evaluation Metrics for NLG. In Martha Palmer, Rebecca Hwa, and Sebastian Riedel, editors, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252, Copenhagen, Denmark, September 2017. Association for Computational Linguistics

[25] [26]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational ...

[26] [27]

Emma Croxford, Yanjun Gao, Elliot First, Nicholas Pellegrino, Miranda Schnier, John Caskey, Madeline Oguss, Graham Wills, Guanhua Chen, Dmitriy Dligach, Matthew M. Churpek, Anoop Mayampurath, Frank Liao, Cherodeep Goswami, Karen K. Wong, Brian W. Patterson, and Majid Afshar. Automating Evaluation of AI Text Generation in Healthcare with a Large Language M...

[27] [28]

Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, and Jimmy Huang. Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Assoc...

[28] [29]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics

[29] [30]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brus...

[30] [31]

Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What Disease Does This Patient Have? A Large-Scale Open Domain Question Answering Dataset from Medical Exams. Applied Sciences, 11(14):6421, January 2021

[31] [32]

Sarvesh Soni and Dina Demner-Fushman. A Dataset for Addressing Patient’s Information Needs related to Clinical Course of Hospitalization. (arXiv:2506.04156), June 2025

[32] [33]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics, 2002

[33] [34]

Chin-Yew Lin. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

[34] [35]

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. Optimizing Statistical Machine Translation for Text Simplification. Transactions of the Association for Computational Linguistics, 4:401–415, 2016

[35] [36]

Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. AlignScore: Evaluating Factual Consistency with A Unified Alignment Function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada, July 2023. Ass...

[36] [37]

Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Neal Snider, Thomas Lin, and Meliha Yetisgen. Aci-bench: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation. Scientific Data, 10(1):586, September 2023

[37] [38]

Ani Nenkova and Rebecca Passonneau. Evaluating Content Selection in Summarization: The Pyramid Method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152, Boston, Massachusetts, USA, May 2004. Association for Computational Linguistics

[38] [39]

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard Hovy. Learning Whom to Trust with MACE. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1120–1130, Atlanta, Georgia, June 2013...