Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Abu Noman Md Sakib; Md. Main Oddin Chisty; Zijie Zhang

arXiv: 2604.19281 · v1 · submitted 2026-04-21 · cs.HC · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-10 02:12 UTC · model grok-4.3

classification: cs.HC · cs.AI · cs.CL · cs.LG

keywords: medical question answering · large language models · evaluation framework · health equity · algorithmic discrimination · entity recognition · factual consistency · semantic similarity

The pith

A new component-wise evaluation framework shows that large language models for medical questions have severe accuracy failures and equity disparities not visible through semantic similarity alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Verification-Based Score framework that judges medical question-answering systems by separately scoring entity recognition, semantic similarity, factual consistency, and structured information completeness instead of relying on overall word similarity. This separation matters because semantic-only checks can rate an answer as good even when it misses key medical facts or entities, which carries real risks for patients using the systems for health guidance. When the authors test three common large language models on 48 public health topics drawn from authoritative sources, they find consistent severe shortcomings across the criteria along with a 13.8 percent performance drop on topics about chronic conditions that affect older and minority populations. The results indicate condition-based algorithmic discrimination and show that prompt engineering cannot overcome basic limitations in how the models extract and verify medical information. The work therefore questions whether semantic evaluation by itself is enough to judge safety in medical artificial intelligence.

Core claim

The paper establishes that its VB-Score framework, which evaluates entity recognition, semantic similarity, factual consistency, and structured information completeness as independent components, uncovers a major discrepancy between semantic and entity accuracy. All three tested models display almost uniformly severe performance failures across 48 public health topics, with most showing 13.8 percent lower performance on chronic-condition topics linked to older and minority populations, which the authors interpret as condition-based algorithmic discrimination. The analyses further demonstrate that prompt engineering does not compensate for architectural limitations in medical entity handling.
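The 13.8 percent figure is stated relative to an overall average. The exact formula is not given in the material summarized here, so the following is only one plausible reading, with all symbols introduced as assumptions: $\bar{S}_{\text{all}}$ is the mean score over all topics and $\bar{S}_{\text{chronic}}$ is the mean score over chronic-condition topics.

```latex
\Delta_{\text{rel}} = \frac{\bar{S}_{\text{all}} - \bar{S}_{\text{chronic}}}{\bar{S}_{\text{all}}},
\qquad
\Delta_{\text{pp}} = 100 \left( \bar{S}_{\text{all}} - \bar{S}_{\text{chronic}} \right)
```

Under this reading, a relative drop and a percentage-point gap are different quantities. With a purely hypothetical overall mean of 0.50, a relative drop of 0.138 would correspond to a chronic-topic mean of about 0.431, which is a gap of about 6.9 percentage points.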

What carries the argument

VB-Score, a verification-based evaluation framework that separately measures entity recognition, semantic similarity, factual consistency, and structured information completeness in answers to medical questions.
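To make the idea of scoring components separately more concrete, here is a minimal sketch. It is not the authors' implementation: the paper's operational definitions are not available here, so `entity_f1` uses plain set-overlap F1, the `ComponentScores` container and its method names are hypothetical, and the 0.8 semantic threshold is an assumption chosen for illustration (the 0.10 entity threshold echoes the F1 level mentioned in the figure captions). The example entities are invented.

```python
from dataclasses import dataclass


def entity_f1(predicted: set[str], reference: set[str]) -> float:
    """Set-overlap F1 between predicted and reference entities (case-insensitive)."""
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in reference}
    overlap = len(pred & ref)
    if overlap == 0:
        # Covers empty inputs and disjoint sets alike.
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


@dataclass
class ComponentScores:
    """Hypothetical container holding the four components side by side."""
    entity_f1: float
    semantic_similarity: float
    factual_consistency: float
    completeness: float

    def semantic_entity_gap(self) -> float:
        # Positive values mean the answer sounds closer than its entities are.
        return self.semantic_similarity - self.entity_f1

    def is_fluent_but_unverified(self, sem_min: float = 0.8, ent_max: float = 0.10) -> bool:
        # Assumed thresholds: high semantic similarity combined with low entity F1.
        return self.semantic_similarity >= sem_min and self.entity_f1 < ent_max


# Illustrative entities only; not taken from the paper's benchmark.
reference = {"metformin", "type 2 diabetes", "HbA1c", "insulin"}
predicted = {"Type 2 Diabetes", "blood sugar"}
print(round(entity_f1(predicted, reference), 3))  # 0.333
```

Keeping the components in separate fields, rather than collapsing them into one number, is what lets a high-similarity, low-entity answer be flagged at all.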

Load-bearing premise

The 48 public health topics form a representative sample and the observed performance drops on chronic-condition topics arise from the conditions and their demographic links rather than from differences in topic complexity, training data coverage, or how the component scores were defined.

What would settle it

A controlled re-test showing that performance on chronic-condition topics equals the overall average after adjusting for topic complexity, training data coverage, and scoring definitions would falsify the claim of condition-based algorithmic discrimination.
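One way such a controlled re-test could be set up is sketched below, using ordinary least squares with a chronic-condition indicator and topic-level covariates. This is an illustration under stated assumptions, not the paper's analysis: the function names are hypothetical, and the data are synthetic and constructed so that scores depend only on a made-up complexity measure.

```python
import numpy as np


def raw_gap(scores, is_chronic):
    """Unadjusted difference: mean chronic score minus mean non-chronic score."""
    y = np.asarray(scores, dtype=float)
    chronic = np.asarray(is_chronic, dtype=bool)
    return float(y[chronic].mean() - y[~chronic].mean())


def adjusted_chronic_effect(scores, is_chronic, covariates):
    """OLS coefficient on the chronic indicator after adjusting for covariates."""
    y = np.asarray(scores, dtype=float)
    chronic = np.asarray(is_chronic, dtype=float)
    # Design matrix: intercept, chronic indicator, then the covariate column(s).
    X = np.column_stack([np.ones_like(y), chronic, np.asarray(covariates, dtype=float)])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])


# Synthetic illustration only: 48 topics whose score depends solely on complexity,
# with the more complex half labeled chronic. No real benchmark data is used.
complexity = np.linspace(0.0, 1.0, 48)
is_chronic = complexity > 0.5
scores = 0.6 - 0.3 * complexity

print("raw gap:", raw_gap(scores, is_chronic))
print("adjusted effect:", adjusted_chronic_effect(scores, is_chronic, complexity))
```

In this constructed case the raw gap is clearly negative while the adjusted coefficient is numerically zero, which is the pattern that would count against a condition-based interpretation. A gap that persists after adjustment would point the other way.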

Figures

Figures reproduced from arXiv: 2604.19281 by Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang.

**Figure 1.** VB-Score Component Breakdown by Model (Zero-Shot Baseline) indicates an average of less than 10% for Entity F1

**Figure 2.** Performance Profiles of Individual Components: The radar chart provides clear distinctions between the performances …

**Figure 3.** Semantic-Entity Gap: High semantic similarity ( …

**Figure 4.** Component-Wise Performance: Baseline vs RAG. Arrows indicate percentage improvements from RAG across Entity …

**Figure 5.** Prompt Sensitivity Analysis Across Configurations. (a) VB-Score comparison showing RAG achieves the highest …

**Figure 6.** Relative Improvements Heatmap: Model-Specific Prompt Sensitivity. Gemini benefits from strict instructions while …

**Figure 7.** Using correlations which are sufficiently small, separate by dimension. Models have varied in their performance profiles

**Figure 8.** Semantic-Factual Gap. GPT-4 and Claude prioritize fluency over factual accuracy (positive gap), while Gemini reverses …

**Figure 9.** Dangerous Failure Patterns: Fluent Misinformation. 37–65% of GPT-4/Claude responses combine high semantic …

**Figure 10.** The Failure Rates of All Models: Universally, all entities recognized by all models failed to achieve an F1 score of 0.10

**Figure 11.** Performance vs Cost Analysis. Under our evaluation protocol, Gemini (free) achieves 25–48% higher VB-Scores than …

**Figure 12.** Cost Scaling: Equitable Access at Scale. Logarithmic plot shows exponential cost divergence: at 1M queries, GPT-4 …

**Figure 13.** Quality Dimension Comparison Under Our Protocol. Gemini shows 2 …

**Figure 14.** Disease Category Disparity Within Our Benchmark: Models scored 13.8% lower on chronic disease topics compared …

**Figure 15.** Factual Consistency Drives Disparity. When comparing the component-wise gap analysis, those models had a +4.4pp …

**Figure 16.** Compounding Disadvantage for Vulnerable Populations. (Left) Populations with the highest burden of chronic …

Original abstract

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VB-Score gives a practical breakdown of medical QA evaluation, but the 13.8% chronic-condition gap cannot yet be read as condition-based discrimination without controls for topic difficulty and data coverage.

The main thing to know is that this paper proposes VB-Score, a four-component framework that scores medical QA on entity recognition, semantic similarity, factual consistency, and structured completeness instead of relying on semantic similarity alone. They run it on three LLMs across 48 public-health topics drawn from authoritative sources and report widespread failures plus a 13.8% performance drop on chronic-condition topics that affect older and minority populations.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces the VB-Score framework for evaluating medical question-answering LLMs by decomposing performance into four components: entity recognition, semantic similarity, factual consistency, and structured information completeness. It applies this framework to three widely used LLMs across 48 public health topics drawn from authoritative sources. The authors report uniformly severe performance failures in all models, a 13.8% performance reduction (relative to overall average) on chronic-condition topics affecting older and minority populations, and interpret this as evidence of condition-based algorithmic discrimination. They conclude that semantic-only metrics are insufficient for medical AI safety assessment and that prompt engineering cannot overcome core architectural limitations.

Significance. If the component scores prove reliable and the reported disparity survives controls for topic characteristics, the work would be significant for health informatics and AI ethics. It offers a more granular alternative to semantic similarity metrics and draws attention to potential equity risks in LLM-generated public health information. The framework could inform safer evaluation practices for medical AI, though its impact depends on addressing the methodological gaps noted below.

major comments (3)

[Abstract and Methods] The 48 public health topics are stated to come from high-quality sources, yet no details are given on selection criteria, stratification by chronic vs. non-chronic status, or balancing for confounders such as topic complexity, entity density, question phrasing, or prevalence in training data. This information is load-bearing for the claim that the 13.8% performance drop constitutes condition-based algorithmic discrimination rather than uncontrolled differences in topic structure or data coverage.
[Evaluation Framework (Section 3)] The operational definitions and scoring procedures for the four VB-Score components are not specified. In particular, the rubric or verification process used to assess factual consistency against source material, the criteria for entity recognition accuracy, and the method for scoring structured completeness are absent. These details are required to interpret the reported 'severe performance failures' and to allow replication or independent validation.
[Results (Section 4)] The 13.8% lower performance on chronic-condition topics is presented without accompanying statistical support, including confidence intervals, hypothesis tests, or regression models that control for topic-level variables. Absent such analysis, the attribution of the gap to algorithmic discrimination remains under-supported and could be explained by the unmeasured factors identified in the skeptic note.

minor comments (3)

[Abstract] The abstract refers to 'rigorous reviews' of model performance; a short description of the review protocol (e.g., number of reviewers, blinding) would improve transparency.
[Methods] Clarify whether the three LLMs were evaluated under identical prompting conditions and whether any systematic prompt-engineering ablation was performed beyond the general statement that it does not compensate for architectural limits.
[Results] Tables or figures reporting component scores should include per-topic sample sizes, standard errors, and clear labeling of which topics fall into the chronic-condition category.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback on our work. We address each of the major comments in turn and describe the revisions planned for the manuscript.

Referee: [Abstract and Methods] The 48 public health topics are stated to come from high-quality sources, yet no details are given on selection criteria, stratification by chronic vs. non-chronic status, or balancing for confounders such as topic complexity, entity density, question phrasing, or prevalence in training data. This information is load-bearing for the claim that the 13.8% performance drop constitutes condition-based algorithmic discrimination rather than uncontrolled differences in topic structure or data coverage.

Authors: We agree that explicit details on the selection and characteristics of the 48 topics are essential. In the revised manuscript, we will insert a detailed Methods subsection describing the authoritative sources used, the criteria for topic selection, the stratification into chronic and non-chronic categories, and balancing for observable confounders including topic complexity, entity density, and question phrasing. We note, however, that controlling for prevalence in the LLMs' training data is not possible given the lack of transparency from model developers regarding their training corpora.

Revision: partial

Referee: [Evaluation Framework (Section 3)] The operational definitions and scoring procedures for the four VB-Score components are not specified. In particular, the rubric or verification process used to assess factual consistency against source material, the criteria for entity recognition accuracy, and the method for scoring structured completeness are absent. These details are required to interpret the reported 'severe performance failures' and to allow replication or independent validation.

Authors: We acknowledge that the operational definitions and scoring procedures were insufficiently detailed in the original submission. The revised manuscript will substantially expand Section 3 to include precise operational definitions, scoring rubrics, and verification processes for each VB-Score component. This will encompass the entity recognition criteria, the step-by-step factual consistency verification against source material (with examples), and the structured completeness scoring method, enabling full replication and validation.

Revision: yes

Referee: [Results (Section 4)] The 13.8% lower performance on chronic-condition topics is presented without accompanying statistical support, including confidence intervals, hypothesis tests, or regression models that control for topic-level variables. Absent such analysis, the attribution of the gap to algorithmic discrimination remains under-supported and could be explained by the unmeasured factors identified in the skeptic note.

Authors: We agree that statistical support is required to substantiate the performance disparity. In the revised Results section, we will report confidence intervals around the 13.8% figure, conduct hypothesis tests to compare performance on chronic versus non-chronic topics, and perform regression analyses that control for topic-level variables such as complexity and entity density. These additions will provide a more robust basis for interpreting the gap in terms of condition-based algorithmic discrimination.

Revision: yes
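As a rough illustration of the kind of statistical support discussed in this exchange, the sketch below computes a permutation p-value and a bootstrap confidence interval for the chronic versus non-chronic score gap. It is an assumption-laden example, not the authors' planned analysis: the function names, default iteration counts, and the idea of treating per-topic scores as the unit of resampling are all hypothetical choices.

```python
import numpy as np


def mean_gap(chronic, other):
    """Mean non-chronic score minus mean chronic score (positive = chronic is worse)."""
    return float(np.mean(other) - np.mean(chronic))


def permutation_p_value(chronic, other, n_perm=10000, seed=0):
    """Two-sided permutation test on the absolute difference in group means."""
    rng = np.random.default_rng(seed)
    chronic = np.asarray(chronic, dtype=float)
    other = np.asarray(other, dtype=float)
    observed = abs(other.mean() - chronic.mean())
    pooled = np.concatenate([chronic, other])
    k = len(chronic)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)  # shuffle the group labels
        if abs(perm[k:].mean() - perm[:k].mean()) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(chronic, other, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling topics within each group."""
    rng = np.random.default_rng(seed)
    chronic = np.asarray(chronic, dtype=float)
    other = np.asarray(other, dtype=float)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(chronic, size=len(chronic), replace=True)
        o = rng.choice(other, size=len(other), replace=True)
        gaps[i] = o.mean() - c.mean()
    lo, hi = np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

A call such as `bootstrap_ci(chronic_scores, other_scores)` would take two arrays of per-topic scores. Neither procedure adjusts for topic complexity or entity density, so both would complement rather than replace a regression with covariates.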

standing simulated objections not resolved

Controlling for the prevalence of chronic condition topics in the proprietary training data of the evaluated LLMs.

Circularity Check

0 steps flagged

No circularity: empirical evaluation relies on external authoritative references

The paper defines VB-Score as a four-component framework (entity recognition, semantic similarity, factual consistency, structured completeness) and applies it to model outputs compared against 48 topics drawn from independent high-quality authoritative sources. Performance scores and the reported 13.8% disparity are computed directly from these external comparisons rather than from any fitted parameters, self-referential definitions, or prior self-citations that reduce the result to the inputs by construction. No equations, uniqueness theorems, or ansatzes appear in the derivation chain; the interpretive claim of condition-based discrimination follows from the observed empirical gaps without mathematical reduction to the framework itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions: that the four components capture the medically relevant dimensions of QA quality, and that the chosen authoritative sources constitute unbiased ground truth. No free parameters or invented entities are introduced.

axioms (2)

domain assumption: The four components (entity recognition, semantic similarity, factual consistency, structured information completeness) are the appropriate and sufficient dimensions for assessing medical accuracy and equity risks.
The entire framework and all reported disparities are defined in terms of these components.
domain assumption: The 48 topics drawn from high-quality authoritative sources are representative of public health questions and free of selection bias relative to chronic conditions in older and minority populations.
The 13.8% disparity claim depends on this representativeness.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages

[1]

Griffin Adams, Emily Alsentzer, Mert Ketenci, Jason Zucker, and Noémie Elhadad. 2021. What’s in a summary? laying the groundwork for advances in hospital-course summarization. InProceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, Vol. 2021. 4794

work page 2021
[2]

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. InProceedings of the 2nd clinical natural language processing workshop. 72–78

work page 2019
[3]

Syed Umar Amin, Mohsen Guizani, and M Shamim Hossain. 2025. Advances, Evaluation, and Explainability of Large Language Models in Healthcare: A Systematic Review.ACM Transactions on Multimedia Computing, Communications and Applications(2025)

work page 2025
[4]

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. 65–72

work page 2005
[5]

David W Bates, David M Levine, Gianna Syber, Eyal Zimlichman, Kenneth Sands, Ben Derman, Ziv Korach, Hojjat Salmasian, Nancy Leung, and Yves Chretien. 2023. The Safety of Inpatient Health Care.New England Journal of Medicine388, 2 (2023), 142–153. doi:10.1056/NEJMsa2206117

work page doi:10.1056/nejmsa2206117 2023
[6]

Jacobs and Hanna Wallach

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?(FAccT ’21). Association for Computing Machinery, New York, NY, USA, 610–623. doi:10.1145/3442188. 3445922

work page doi:10.1145/3442188 2021
[7]

Kaustav Bera, Kurt A Schalper, David L Rimm, Vamsidhar Velcheti, and Anant Madabhushi. 2019. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology.Nature reviews Clinical oncology16, 11 (2019), 703–715

work page 2019
[8]

Paula Braveman and Laura Gottlieb. 2014. The social determinants of health: it’s time to consider the causes of the causes.Public health reports129, 1_suppl2 (2014), 19–31

work page 2014
[9]

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

work page 2020
[10]

Danton S Char, Michael D Abràmoff, and Chris Feudtner. 2020. Identifying ethical considerations for machine learning healthcare applications.The American Journal of Bioethics20, 11 (2020), 7–17

work page 2020
[11]

Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Evaluating question answering evaluation. InProceedings of the 2nd Workshop on Machine Reading for Question Answering. 119–124

work page 2019
[12]

Jean-Benoit Delbrouck, Pierre Chambon, Zhihong Chen, Maya Varma, Andrew Johnston, Louis Blankemeier, Dave Van Veen, Tan Bui, Steven Truong, and Curtis Langlotz. 2024. Radgraph-xl: A large-scale expert-annotated dataset for entity and relation extraction from radiology reports. InFindings of the Association for Computational Linguistics: ACL 2024. 12902–12915

work page 2024
[13]

Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in healthcare.Nature medicine25, 1 (2019), 24–29

work page 2019
[14]

Joseph Futoma, Morgan Simons, Trishan Panch, Finale Doshi-Velez, and Leo Anthony Celi. 2020. The myth of generalisability in clinical research and machine learning in health care.The Lancet Digital Health2, 9 (2020), e489–e492

work page 2020
[15]

Milena A Gianfrancesco, Suzanne Tamang, Jinoos Yazdany, and Gabriela Schmajuk. 2018. Potential biases in machine learning algorithms using electronic health record data.JAMA internal medicine178, 11 (2018), 1544–1547

work page 2018
[16]

Liu, and Mohammad Saleh

Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. 2019. Assessing The Factual Accuracy of Generated Text. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(Anchorage, AK, USA)(KDD ’19). Association for Computing Machinery, New York, NY, USA, 166–175. doi:10.1145/3292500.3330955

work page doi:10.1145/3292500.3330955 2019
[17]

Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment.arXiv preprint arXiv:2010.05478(2020)

work page arXiv 2020
[18]

WHO Guidance. 2021. Ethics and governance of artificial intelligence for health.World Health Organization(2021), 1–165

work page 2021
[19]

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. InInternational conference on machine learning. PMLR, 3929–3938

work page 2020
[20]

Greene, Tamara Broderick, Michael M

Benjamin Haibe-Kains, George Alexandru Adam, Ahmed Hosny, Farnoosh Khodakarami, Levi Waldron, Bo Wang, Chris McIntosh, Anna Goldenberg, Anshul Kundaje, Casey S. Greene, Tamara Broderick, Michael M. Hoffman, Jeffrey T. Leek, Keegan Korthauer, Wolfgang Huber, Alvis Brazma, Joelle Pineau, Robert Tibshirani, Trevor Hastie, John P. A. Ioannidis, John Quackenbu...

work page 2026
[21]

2020), E14–E16

Transparency and reproducibility in artificial intelligence.Nature586 (Oct. 2020), E14–E16. doi:10.1038/s41586-020-2766-y

work page doi:10.1038/s41586-020-2766-y 2020
[22]

Nchebe-Jah Iloanusi and Soon Ae Chun. 2024. AI impact on health equity for marginalized, racial, and ethnic minorities. InProceedings of the 25th annual international conference on digital government research. 841–848

work page 2024
[23]

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume. 874–880

work page 2021
[24]

Sophie Jentzsch, Patrick Schramowski, Constantin Rothkopf, and Kristian Kersting. 2019. Semantics Derived Automatically from Language Corpora Contain Human-like Moral Choices. InProceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (Honolulu, HI, USA)(AIES ’19). Association for Computing Machinery, New York, NY, USA, 37–44. doi:10.1145/33...

work page doi:10.1145/3306618.3314267 2019
[25]

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 2567–2577

work page 2019
[26]

Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. 2024. Genegpt: Augmenting large language models with domain tools for improved access to biomedical information.Bioinformatics40, 2 (2024), btae075

work page 2024
[27]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners.Advances in neural information processing systems35 (2022), 22199–22213

work page 2022
[28]

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). 9332–9346

work page 2020
[29]

Philippe Laban, Tobias Schnabel, Paul N Bennett, and Marti A Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization.Transactions of the Association for Computational Linguistics10 (2022), 163–177

work page 2022
[30]

Peter Lee, Sebastien Bubeck, and Joseph Petro. 2023. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine.New England Journal of Medicine388, 13 (2023), 1233–1239

work page 2023
[31]

Yang and Fynn Bachmann

Haley Lepp and Daniel Scott Smith. 2025. “You Cannot Sound Like GPT": Signs of language discrimination and resistance in computer science publishing. InProceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT ’25). Association for Computing Machinery, New York, NY, USA, 3162–3181. doi:10.1145/3715275.3732202

work page doi:10.1145/3715275.3732202 2025
[32]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems33 (2020), 9459–9474

work page 2020
[33]

Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, and Ole Winther. 2024. Can large language models reason about medical questions?Patterns5, 3 (2024)

work page 2024
[34]

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. InText summarization branches out. 74–81

work page 2004
[35]

Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers). 3214–3252

work page 2022
[36]

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.ACM Comput. Surv.55, 9 (2023). doi:10.1145/3560815

work page doi:10.1145/3560815 2023
[37]

Calvert, Alastair K

Xiaoxuan Liu, Samantha Cruz Rivera, David Moher, Melanie J. Calvert, Alastair K. Denniston, the SPIRIT-AI, and CONSORT-AI Working Group. 2020. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. The Lancet Digital Health2, 10 (2020), e537–e548. doi:10.1016/S2589-7500(20)30218-1

work page doi:10.1016/s2589-7500(20)30218-1 2020
[38]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1906–1919

work page 2020
[39]

Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2021. A survey on bias and fairness in machine learning.ACM computing surveys (CSUR)54, 6 (2021), 1–35

work page 2021
[40]

Bertalan Meskó and Eric J Topol. 2023. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ digital medicine6, 1 (2023), 120

work page 2023
[41]

Shira Mitchell, Eric Potash, Solon Barocas, Alexander D’Amour, and Kristian Lum. 2021. Algorithmic fairness: Choices, assumptions, and definitions.Annual review of statistics and its application8, 1 (2021), 141–163

work page 2021
[42]

Aurélie Névéol, Hercules Dalianis, Sumithra Velupillai, Guergana Savova, and Pierre Zweigenbaum. 2018. Clinical natural language processing in languages other than English: opportunities and challenges.Journal of biomedical semantics9, 1 (2018), 12

work page 2018
[43]

Natalia Norori, Qiyang Hu, Florence Marcelle Aellen, Francesca Dalia Faraci, and Athina Tzovara. 2021. Addressing bias in big data and AI for health care: A call for open science.Patterns2, 10 (2021)

work page 2021
[44]

Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations.Science366, 6464 (2019), 447–453

work page 2019
[45]

Jesutofunmi A Omiye, Jenna C Lester, Simon Spichak, Veronica Rotemberg, and Roxana Daneshjou. 2023. Large language models propagate race-based medicine.NPJ Digital Medicine6, 1 (2023), 195

work page 2023
[46]

Ermelinda Oro, Francesco Maria Granata, and Massimo Ruffolo. 2025. A Comprehensive Evaluation of Embedding Models and LLMs for IR and QA Across English and Italian.Big Data and Cognitive Computing9, 5 (2025), 141. FAccT ’26, June 25–28, 2026, Montreal, QC, Canada A.N.M. Sakib et al

work page 2025
[47]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.

[48]

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.

[49]

Jessica K Paulus and David M Kent. 2020. Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities. npj Digital Medicine 3, 1 (2020), 99.

[50]

Dana Pessach and Erez Shmueli. 2022. A review on fairness in machine learning. ACM Computing Surveys (CSUR) 55, 3 (2022), 1–44.

[51]

Rimma Pivovarov and Noémie Elhadad. 2015. Automated methods for the summarization of electronic health records. Journal of the American Medical Informatics Association 22, 5 (2015), 938–947.

[52]

Alvin Rajkomar, Jeffrey Dean, and Isaac Kohane. 2019. Machine learning in medicine. New England Journal of Medicine 380, 14 (2019), 1347–1358.

[53]

Alvin Rajkomar, Michaela Hardt, Michael D Howell, Greg Corrado, and Marshall H Chin. 2018. Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine 169, 12 (2018), 866–872.

[54]

Pranav Rajpurkar, Emma Chen, Oishi Banerjee, and Eric J Topol. 2022. AI in health and medicine. Nature Medicine 28, 1 (2022), 31–38.

[55]

SCPR Ratzan, Ruth Parker, C Selden, Marcia Zorn, et al. 2000. National library of medicine current bibliographies in medicine: health literacy. Bethesda, MD: National Institutes of Health, US Department of Health and Human Services (2000), 331–337.

[56]

Maya Rotmensch, Yoni Halpern, Abdulhakim Tlimat, Steven Horng, and David Sontag. 2017. Learning a health knowledge graph from electronic medical records. Scientific Reports 7, 1 (2017), 5994.

[57]

Nina Schwalbe and Brian Wahl. 2020. Artificial intelligence and the future of global health. The Lancet 395, 10236 (2020), 1579–1586.

[58]

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. 2023. Large language models encode clinical knowledge. Nature 620, 7972 (2023), 172–180.

[59]

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. 2020. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33 (2020), 3008–3021.

[61]

Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940.

[63]

UNESCO. 2022. Recommendation on the ethics of artificial intelligence. United Nations Educational, Scientific and Cultural Organization.

[64]

Tiffany C Veinot, Hannah Mitchell, and Jessica S Ancker. 2018. Good intentions are not enough: how informatics interventions can worsen inequality. Journal of the American Medical Informatics Association 25, 8 (2018), 1080–1088.

[65]

Darshali A. Vyas, Leo G. Eisenstein, and David S. Jones. 2020. Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms. New England Journal of Medicine 383, 9 (2020), 874–882. doi:10.1056/NEJMms2004740

[66]

Robert M Wachter and Erik Brynjolfsson. 2024. Will generative artificial intelligence deliver on its promise in health care? JAMA 331, 1 (2024), 65–69.

[67]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.

[68]

David R Williams and Selina A Mohammed. 2013. Racism and health I: Pathways and scientific evidence. American Behavioral Scientist 57, 8 (2013), 1152–1173.

[69]

Michael Wornow, Yizhe Xu, Rahul Thapa, Birju Patel, Ethan Steinberg, Scott Fleming, Michael A Pfeffer, Jason Fries, and Nigam H Shah. 2023. The shaky foundations of large language models and foundation models for electronic health records. npj Digital Medicine 6, 1 (2023), 135.

[71]

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations (ICLR).

[72]

Yijia Zhang, Qingyu Chen, Zhihao Yang, Hongfei Lin, and Zhiyong Lu. 2019. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Scientific Data 6, 1 (2019), 52.

[73]

Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, and Huaixiu Steven Zheng. 2024. SELF-DISCOVER: Large Language Models Self-Compose Reasoning Structures. In Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), V... doi:10.52202/079017-4004
