arxiv: 2604.19281 · v1 · submitted 2026-04-21 · 💻 cs.HC · cs.AI · cs.CL · cs.LG

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

Pith reviewed 2026-05-10 02:12 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.CL · cs.LG
keywords medical question answering · large language models · evaluation framework · health equity · algorithmic discrimination · entity recognition · factual consistency · semantic similarity

The pith

A new component-wise evaluation framework shows that large language models for medical questions have severe accuracy failures and equity disparities not visible through semantic similarity alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Verification-Based Score framework that judges medical question-answering systems by separately scoring entity recognition, semantic similarity, factual consistency, and structured information completeness instead of relying on overall word similarity. This separation matters because semantic-only checks can rate an answer as good even when it misses key medical facts or entities, which carries real risks for patients using the systems for health guidance. When the authors test three common large language models on 48 public health topics drawn from authoritative sources, they find consistent severe shortcomings across the criteria along with a 13.8 percent performance drop on topics about chronic conditions that affect older and minority populations. The results indicate condition-based algorithmic discrimination and show that prompt engineering cannot overcome basic limitations in how the models extract and verify medical information. The work therefore questions whether semantic evaluation by itself is enough to judge safety in medical artificial intelligence.

Core claim

The paper establishes that its VB-Score framework, which evaluates entity recognition, semantic similarity, factual consistency, and structured information completeness as independent components, uncovers a major discrepancy between semantic and entity accuracy. All three tested models show almost uniformly severe performance failures across 48 public health topics, and most of them score 13.8 percent below the overall average on chronic-condition topics linked to older and minority populations, which the authors interpret as condition-based algorithmic discrimination. The analyses further indicate that prompt engineering does not compensate for architectural limitations in medical entity handling.

What carries the argument

VB-Score, a verification-based evaluation framework that separately measures entity recognition, semantic similarity, factual consistency, and structured information completeness in answers to medical questions.
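
As a rough illustration of what scoring the components separately could look like, the sketch below keeps the four scores apart and derives the gap between semantic similarity and entity F1. It is not the paper's implementation: the `ComponentScores` container, the set-overlap definition of entity F1, and every numeric value are assumptions made for this example, since the paper's actual scoring procedures are not given on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    """Hypothetical container that keeps the four components separate."""
    entity_f1: float
    semantic_similarity: float
    factual_consistency: float
    structured_completeness: float

    @property
    def semantic_entity_gap(self) -> float:
        # Positive when an answer reads as similar but misses key entities.
        return self.semantic_similarity - self.entity_f1


def entity_f1(predicted: set, reference: set) -> float:
    """Set-overlap F1 over normalized entity strings (an assumed definition)."""
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in reference}
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Illustrative values only; none of these numbers come from the paper.
reference_entities = {"metformin", "insulin", "HbA1c", "type 2 diabetes"}
answer_entities = {"Metformin", "type 2 diabetes"}

scores = ComponentScores(
    entity_f1=entity_f1(answer_entities, reference_entities),
    semantic_similarity=0.82,
    factual_consistency=0.40,
    structured_completeness=0.35,
)
print(f"entity F1 = {scores.entity_f1:.3f}")
print(f"semantic-entity gap = {scores.semantic_entity_gap:.3f}")
```

Reporting the components side by side, rather than folding them into one number, is what lets a large semantic-entity gap stay visible; how the paper aggregates them, if at all, is not assumed here.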

Load-bearing premise

The 48 public health topics form a representative sample and the observed performance drops on chronic-condition topics arise from the conditions and their demographic links rather than from differences in topic complexity, training data coverage, or how the component scores were defined.

What would settle it

A controlled re-test showing that performance on chronic-condition topics equals the overall average after adjusting for topic complexity, training data coverage, and scoring definitions would falsify the claim of condition-based algorithmic discrimination.
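
One way such an adjustment could be run is sketched below, under stated assumptions: a single numeric covariate (here called `complexity`), a 0/1 chronic indicator, and synthetic scores. The function names and data are hypothetical. The sketch uses the Frisch-Waugh-Lovell result to obtain the coefficient on the chronic indicator after partialling out the covariate; a real analysis would need several covariates and uncertainty estimates.

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of a one-predictor least-squares fit."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variance")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope


def residuals(x, y):
    """Residuals of y after a least-squares fit on x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def adjusted_gap(scores, is_chronic, covariate):
    """Coefficient on the chronic indicator after partialling out one covariate.

    Frisch-Waugh-Lovell: regress both the score and the indicator on the
    covariate, then regress the score residuals on the indicator residuals.
    """
    score_res = residuals(covariate, scores)
    chronic_res = residuals(covariate, is_chronic)
    _, slope = simple_ols(chronic_res, score_res)
    return slope


def raw_gap(scores, is_chronic):
    """Unadjusted difference in mean score: chronic minus non-chronic."""
    chronic = [s for s, c in zip(scores, is_chronic) if c == 1]
    other = [s for s, c in zip(scores, is_chronic) if c == 0]
    return mean(chronic) - mean(other)


# Synthetic topics: the score depends on complexity only, and the chronic
# topics happen to be the more complex ones. All values are made up.
complexity = [1, 2, 3, 4, 5, 6]
is_chronic = [0, 0, 0, 1, 1, 1]
scores = [0.8 - 0.05 * c for c in complexity]

print(f"raw gap      = {raw_gap(scores, is_chronic):.3f}")
print(f"adjusted gap = {adjusted_gap(scores, is_chronic, complexity):.3f}")
```

In this constructed case the raw gap is clearly negative while the adjusted coefficient is zero up to floating-point error, which is the pattern that would undercut a discrimination reading; a gap that survives adjustment would point the other way.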

Figures

Figures reproduced from arXiv: 2604.19281 by Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang.

Figure 1: VB-Score Component Breakdown by Model (Zero-Shot Baseline) indicates an average of less than 10% for Entity F1 …
Figure 2: Performance Profiles of Individual Components: The radar chart provides clear distinctions between the performances …
Figure 3: Semantic-Entity Gap: High semantic similarity ( …
Figure 4: Component-Wise Performance: Baseline vs RAG. Arrows indicate percentage improvements from RAG across Entity …
Figure 5: Prompt Sensitivity Analysis Across Configurations. (a) VB-Score comparison showing RAG achieves the highest …
Figure 6: Relative Improvements Heatmap: Model-Specific Prompt Sensitivity. Gemini benefits from strict instructions while …
Figure 7: Using correlations which are sufficiently small, separate by dimension. Models have varied in their performance profiles …
Figure 8: Semantic-Factual Gap. GPT-4 and Claude prioritize fluency over factual accuracy (positive gap), while Gemini reverses …
Figure 9: Dangerous Failure Patterns: Fluent Misinformation. 37–65% of GPT-4/Claude responses combine high semantic …
Figure 10: The Failure Rates of All Models: Universally, all entities recognized by all models failed to achieve an F1 score of 0.10 …
Figure 11: Performance vs Cost Analysis. Under our evaluation protocol, Gemini (free) achieves 25–48% higher VB-Scores than …
Figure 12: Cost Scaling: Equitable Access at Scale. Logarithmic plot shows exponential cost divergence: at 1M queries, GPT-4 …
Figure 13: Quality Dimension Comparison Under Our Protocol. Gemini shows 2 …
Figure 14: Disease Category Disparity Within Our Benchmark: Models scored 13.8% lower on chronic disease topics compared …
Figure 15: Factual Consistency Drives Disparity. When comparing the component-wise gap analysis, those models had a +4.4pp …
Figure 16: Compounding Disadvantage for Vulnerable Populations. (Left) Populations with the highest burden of chronic …

original abstract

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces the VB-Score framework for evaluating medical question-answering LLMs by decomposing performance into four components: entity recognition, semantic similarity, factual consistency, and structured information completeness. It applies this framework to three widely used LLMs across 48 public health topics drawn from authoritative sources. The authors report uniformly severe performance failures in all models, a 13.8% performance reduction (relative to overall average) on chronic-condition topics affecting older and minority populations, and interpret this as evidence of condition-based algorithmic discrimination. They conclude that semantic-only metrics are insufficient for medical AI safety assessment and that prompt engineering cannot overcome core architectural limitations.

Significance. If the component scores prove reliable and the reported disparity survives controls for topic characteristics, the work would be significant for health informatics and AI ethics. It offers a more granular alternative to semantic similarity metrics and draws attention to potential equity risks in LLM-generated public health information. The framework could inform safer evaluation practices for medical AI, though its impact depends on addressing the methodological gaps noted below.

major comments (3)
  1. [Abstract and Methods] The 48 public health topics are stated to come from high-quality sources, yet no details are given on selection criteria, stratification by chronic vs. non-chronic status, or balancing for confounders such as topic complexity, entity density, question phrasing, or prevalence in training data. This information is load-bearing for the claim that the 13.8% performance drop constitutes condition-based algorithmic discrimination rather than uncontrolled differences in topic structure or data coverage.
  2. [Evaluation Framework (Section 3)] The operational definitions and scoring procedures for the four VB-Score components are not specified. In particular, the rubric or verification process used to assess factual consistency against source material, the criteria for entity recognition accuracy, and the method for scoring structured completeness are absent. These details are required to interpret the reported 'severe performance failures' and to allow replication or independent validation.
  3. [Results (Section 4)] The 13.8% lower performance on chronic-condition topics is presented without accompanying statistical support, including confidence intervals, hypothesis tests, or regression models that control for topic-level variables. Absent such analysis, the attribution of the gap to algorithmic discrimination remains under-supported and could be explained by the unmeasured factors identified in the skeptic note.
minor comments (3)
  1. [Abstract] The abstract refers to 'rigorous reviews' of model performance; a short description of the review protocol (e.g., number of reviewers, blinding) would improve transparency.
  2. [Methods] Clarify whether the three LLMs were evaluated under identical prompting conditions and whether any systematic prompt-engineering ablation was performed beyond the general statement that it does not compensate for architectural limits.
  3. [Results] Tables or figures reporting component scores should include per-topic sample sizes, standard errors, and clear labeling of which topics fall into the chronic-condition category.
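
The statistical support requested in major comment 3 could start from something like the sketch below, which puts a percentile bootstrap interval and a permutation p-value around a relative gap. Everything here is an assumption for illustration: the per-topic scores are synthetic, the function names are hypothetical, and the gap is defined as the chronic-topic mean relative to the overall mean, following the abstract's wording ("compared to an overall average").

```python
import random
from statistics import mean


def relative_gap(chronic, other):
    """Chronic-topic mean relative to the overall mean (negative = lower)."""
    overall = mean(chronic + other)
    return (mean(chronic) - overall) / overall


def bootstrap_ci(chronic, other, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling topics within each group."""
    rng = random.Random(seed)
    gaps = sorted(
        relative_gap(rng.choices(chronic, k=len(chronic)),
                     rng.choices(other, k=len(other)))
        for _ in range(n_boot)
    )
    lo = gaps[int(n_boot * alpha / 2)]
    hi = gaps[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


def permutation_p_value(chronic, other, n_perm=2000, seed=0):
    """Two-sided permutation test on the absolute relative gap."""
    rng = random.Random(seed)
    observed = abs(relative_gap(chronic, other))
    pooled = chronic + other  # new list, so the inputs are not mutated
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_gap = relative_gap(pooled[:len(chronic)], pooled[len(chronic):])
        if abs(perm_gap) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic per-topic scores; none of these numbers come from the paper.
chronic_scores = [0.30, 0.32, 0.31, 0.29, 0.33, 0.30]
other_scores = [0.40, 0.42, 0.41, 0.39, 0.43, 0.40, 0.41, 0.42]

gap = relative_gap(chronic_scores, other_scores)
ci_low, ci_high = bootstrap_ci(chronic_scores, other_scores)
p_value = permutation_p_value(chronic_scores, other_scores)
print(f"relative gap = {gap:.1%}, 95% CI = ({ci_low:.1%}, {ci_high:.1%})")
print(f"permutation p-value = {p_value:.4f}")
```

Resampling at the topic level treats the topic as the unit of analysis, which fits a benchmark of this size; it does not address the confounding raised in major comment 1, which needs covariate adjustment rather than a significance test.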

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive feedback on our work. We address each of the major comments in turn and describe the revisions planned for the manuscript.

point-by-point responses
  1. Referee: [Abstract and Methods] The 48 public health topics are stated to come from high-quality sources, yet no details are given on selection criteria, stratification by chronic vs. non-chronic status, or balancing for confounders such as topic complexity, entity density, question phrasing, or prevalence in training data. This information is load-bearing for the claim that the 13.8% performance drop constitutes condition-based algorithmic discrimination rather than uncontrolled differences in topic structure or data coverage.

    Authors: We agree that explicit details on the selection and characteristics of the 48 topics are essential. In the revised manuscript, we will insert a detailed Methods subsection describing the authoritative sources used, the criteria for topic selection, the stratification into chronic and non-chronic categories, and balancing for observable confounders including topic complexity, entity density, and question phrasing. We note, however, that controlling for prevalence in the LLMs' training data is not possible given the lack of transparency from model developers regarding their training corpora. revision: partial

  2. Referee: [Evaluation Framework (Section 3)] The operational definitions and scoring procedures for the four VB-Score components are not specified. In particular, the rubric or verification process used to assess factual consistency against source material, the criteria for entity recognition accuracy, and the method for scoring structured completeness are absent. These details are required to interpret the reported 'severe performance failures' and to allow replication or independent validation.

    Authors: We acknowledge that the operational definitions and scoring procedures were insufficiently detailed in the original submission. The revised manuscript will substantially expand Section 3 to include precise operational definitions, scoring rubrics, and verification processes for each VB-Score component. This will encompass the entity recognition criteria, the step-by-step factual consistency verification against source material (with examples), and the structured completeness scoring method, enabling full replication and validation. revision: yes

  3. Referee: [Results (Section 4)] The 13.8% lower performance on chronic-condition topics is presented without accompanying statistical support, including confidence intervals, hypothesis tests, or regression models that control for topic-level variables. Absent such analysis, the attribution of the gap to algorithmic discrimination remains under-supported and could be explained by the unmeasured factors identified in the skeptic note.

    Authors: We agree that statistical support is required to substantiate the performance disparity. In the revised Results section, we will report confidence intervals around the 13.8% figure, conduct hypothesis tests to compare performance on chronic versus non-chronic topics, and perform regression analyses that control for topic-level variables such as complexity and entity density. These additions will provide a more robust basis for interpreting the gap in terms of condition-based algorithmic discrimination. revision: yes

standing simulated objections not resolved
  • Controlling for the prevalence of chronic condition topics in the proprietary training data of the evaluated LLMs.

Circularity Check

0 steps flagged

No circularity: empirical evaluation relies on external authoritative references

full rationale

The paper defines VB-Score as a four-component framework (entity recognition, semantic similarity, factual consistency, structured completeness) and applies it to model outputs compared against 48 topics drawn from independent high-quality authoritative sources. Performance scores and the reported 13.8% disparity are computed directly from these external comparisons rather than from any fitted parameters, self-referential definitions, or prior self-citations that reduce the result to the inputs by construction. No equations, uniqueness theorems, or ansatzes appear in the derivation chain; the interpretive claim of condition-based discrimination follows from the observed empirical gaps without mathematical reduction to the framework itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions: that the four components capture the medically relevant dimensions of QA quality, and that the chosen authoritative sources constitute unbiased ground truth. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: The four components (entity recognition, semantic similarity, factual consistency, structured information completeness) are the appropriate and sufficient dimensions for assessing medical accuracy and equity risks.
    The entire framework and all reported disparities are defined in terms of these components.
  • domain assumption: The 48 topics drawn from high-quality authoritative sources are representative of public health questions and free of selection bias relative to chronic conditions in older and minority populations.
    The 13.8% disparity claim depends on this representativeness.

pith-pipeline@v0.9.0 · 5605 in / 1687 out tokens · 47483 ms · 2026-05-10T02:12:15.740551+00:00 · methodology
