Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
Pith reviewed 2026-05-10 02:12 UTC · model grok-4.3
The pith
A new component-wise evaluation framework shows that large language models for medical questions have severe accuracy failures and equity disparities not visible through semantic similarity alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its VB-Score framework, which scores entity recognition, semantic similarity, factual consistency, and structured information completeness as independent components, exposes a major discrepancy between semantic accuracy and entity accuracy. All three tested models show almost uniformly severe performance failures across 48 public health topics, and most of them score 13.8 percent below the overall average on chronic-condition topics linked to older and minority populations, which the authors interpret as condition-based algorithmic discrimination. The authors further report that prompt engineering does not compensate for architectural limitations in medical entity handling.
What carries the argument
VB-Score, a verification-based evaluation framework that separately measures entity recognition, semantic similarity, factual consistency, and structured information completeness in answers to medical questions.
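One way to picture component-wise scoring is the sketch below. It is illustrative only: this page does not give the paper's operational definitions, so the `ComponentScores` container, the set-overlap `entity_f1` measure, and every number are assumptions made for the example, not the authors' method.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    """Four component scores in [0, 1], kept separate rather than averaged.

    Hypothetical container; the field names mirror the four components
    named in the text, not any published VB-Score implementation.
    """

    entity_recognition: float
    semantic_similarity: float
    factual_consistency: float
    structured_completeness: float

    def semantic_entity_gap(self) -> float:
        """How much semantic similarity overstates entity accuracy."""
        return self.semantic_similarity - self.entity_recognition


def entity_f1(predicted: set[str], reference: set[str]) -> float:
    """Set-overlap F1 between entities in an answer and in the reference.

    Assumption: entities are already extracted and normalised to strings.
    """
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Made-up example: a fluent answer that misses most reference entities.
reference_entities = {"metformin", "hba1c", "type 2 diabetes", "retinopathy"}
answer_entities = {"type 2 diabetes", "insulin"}

scores = ComponentScores(
    entity_recognition=entity_f1(answer_entities, reference_entities),
    semantic_similarity=0.85,      # placeholder value, not a measured score
    factual_consistency=0.50,      # placeholder value
    structured_completeness=0.25,  # placeholder value
)
print(f"semantic-entity gap: {scores.semantic_entity_gap():.2f}")
```

Keeping the four fields separate, instead of collapsing them into one number, is what lets a high semantic score and a low entity score show up side by side.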
Load-bearing premise
The 48 public health topics form a representative sample and the observed performance drops on chronic-condition topics arise from the conditions and their demographic links rather than from differences in topic complexity, training data coverage, or how the component scores were defined.
What would settle it
A controlled re-test showing that performance on chronic-condition topics equals the overall average after adjusting for topic complexity, training data coverage, and scoring definitions would falsify the claim of condition-based algorithmic discrimination.
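A simple form of such an adjustment is stratification: compare chronic and non-chronic topics only within groups of similar complexity, then pool the within-group gaps. The sketch below uses invented scores and a hypothetical two-level complexity label; it is not the paper's analysis, only an illustration of how a raw gap can vanish once a confounder is held fixed.

```python
from collections import defaultdict
from statistics import mean


def raw_gap(rows):
    """Mean chronic score minus mean non-chronic score, ignoring strata."""
    chronic = [score for score, is_chronic, _ in rows if is_chronic]
    other = [score for score, is_chronic, _ in rows if not is_chronic]
    return mean(chronic) - mean(other)


def stratified_gap(rows):
    """Size-weighted mean of within-stratum gaps (chronic minus non-chronic).

    rows: iterable of (score, is_chronic, stratum) tuples.
    Strata that lack one of the two groups are skipped.
    """
    by_stratum = defaultdict(lambda: {True: [], False: []})
    for score, is_chronic, stratum in rows:
        by_stratum[stratum][bool(is_chronic)].append(score)

    weighted_sum = 0.0
    total = 0
    for groups in by_stratum.values():
        if not groups[True] or not groups[False]:
            continue
        size = len(groups[True]) + len(groups[False])
        weighted_sum += size * (mean(groups[True]) - mean(groups[False]))
        total += size
    if total == 0:
        raise ValueError("no stratum contains both chronic and non-chronic topics")
    return weighted_sum / total


# Invented data: chronic topics are over-represented among complex topics,
# and complexity alone drives the score in this toy example.
topic_rows = [
    (0.80, False, "low"), (0.80, False, "low"), (0.80, False, "low"),
    (0.80, True, "low"),
    (0.50, False, "high"),
    (0.50, True, "high"), (0.50, True, "high"), (0.50, True, "high"),
]

print(f"raw gap:        {raw_gap(topic_rows):+.3f}")
print(f"stratified gap: {stratified_gap(topic_rows):+.3f}")
```

In this toy data the raw comparison shows chronic topics scoring lower, while the stratified comparison shows no gap at all. A real analysis would need measured complexity and the other covariates named above.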
Original abstract
The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the VB-Score framework for evaluating medical question-answering LLMs by decomposing performance into four components: entity recognition, semantic similarity, factual consistency, and structured information completeness. It applies this framework to three widely used LLMs across 48 public health topics drawn from authoritative sources. The authors report uniformly severe performance failures in all models, a 13.8% performance reduction (relative to overall average) on chronic-condition topics affecting older and minority populations, and interpret this as evidence of condition-based algorithmic discrimination. They conclude that semantic-only metrics are insufficient for medical AI safety assessment and that prompt engineering cannot overcome core architectural limitations.
Significance. If the component scores prove reliable and the reported disparity survives controls for topic characteristics, the work would be significant for health informatics and AI ethics. It offers a more granular alternative to semantic similarity metrics and draws attention to potential equity risks in LLM-generated public health information. The framework could inform safer evaluation practices for medical AI, though its impact depends on addressing the methodological gaps noted below.
major comments (3)
- [Abstract and Methods] The 48 public health topics are stated to come from high-quality sources, yet no details are given on selection criteria, stratification by chronic vs. non-chronic status, or balancing for confounders such as topic complexity, entity density, question phrasing, or prevalence in training data. This information is load-bearing for the claim that the 13.8% performance drop constitutes condition-based algorithmic discrimination rather than uncontrolled differences in topic structure or data coverage.
- [Evaluation Framework (Section 3)] The operational definitions and scoring procedures for the four VB-Score components are not specified. In particular, the rubric or verification process used to assess factual consistency against source material, the criteria for entity recognition accuracy, and the method for scoring structured completeness are absent. These details are required to interpret the reported 'severe performance failures' and to allow replication or independent validation.
- [Results (Section 4)] The 13.8% lower performance on chronic-condition topics is presented without accompanying statistical support, including confidence intervals, hypothesis tests, or regression models that control for topic-level variables. Absent such analysis, the attribution of the gap to algorithmic discrimination remains under-supported and could be explained by the unmeasured factors identified in the skeptic note.
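The kind of statistical support requested in the third comment can be sketched with a percentile bootstrap interval and a two-sided permutation test on the chronic versus non-chronic gap. The per-topic scores below are invented, and the function names, resample counts, and seeds are choices made for this example; the paper's own data and analysis are not shown on this page.

```python
import random
from statistics import mean


def mean_gap(chronic, other):
    """Mean chronic score minus mean non-chronic score."""
    return mean(chronic) - mean(other)


def bootstrap_ci(chronic, other, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling each group."""
    rng = random.Random(seed)
    gaps = sorted(
        mean_gap(
            rng.choices(chronic, k=len(chronic)),
            rng.choices(other, k=len(other)),
        )
        for _ in range(n_boot)
    )
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


def permutation_p_value(chronic, other, n_perm=2000, seed=0):
    """Two-sided permutation p-value with add-one smoothing."""
    rng = random.Random(seed)
    observed = abs(mean_gap(chronic, other))
    pooled = list(chronic) + list(other)
    n_chronic = len(chronic)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled_gap = mean_gap(pooled[:n_chronic], pooled[n_chronic:])
        if abs(shuffled_gap) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Invented per-topic scores, for illustration only.
chronic_scores = [0.40, 0.42, 0.45, 0.43, 0.41, 0.44]
other_scores = [0.55, 0.58, 0.60, 0.57, 0.56, 0.59]

ci_lower, ci_upper = bootstrap_ci(chronic_scores, other_scores)
p_value = permutation_p_value(chronic_scores, other_scores)
print(f"gap = {mean_gap(chronic_scores, other_scores):+.3f}")
print(f"95% bootstrap CI = [{ci_lower:+.3f}, {ci_upper:+.3f}]")
print(f"permutation p-value = {p_value:.4f}")
```

Neither procedure controls for topic-level confounders; they only quantify uncertainty in the unadjusted gap, so a regression or stratified analysis would still be needed for the discrimination claim.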
minor comments (3)
- [Abstract] The abstract refers to 'rigorous reviews' of model performance; a short description of the review protocol (e.g., number of reviewers, blinding) would improve transparency.
- [Methods] Clarify whether the three LLMs were evaluated under identical prompting conditions and whether any systematic prompt-engineering ablation was performed beyond the general statement that it does not compensate for architectural limits.
- [Results] Tables or figures reporting component scores should include per-topic sample sizes, standard errors, and clear labeling of which topics fall into the chronic-condition category.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our work. We address each of the major comments in turn and describe the revisions planned for the manuscript.
- Referee: [Abstract and Methods] The 48 public health topics are stated to come from high-quality sources, yet no details are given on selection criteria, stratification by chronic vs. non-chronic status, or balancing for confounders such as topic complexity, entity density, question phrasing, or prevalence in training data. This information is load-bearing for the claim that the 13.8% performance drop constitutes condition-based algorithmic discrimination rather than uncontrolled differences in topic structure or data coverage.
  Authors: We agree that explicit details on the selection and characteristics of the 48 topics are essential. In the revised manuscript, we will insert a detailed Methods subsection describing the authoritative sources used, the criteria for topic selection, the stratification into chronic and non-chronic categories, and balancing for observable confounders including topic complexity, entity density, and question phrasing. We note, however, that controlling for prevalence in the LLMs' training data is not possible given the lack of transparency from model developers regarding their training corpora.
  Revision: partial
- Referee: [Evaluation Framework (Section 3)] The operational definitions and scoring procedures for the four VB-Score components are not specified. In particular, the rubric or verification process used to assess factual consistency against source material, the criteria for entity recognition accuracy, and the method for scoring structured completeness are absent. These details are required to interpret the reported 'severe performance failures' and to allow replication or independent validation.
  Authors: We acknowledge that the operational definitions and scoring procedures were insufficiently detailed in the original submission. The revised manuscript will substantially expand Section 3 to include precise operational definitions, scoring rubrics, and verification processes for each VB-Score component. This will encompass the entity recognition criteria, the step-by-step factual consistency verification against source material (with examples), and the structured completeness scoring method, enabling full replication and validation.
  Revision: yes
- Referee: [Results (Section 4)] The 13.8% lower performance on chronic-condition topics is presented without accompanying statistical support, including confidence intervals, hypothesis tests, or regression models that control for topic-level variables. Absent such analysis, the attribution of the gap to algorithmic discrimination remains under-supported and could be explained by the unmeasured factors identified in the skeptic note.
  Authors: We agree that statistical support is required to substantiate the performance disparity. In the revised Results section, we will report confidence intervals around the 13.8% figure, conduct hypothesis tests to compare performance on chronic versus non-chronic topics, and perform regression analyses that control for topic-level variables such as complexity and entity density. These additions will provide a more robust basis for interpreting the gap in terms of condition-based algorithmic discrimination.
  Revision: yes
Not addressed by the planned revisions:
- Controlling for the prevalence of chronic-condition topics in the proprietary training data of the evaluated LLMs.
Circularity Check
No circularity: empirical evaluation relies on external authoritative references
The paper defines VB-Score as a four-component framework (entity recognition, semantic similarity, factual consistency, structured completeness) and applies it to model outputs compared against 48 topics drawn from independent high-quality authoritative sources. Performance scores and the reported 13.8% disparity are computed directly from these external comparisons rather than from any fitted parameters, self-referential definitions, or prior self-citations that reduce the result to the inputs by construction. No equations, uniqueness theorems, or ansatzes appear in the derivation chain; the interpretive claim of condition-based discrimination follows from the observed empirical gaps without mathematical reduction to the framework itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The four components (entity recognition, semantic similarity, factual consistency, structured information completeness) are the appropriate and sufficient dimensions for assessing medical accuracy and equity risks.
- domain assumption: The 48 topics drawn from high-quality authoritative sources are representative of public health questions and free of selection bias relative to chronic conditions in older and minority populations.