
arxiv: 2605.19173 · v1 · pith:A2Z34GERnew · submitted 2026-05-18 · 💻 cs.CL

Prompting language influences diagnostic reasoning and accuracy of large language models

Pith reviewed 2026-05-20 10:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language models · diagnostic reasoning · prompting language · clinical vignettes · English French comparison · medical AI · LLM evaluation · clinical decision support

The pith

Most of the large language models tested achieve higher diagnostic accuracy and better reasoning when prompted in English than in French.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether the language of the prompt changes how well large language models reason through medical cases and reach correct diagnoses. The same 180 clinical vignettes spanning 16 specialties were presented to five models once in English and once in French, then scored by physicians on an 18-point scale that covered both the final diagnosis and the quality of the reasoning steps. Four of the five models scored noticeably higher in English, with the advantage visible in differential diagnosis, logical structure, and internal consistency. The findings indicate that language choice can shape the reliability of LLMs used for clinical decision support, particularly in settings where English is not the primary language.

Core claim

The study compared English and French prompting on diagnostic reasoning and accuracy using 180 vignettes across 16 specialties scored on an 18-point scale by two physicians. Four of the five models performed better in English with mean differences of 0.37 to 0.91 (adjusted p < 0.05). The performance gap included multiple aspects of reasoning such as differential diagnosis, logical structure, and internal validity. Only the o3 model showed no overall language effect.

What carries the argument

Direct comparison of identical clinical vignettes presented in English versus French, evaluated with a structured 18-point physician scoring rubric that assesses both diagnosis accuracy and reasoning quality.
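
The paired design described above can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: the function names, the choice to average the two physicians' totals, and all scores are hypothetical assumptions made only to show how a per-model mean English-minus-French difference on the 18-point scale could be computed.

```python
from statistics import mean

MAX_SCORE = 18  # total points on the physician rubric described in the text


def vignette_score(rater_a: float, rater_b: float) -> float:
    """Combine two physicians' totals for one model output.

    Averaging is an assumption; the source does not say how the two
    raters' scores were combined.
    """
    for score in (rater_a, rater_b):
        if not 0 <= score <= MAX_SCORE:
            raise ValueError(f"score {score} is outside 0..{MAX_SCORE}")
    return (rater_a + rater_b) / 2


def mean_paired_difference(english: list[float], french: list[float]) -> float:
    """Mean of per-vignette (English - French) differences for one model."""
    if not english or len(english) != len(french):
        raise ValueError("need one English and one French score per vignette")
    return mean(e - f for e, f in zip(english, french))


# Hypothetical scores for three vignettes answered by one model.
english = [vignette_score(15, 14), vignette_score(12, 13), vignette_score(17, 16)]
french = [vignette_score(14, 14), vignette_score(12, 12), vignette_score(16, 15)]

diff = mean_paired_difference(english, french)
print(f"mean English - French difference: {diff:.2f} points")
```

Because each vignette appears in both languages, the difference is taken within a vignette before averaging, which is what makes the comparison paired.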

If this is right

  • Prompting language remains a critical determinant of LLM clinical performance.
  • The performance gap spans differential diagnosis, logical structure, and internal validity.
  • Equitable linguistico-cultural deployment of LLMs worldwide requires attention to language effects.
  • Only o3 showed no overall language effect; the other four models scored higher in English.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers may need to balance training data across languages to narrow these performance gaps.
  • Similar language effects could appear when using other non-English languages not tested here.
  • Clinical applications in French-speaking regions might require language-specific prompting strategies or model selection.

Load-bearing premise

The 180 clinical vignettes and the 18-point physician scoring scale provide an unbiased, language-neutral measure of diagnostic performance that generalizes beyond the specific cases chosen.

What would settle it

A replication using a fresh set of clinical vignettes or a different physician scoring method that finds no consistent English advantage across the same models would undermine the central claim.

Figures

Figures reproduced from arXiv: 2605.19173 by Adrien Bazoge, Josselin Corvellec, Pierre-Antoine Gourraud, Sofiane Djillali Sid-Ahmed.

Figure 1: Pairwise comparison of model performance between English and French prompt
Figure 2: Effect of prompting language on overall model performance.
Figure 3: Detailed performance across evaluation criteria by language.
Figure 4: Summarized model performances. Histograms showing the performance of o3, DeepSeek-R1, Llama-3.1-405B-Instruct, GPT-4-Turbo and BioMistral-7B considering prompting of the vignette in French and English. Models are evaluated on a scale of 18 points by two physicians. The red line indicates the mean performance of each model.

Original abstract

Large language models (LLMs) are increasingly explored for clinical decision support, yet most evaluations are conducted in English, leaving their reliability in other languages uncertain. Here we evaluate the impact of prompting language on diagnostic reasoning and final diagnosis accuracy by comparing English and French performance across five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, and BioMistral-7B). A total of 180 clinical vignettes covering 16 medical specialties were assessed by two physicians using an 18-point scale evaluating both diagnosis accuracy and reasoning quality. Four of the five models performed better in English (mean difference 0.37-0.91, adjusted p < 0.05), with the gap spanning multiple aspects of reasoning, including differential diagnosis, logical structure, and internal validity. o3 was the only model showing no overall language effect. These findings demonstrate that prompting language remains a critical determinant of LLM clinical performance, with implications for equitable linguistico-cultural deployment worldwide.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that prompting five LLMs (o3, DeepSeek-R1, GPT-4-Turbo, Llama-3.1-405B-Instruct, BioMistral-7B) in English versus French on 180 clinical vignettes across 16 specialties yields higher diagnostic reasoning and accuracy scores in English for four models (mean differences 0.37-0.91, adjusted p < 0.05), as evaluated by two physicians on an 18-point scale covering diagnosis accuracy, differential diagnosis, logical structure, and internal validity; only o3 showed no overall language effect.

Significance. If the result holds after methodological clarification, the work is significant for highlighting language as a determinant of LLM clinical performance and for underscoring risks to equitable deployment in non-English medical settings. The multi-model design and physician-based scoring on real-world vignettes add practical value, though the absence of reported controls (translation validation, rater blinding, inter-rater reliability) limits immediate generalizability.

major comments (3)
  1. [Methods] Methods (vignette construction and translation): The manuscript provides no details on vignette selection criteria, whether vignettes originated in English and were translated to French, or any equivalence validation (back-translation, professional medical review, or cultural adaptation). This is load-bearing for the central claim because unvalidated translations could systematically alter perceived logical structure or internal validity, confounding the reported mean differences of 0.37-0.91.
  2. [Methods] Methods (scoring and reliability): No information is given on inter-rater reliability for the 18-point physician scale or on whether raters were blinded to output language. Without these, it is unclear whether the observed advantages in differential diagnosis and reasoning quality reflect model capability or scoring artifacts tied to language-specific phrasing.
  3. [Results] Results (statistical reporting): The abstract states adjusted p < 0.05 but the paper does not specify the exact tests, multiple-comparison corrections, or adjustments for vignette difficulty or model factors. This weakens confidence that the language effect is robust rather than an artifact of analysis choices.
minor comments (2)
  1. [Methods] The prompting templates for each language and model should be reproduced verbatim in the supplement to allow exact replication.
  2. [Results] A figure or table presenting per-model, per-language scores would improve clarity over summary statistics alone.
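
Major comment 2 asks for inter-rater reliability on the 18-point scale. One common choice for two raters scoring the same outputs is the intraclass correlation coefficient; the sketch below computes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1). The function name and the ratings are hypothetical, and nothing here is taken from the paper: it only illustrates what such a statistic involves.

```python
from statistics import mean


def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per scored output and one column per rater.
    The statistic is undefined (division by zero) if every rating is equal.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 outputs and 2 raters, no missing cells")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical totals from two physicians for five model outputs.
example = [[15, 14], [12, 13], [17, 16], [9, 10], [14, 14]]
print(f"ICC(2,1) = {icc_2_1(example):.3f}")
```

The absolute-agreement form penalises a rater who is systematically stricter, which matters when the two physicians' totals are combined into a single score.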

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas where additional methodological transparency will strengthen the manuscript. We address each major comment below and will incorporate the necessary clarifications in the revised version.

  1. Referee: [Methods] Methods (vignette construction and translation): The manuscript provides no details on vignette selection criteria, whether vignettes originated in English and were translated to French, or any equivalence validation (back-translation, professional medical review, or cultural adaptation). This is load-bearing for the central claim because unvalidated translations could systematically alter perceived logical structure or internal validity, confounding the reported mean differences of 0.37-0.91.

    Authors: We agree that the Methods section requires greater detail on vignette construction and translation to support the central claims. In the revised manuscript, we will add a new subsection specifying that the 180 vignettes were drawn from publicly available English-language clinical case repositories used in medical education, that all vignettes originated in English, and that French versions were produced via professional medical translation followed by independent back-translation and review by a second bilingual physician to confirm clinical equivalence and preserve logical structure. These additions will directly address concerns about potential translation-induced confounds. revision: yes

  2. Referee: [Methods] Methods (scoring and reliability): No information is given on inter-rater reliability for the 18-point physician scale or on whether raters were blinded to output language. Without these, it is unclear whether the observed advantages in differential diagnosis and reasoning quality reflect model capability or scoring artifacts tied to language-specific phrasing.

    Authors: We acknowledge the absence of these details in the submitted manuscript. The two physician raters were blinded to both model identity and prompting language throughout the scoring process. In the revision, we will report inter-rater reliability using the intraclass correlation coefficient for the total 18-point score and for each subscale, along with a description of the blinding protocol. Should reliability fall below conventional thresholds, we will note this as a limitation and discuss its implications for interpretation. revision: yes

  3. Referee: [Results] Results (statistical reporting): The abstract states adjusted p < 0.05 but the paper does not specify the exact tests, multiple-comparison corrections, or adjustments for vignette difficulty or model factors. This weakens confidence that the language effect is robust rather than an artifact of analysis choices.

    Authors: We appreciate this observation on statistical transparency. The analyses used paired Wilcoxon signed-rank tests comparing English and French scores within each model, with Bonferroni correction applied across the five models. No vignette-level difficulty covariates were included because every vignette was presented in both languages, so each vignette serves as its own control in the paired comparison; we will explicitly state the test procedures, report exact p-values and effect sizes, and note the lack of explicit difficulty adjustment as a potential limitation in the revised Results and Discussion sections. revision: yes
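
The procedure named in the third response (a paired Wilcoxon signed-rank test per model, Bonferroni-corrected across five models) can be sketched as follows. This is an illustrative standard-library version under stated assumptions, not the authors' code: it drops zero differences, uses average ranks for ties, and relies on the normal approximation without continuity correction. The function names, model labels, and p-values are hypothetical.

```python
import math


def wilcoxon_signed_rank(x: list[float], y: list[float]) -> tuple[float, float]:
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Returns (W_plus, p_value). Zero differences are discarded.
    """
    if len(x) != len(y):
        raise ValueError("x and y must be paired")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg_rank
        t = j - i + 1
        tie_term += t**3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected
    z = (w_plus - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


def bonferroni(p_values: list[float]) -> list[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values, one per model (labels are placeholders).
raw = {"model_A": 0.004, "model_B": 0.009, "model_C": 0.2, "model_D": 0.001, "model_E": 0.03}
adjusted = dict(zip(raw, bonferroni(list(raw.values()))))
print(adjusted)
```

With 180 paired vignettes per model the normal approximation is a reasonable simplification; an exact or library implementation would be preferable for small samples.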

Circularity Check

0 steps flagged

No circularity: direct empirical comparison without derivations or self-referential predictions

This is a straightforward empirical evaluation study that compares LLM diagnostic performance across languages using 180 clinical vignettes scored by physicians on an 18-point scale. No mathematical derivations, equations, fitted parameters, or model-based predictions are present; results are reported as observed mean differences with statistical tests. The central claim rests on direct measurement rather than any chain that reduces to its own inputs by construction, and the study is self-contained against external benchmarks with no load-bearing self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the representativeness of the vignette set and the reliability of the physician scoring process, which are treated as given without further justification in the abstract.

axioms (1)
  • domain assumption: The selected clinical vignettes are representative of real-world cases across 16 medical specialties and free of language-specific cultural bias.
    Invoked to allow generalization from the 180 cases to broader clinical use.

pith-pipeline@v0.9.0 · 5724 in / 1181 out tokens · 44270 ms · 2026-05-20T10:21:00.871332+00:00 · methodology


