Expert Evaluation of Clinical AI Tools on Real Point-of-Care Clinical Queries
Pith reviewed 2026-06-30 09:26 UTC · model grok-4.3
The pith
A specialized clinical AI tool outperforms three general-purpose models by 25 to 39 percentage points when physicians judge answers to real point-of-care questions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the Real-POCQi set of 620 real-world point-of-care queries, blinded specialty-matched physicians scored the specialized clinical tool highest across all five dimensions of clinical decision support, with win differences ranging from 25 to 39 percentage points over the general-purpose models; these margins remained consistent in sensitivity analyses and on the HealthBench questions.
What carries the argument
Blinded head-to-head comparison of tool outputs on the Real-POCQi benchmark of real physician-submitted queries, scored by 149 specialty-matched practicing physicians.
If this is right
- Evaluations of clinical AI should draw from real query distributions rather than hypothetical or exam-style questions.
- Specialty-matched expert judges can detect larger performance gaps than general evaluators.
- Targeted engineering and customization can produce measurable gains on dimensions such as source quality and verifiability.
- LLM judges and expert judges reach similar top-model rankings even while differing systematically in their assessments.
- The advantage of the specialized tool holds across checks for citation display, answer length, and query source.
Where Pith is reading between the lines
- Benchmarks built only on exam questions may miss the distribution shifts that matter most for actual clinical use.
- Public release of Real-POCQi allows ongoing tracking of whether general models close the gap on verifiability and completeness over time.
- If the observed margins persist in live deployment, hospitals may need to weigh specialization when selecting decision-support tools.
Load-bearing premise
The 620 queries and 149 physician graders form a representative and unbiased sample of real clinical needs and judgments, with blinding sufficient to prevent favoritism.
What would settle it
A larger replication that draws queries from additional clinical platforms, or that uses unblinded graders, and finds no significant win difference would undermine the reported margins.
Original abstract
Physicians now pose millions of clinical questions to AI tools each week, yet these tools are evaluated largely on hypothetical or exam-style questions, not those actually asked in practice. We report a blinded evaluation built on 620 Real-world Point-Of-Care Queries (Real-POCQi) submitted to the OpenEvidence (OE) platform by physicians spanning 30 specialties, as well as 187 questions from HealthBench. 149 practicing physicians across 36 states made head-to-head comparisons between answers from three frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) and a specialized clinical tool (OE), with graders matched to each question's specialty. When comparing answers along five dimensions relevant to clinical decision support -- accuracy, clinical utility, source quality, verifiability, & completeness -- physicians scored the specialized tool highest on all axes; in the primary analysis on Real-POCQi, win differences (margins between win and loss rates) ranged from 25 to 39 percentage points (p<0.001). Results remained consistent in sensitivity analyses stratifying by citation display, answer length, OE-user status, and Real-POCQi versus HealthBench. In parallel, LLM judges were found to systematically differ from expert judges, though both generally agreed on the best model. These findings underscore two conclusions: (i) AI tool evaluations should reflect real-world query distributions and use expert judges that mirror the specialization defining modern medicine and (ii) the consistent advantage of the specialized tool over general-purpose models does not necessarily mean that the latter cannot serve similar purposes, but that targeted engineering and customization can yield meaningful gains in performance for its users. We release Real-POCQi as a public benchmark, as well as the prespecified statistical analysis for reproducing results of this study.
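The abstract defines the win difference as the margin between win and loss rates. A minimal Python sketch of that arithmetic follows. The function name, the "win"/"loss"/"tie" labels, the choice to keep ties in the denominator, and the example tallies are all assumptions made for illustration; they are not taken from the paper or its released analysis.

```python
from collections import Counter


def win_difference(outcomes):
    """Return (win_rate, loss_rate, win_difference) in percentage points.

    `outcomes` holds one label per head-to-head comparison, each "win",
    "loss", or "tie" from the specialized tool's perspective.
    Assumption: ties count toward the denominator of both rates.
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("no comparisons supplied")
    counts = Counter(outcomes)
    win_rate = 100.0 * counts["win"] / total
    loss_rate = 100.0 * counts["loss"] / total
    return win_rate, loss_rate, win_rate - loss_rate


# Hypothetical tallies, not data from the study.
example = ["win"] * 55 + ["loss"] * 25 + ["tie"] * 20
print(win_difference(example))
```

With these made-up tallies the tool wins 55% and loses 25% of comparisons, so the win difference is 30 percentage points.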
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports a blinded head-to-head expert evaluation of the specialized clinical AI tool OpenEvidence (OE) against three general-purpose frontier models (Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.5) on 620 real point-of-care queries (Real-POCQi) plus 187 HealthBench questions. 149 specialty-matched physicians rated answers on accuracy, clinical utility, source quality, verifiability, and completeness; OE won by 25–39 percentage-point margins (p<0.001) on Real-POCQi, with results stable in sensitivity analyses. The authors conclude that real-world query distributions and expert judges are essential for clinical AI evaluation and that targeted engineering yields measurable gains. They release Real-POCQi and the prespecified analysis plan.
Significance. If the superiority claim holds under rigorous blinding, the work provides direct evidence that specialized clinical tools can outperform general-purpose models on authentic physician queries, supporting the broader argument that evaluation protocols must use real query distributions and domain-matched expert raters rather than exam-style items. The public release of Real-POCQi and the analysis plan is a concrete contribution that enables future replication and benchmarking.
major comments (2)
- [Abstract/Methods] Abstract and Methods (blinding protocol): The headline 25–39 pp win margins rest on the assumption that the 149 graders could not identify tool origin. The manuscript states only that the evaluation was “blinded,” with no description of answer reformatting, removal of citation-style signatures, length normalization, or post-hoc de-blinding checks. Because source quality and verifiability are two of the five scored axes—precisely the dimensions on which OE is engineered to differ—this omission leaves open the possibility that graders de-blinded and favored OE, directly threatening the causal interpretation of the reported differences.
- [Methods] Methods (query sampling and rater reliability): No details are provided on how the 620 Real-POCQi queries were sampled from the OE platform, what exclusion criteria were applied, or how inter-rater reliability was quantified among the 149 specialty-matched physicians. These omissions are load-bearing for the claim that the sample is representative of real clinical decision-support needs.
minor comments (2)
- [Results] The sensitivity analyses stratifying by citation display and answer length are helpful but would be strengthened by reporting the exact distribution of answer lengths and citation counts per model.
- [Results] The statement that “LLM judges were found to systematically differ from expert judges” would benefit from a quantitative comparison (e.g., agreement rates or rank correlations) rather than a qualitative summary.
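The quantitative comparison requested in the last minor comment could take a form like the sketch below: a raw agreement rate over per-comparison choices, and a Spearman rank correlation over model rankings. The function names, labels, and rankings are hypothetical, and the Spearman formula used here assumes rankings without ties. Nothing in this block reflects how the paper compared its judges.

```python
def agreement_rate(expert_choices, llm_choices):
    """Fraction of comparisons where both judges picked the same option."""
    if len(expert_choices) != len(llm_choices) or not expert_choices:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(e == m for e, m in zip(expert_choices, llm_choices))
    return matches / len(expert_choices)


def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings without ties.

    Each argument maps a model name to its rank (1 = best).
    """
    models = sorted(rank_a)
    if models != sorted(rank_b):
        raise ValueError("rankings must cover the same models")
    n = len(models)
    if n < 2:
        raise ValueError("need at least two models")
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in models)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical labels and rankings, for illustration only.
expert = ["A", "A", "B", "A", "tie"]
llm = ["A", "B", "B", "A", "A"]
print(agreement_rate(expert, llm))

expert_rank = {"tool_w": 1, "tool_x": 2, "tool_y": 3, "tool_z": 4}
llm_rank = {"tool_w": 1, "tool_x": 3, "tool_y": 2, "tool_z": 4}
print(spearman_rho(expert_rank, llm_rank))
```

In this toy example both judges agree on the best and worst models while swapping the middle two, which is the kind of pattern the abstract describes: broad agreement on the top model alongside systematic differences elsewhere.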
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below and will revise the manuscript accordingly to provide the requested details.
Point-by-point responses
- Referee: [Abstract/Methods] Abstract and Methods (blinding protocol): The headline 25–39 pp win margins rest on the assumption that the 149 graders could not identify tool origin. The manuscript states only that the evaluation was “blinded,” with no description of answer reformatting, removal of citation-style signatures, length normalization, or post-hoc de-blinding checks. Because source quality and verifiability are two of the five scored axes, precisely the dimensions on which OE is engineered to differ, this omission leaves open the possibility that graders de-blinded and favored OE, directly threatening the causal interpretation of the reported differences.
  Authors: We agree with the referee that additional details on the blinding protocol are essential for interpreting the results, particularly given the importance of source quality and verifiability. In the revised manuscript, we will expand the Methods section to describe the specific procedures used to maintain blinding, including reformatting of answers, removal of citation-style signatures, length normalization, and any post-hoc assessments of de-blinding. These additions will address the concern and strengthen the causal claims.
  Revision: yes
- Referee: [Methods] Methods (query sampling and rater reliability): No details are provided on how the 620 Real-POCQi queries were sampled from the OE platform, what exclusion criteria were applied, or how inter-rater reliability was quantified among the 149 specialty-matched physicians. These omissions are load-bearing for the claim that the sample is representative of real clinical decision-support needs.
  Authors: We acknowledge these omissions in the current Methods section. The revised manuscript will include a detailed description of the query sampling process from the OE platform, the exclusion criteria applied, and the quantification of inter-rater reliability (such as through statistical measures like Fleiss' kappa). This will provide transparency and support the representativeness of the Real-POCQi dataset.
  Revision: yes
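The rebuttal names Fleiss' kappa as one possible reliability measure. A minimal Python sketch of the standard formula is given below. It assumes every item is rated by the same number of raters and that ratings fall into a fixed set of categories, which may not match the study's actual design. The function name and the example table are hypothetical.

```python
def fleiss_kappa(table):
    """Fleiss' kappa for a table of category counts.

    `table[i][j]` is the number of raters who assigned item i to
    category j. Every item must have the same number of raters.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of raters")
    n_categories = len(table[0])

    # Proportion of all assignments falling in each category.
    p_cat = [
        sum(row[j] for row in table) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical counts: 4 comparisons, 3 raters each,
# columns = (tool A preferred, tool B preferred, tie).
ratings = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
print(round(fleiss_kappa(ratings), 3))
```

If different questions were graded by different numbers of physicians, this equal-raters form would not apply directly and another agreement statistic would be needed.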
Circularity Check
No circularity: purely empirical comparison with independent expert ratings
Full rationale
The paper reports a head-to-head blinded evaluation of AI tool answers on 620 Real-POCQi queries using 149 specialty-matched physician graders. Primary results are win differences (25-39 pp, p<0.001) across five axes computed via standard statistical tests on collected ratings. No equations, derivations, fitted parameters, or self-citations appear in the provided text. The central claims rest on external expert judgments and prespecified analysis, not on any reduction of outputs to inputs by construction. This is the most common honest finding for empirical comparison studies.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions underlying two-sample proportion tests and p-value calculations for win/loss rates
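To make the kind of assumption listed above concrete, here is a minimal Python sketch of one simple test on win and loss counts. It is a normal-approximation sign test that drops ties and treats comparisons as independent. This technique is swapped in for illustration and is not the paper's prespecified analysis, which is not described in the text above. The function name and counts are hypothetical.

```python
import math


def win_loss_z_test(wins, losses):
    """Two-sided normal-approximation sign test on wins vs losses.

    Ties are dropped. Under the null hypothesis, a non-tied comparison
    is equally likely to be a win or a loss.
    Assumption: comparisons are independent of one another.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    z = (wins - losses) / math.sqrt(n)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not data from the study.
z, p = win_loss_z_test(wins=55, losses=25)
print(f"z = {z:.2f}, p = {p:.4g}")
```

The independence assumption is the weak point of this sketch: when the same physician grades several questions, or the same question is graded several times, the ratings are clustered and a test that accounts for clustering would be more appropriate.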