AI-based Clinical Decision Support for Primary Care: A Real-World Study. arXiv preprint
5 Pith papers cite this work.
citing papers explorer
- PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
- Towards Conversational Medical AI with Eyes, Ears and a Voice
AI co-clinician is a multimodal conversational AI that uses live audio-visual data for real-time medical reasoning in simulated telemedicine, approaching primary care physicians in management plans and differentials but lagging in physical exam and disease-specific tasks.
- Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters
Case-specific clinician rubrics for clinical AI notes achieve strong discrimination between outputs, high stability, and clinician-LLM agreement matching clinician-clinician levels at far lower cost.
- Scalable Stewardship of an LLM-Assisted Clinical Benchmark with Physician Oversight
Physician oversight reveals high error rates in LLM-generated labels for a clinical benchmark and demonstrates that corrected labels improve both evaluation accuracy and downstream model training.
- Benchmarking and Adapting On-Device LLMs for Clinical Decision Support
Fine-tuned on-device LLMs achieve up to 87.9% diagnostic accuracy on clinical tasks, approaching GPT-5.1 at 89.4% while remaining smaller and local.