AgentEHR: Advancing autonomous clinical decision-making via retrospective summarization
3 Pith papers cite this work, all from 2026.
Citing papers
- PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
  PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
- ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
  ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
- CodeClinic: Evaluating Automation of Coding Skills for Clinical Reasoning Agents
  The CodeClinic benchmark shows that LLM-generated Python skill libraries built from clinical guidelines improve consistency and reduce token consumption by up to 40% compared with zero-shot approaches on MIMIC-IV-based tasks.