What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421
13 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (13)
roles: dataset (1)
polarities: use dataset (1)
citing papers explorer
- RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
  RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
- PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
  PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
- DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
  DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.
- MAGE: Multi-Agent Self-Evolution with Co-Evolutionary Knowledge Graphs
  MAGE uses a four-subgraph co-evolutionary knowledge graph plus dual bandits to externalize and retrieve experience for stable self-evolution of frozen language-model agents, showing gains on nine diverse benchmarks.
- Green Shielding: A User-Centric Approach Towards Trustworthy AI
  Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.
- ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
  ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
- Towards the Next Frontier of LLMs, Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning
  Federated PEFT on LLMs across healthcare and finance datasets performs close to centralized training and beats isolated local training under non-IID conditions.
- Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
  XpertBench provides 1,346 rubric-scored expert tasks showing leading LLMs achieve a maximum ~66% success rate and ~55% mean score across domains.
- Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation
  Claim-selective certification decomposes medical RAG responses into verifiable claims scored against retrieved evidence and mapped via an intent-aware selector to actions, reporting zero UCCR and action accuracy of 0.92 on dev and 0.90 on test.
- MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis
  MindLoom synthesizes frontier-level reasoning data by decomposing solutions into thought mode chains, training a retrieval model for mode selection, composing new problems with distribution-aligned sampling, and applying rollout-based difficulty labeling for fine-tuning.
- Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds
  Code-Guided Reasoning protocol reports a 28 percentage-point macro accuracy gain for small language models on MCQA when using generated executable Python scaffolds versus direct answering on 20k+ items.
- Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
  Domain fine-tuning of a 4B LLM yields a statistically significant 6.8 pp accuracy gain on MedQA-USMLE over a general baseline, while RAG over medical explanations produces no significant improvement.
- Fully Open Meditron: An Auditable Pipeline for Clinical LLMs