MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
Medcasereasoning: Evaluating and learning diagnostic reasoning from clinical case reports
8 Pith papers cite this work. Polarity classification is still indexing.
years
2026 8representative citing papers
MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
A text-to-FHIR pipeline produces a synthetic dataset with 82.5% valid bundles and demonstrates reduced LLM diagnostic accuracy on structured EHR data versus plain text.
A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.
Med-HEAL builds a hallucination dataset from BioMistral answers on EHRNoteQA via GPT-4o and human review, then shows self-critique improves accuracy in three of five tested LLMs without retraining.
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.
citing papers explorer
-
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.
-
MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.
-
DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.
-
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs
The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.
-
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings
A text-to-FHIR pipeline produces a synthetic dataset with 82.5% valid bundles and demonstrates reduced LLM diagnostic accuracy on structured EHR data versus plain text.
-
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.
-
Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning
Med-HEAL builds a hallucination dataset from BioMistral answers on EHRNoteQA via GPT-4o and human review, then shows self-critique improves accuracy in three of five tested LLMs without retraining.
-
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support
Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.