MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

URL http://arxiv · 2025 · arXiv 2505.11733

8 Pith papers cite this work.

representative citing papers

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark

cs.CV · 2026-04-12 · unverdicted · novelty 8.0 · 2 refs

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due to capacity dilution.

MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows

cs.CV · 2026-03-25 · conditional · novelty 8.0

MedFlowBench evaluates VLM agents on full radiology and pathology studies by requiring both task answers and verifiable evidence like key slices and regions of interest, revealing that answer-only scores overestimate performance.

DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs

cs.AI · 2026-04-09 · accept · novelty 7.0

The paper delivers the first survey of abductive reasoning in LLMs, a unified two-stage taxonomy, a compact benchmark, and an analysis of gaps relative to deductive and inductive reasoning.

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

cs.CL · 2026-05-28 · unverdicted · novelty 6.0 · 2 refs

A text-to-FHIR pipeline produces a synthetic dataset with 82.5% valid bundles and demonstrates reduced LLM diagnostic accuracy on structured EHR data versus plain text.

Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

cs.CL · 2026-03-29 · unverdicted · novelty 6.0

A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.

Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning

cs.CL · 2026-05-31 · unverdicted · novelty 5.0

Med-HEAL builds a hallucination dataset from BioMistral answers on EHRNoteQA via GPT-4o and human review, then shows self-critique improves accuracy in three of five tested LLMs without retraining.

Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

cs.AI · 2026-05-21 · unverdicted · novelty 5.0

Multi-turn evidence seeking reduces LLM diagnostic accuracy by 12.75% and supporting-evidence quality by 24.36% versus full-context evaluation in a new OSCE-inspired benchmark across 468 cases and 15 models.

citing papers explorer

MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark cs.CV · 2026-04-12 · unverdicted · none · ref 27 · 2 links
MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows cs.CV · 2026-03-25 · conditional · none · ref 28
DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs cs.CV · 2026-05-22 · unverdicted · none · ref 34
Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs cs.AI · 2026-04-09 · accept · none · ref 102
MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings cs.CL · 2026-05-28 · unverdicted · none · ref 12 · 2 links
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning cs.CL · 2026-03-29 · unverdicted · none · ref 8
Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLMs with Hallucination-Aware In-Context Learning cs.CL · 2026-05-31 · unverdicted · none · ref 53
Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support cs.AI · 2026-05-21 · unverdicted · none · ref 36
