DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models

Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang · 2025 · arXiv 2505.14107

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

cs.AI · 2026-05-10 · conditional · novelty 7.0 · 2 refs

EpiGraph builds a heterogeneous epilepsy knowledge graph that, when used with Graph-RAG, improves LLM performance on clinical reasoning tasks, with gains of 30-41% in pharmacogenomics.

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

cs.CV · 2025-09-26 · unverdicted · novelty 7.0

Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.

From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

q-bio.QM · 2026-04-07 · unverdicted · novelty 5.0

Dual-Stream Calibration uses entropy minimization and iterative meta-learning at test time to internalize clinical evidence and outperform standard in-context learning baselines on medical tasks.

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

cs.CL · 2026-02-13 · unverdicted · novelty 4.0

MedXIAOHE is a medical MLLM that claims state-of-the-art benchmark performance through specialized pretraining to cover long-tail diseases and RL-based reasoning training.

citing papers explorer

EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild

cs.AI · 2026-05-10 · conditional · none · ref 25 · 2 links

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

cs.CV · 2025-09-26 · unverdicted · none · ref 38

From Exposure to Internalization: Dual-Stream Calibration for In-context Clinical Reasoning

q-bio.QM · 2026-04-07 · unverdicted · none · ref 3

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

cs.CL · 2026-02-13 · unverdicted · none · ref 76
