Medical Reasoning with Large Language Models: A Survey and MR-Bench

2026 · cs.CL · arXiv 2604.08559

2 Pith papers cite this work.

abstract

Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.

representative citing papers

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

Fine-tuning a Spanish biomedical encoder on Gemini-generated synthetic data for multiple languages yields a bi-encoder that matches or exceeds BioBERT-ST on clinical code retrieval metrics, with further gains from cross-encoder reranking on most languages.

MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

cs.AI · 2026-05-23 · unverdicted · novelty 4.0

MDIA, a specialty-routed 7-node multi-agent system, reports 0.6272 accuracy on 525 HealthBench Professional cases using GPT-5.4, outperforming the ChatGPT for Clinicians baseline by 3.72 points and attributing the lift to architectural components.

citing papers explorer

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

cs.CL · 2026-05-28 · unverdicted · none · ref 60 · internal anchor
