pith. machine review for the scientific record.

arxiv: 2604.08559 · v1 · submitted 2026-03-17 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Medical Reasoning with Large Language Models: A Survey and MR-Bench

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · medical reasoning · clinical decision making · benchmark · survey · abduction deduction induction
0 comments

The pith

A benchmark from real hospital data reveals that large language models perform markedly worse on genuine clinical decisions than on medical exams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models have shown strong results on medical exam questions, yet clinical decision-making demands handling safety-critical uncertainty, patient-specific context, and constantly updating evidence. The paper frames medical reasoning as an iterative cycle of abduction to generate hypotheses, deduction to test them, and induction to refine knowledge. It reviews methods across seven technical routes that combine training-based and training-free strategies, then runs a unified evaluation of representative models. The authors introduce MR-Bench, built directly from hospital records, and find a clear drop in accuracy on these realistic tasks compared with exam-style tests, showing that factual recall alone does not suffice for reliable clinical use.
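The abduction-deduction-induction cycle described above can be sketched as a toy loop. This is illustrative only: the function names and the miniature knowledge base are hypothetical, not the paper's implementation.

```python
# Illustrative sketch (not the paper's method) of one pass through the
# abduction-deduction-induction cycle the survey uses to frame medical
# reasoning. The knowledge base and findings below are toy placeholders.

def abduce(findings, knowledge):
    """Abduction: propose hypotheses that could explain the findings."""
    return [dx for dx, expected in knowledge.items() if findings & expected]

def deduce(hypothesis, knowledge, findings):
    """Deduction: score a hypothesis by how many predicted findings hold."""
    expected = knowledge[hypothesis]
    return len(findings & expected) / len(expected)

def induce(knowledge, hypothesis, new_finding):
    """Induction: refine the knowledge base with newly observed evidence."""
    knowledge[hypothesis] = knowledge[hypothesis] | {new_finding}
    return knowledge

def reasoning_cycle(findings, knowledge):
    """One iteration: abduce candidates, deduce support, keep the best."""
    candidates = abduce(findings, knowledge)
    scored = {h: deduce(h, knowledge, findings) for h in candidates}
    best = max(scored, key=scored.get) if scored else None
    return best, scored

# Toy knowledge base: diagnosis -> expected findings (hypothetical).
knowledge = {
    "pneumonia": {"fever", "cough", "infiltrate"},
    "pulmonary embolism": {"dyspnea", "tachycardia", "hypoxia"},
}
best, scores = reasoning_cycle({"fever", "cough"}, knowledge)
```

In a realistic system each step would be carried by an LLM call rather than set intersection; the point is only the iterative structure the survey attributes to clinical reasoning.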

Core claim

Medical reasoning is defined as an iterative process of abduction, deduction, and induction; existing LLM approaches are grouped into seven major technical routes; and consistent cross-benchmark testing on MR-Bench, constructed from real-world hospital data, demonstrates a pronounced performance gap between exam-level accuracy and results on authentic clinical decision tasks.

What carries the argument

MR-Bench, a dataset drawn from real hospital records that tests iterative medical reasoning under realistic clinical conditions rather than exam-style questions.

If this is right

  • Future LLM development must target robust handling of evolving evidence and patient context rather than isolated factual recall.
  • Unified evaluation settings across benchmarks allow clearer comparison of training-based versus training-free reasoning methods.
  • Deployment in clinical environments requires new techniques that close the observed accuracy drop on authentic decision tasks.
  • Existing methods grouped in the seven routes need targeted adaptation for safety-critical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The iterative abduction-deduction-induction framing could guide development of hybrid systems that combine LLMs with external knowledge bases updated in real time.
  • MR-Bench-style construction from local records might be replicated in other high-stakes domains such as legal case analysis or scientific hypothesis testing.
  • Persistent gaps would imply that full clinical autonomy for LLMs remains distant and that human oversight remains essential for safety.

Load-bearing premise

That a benchmark assembled from existing hospital records adequately represents the full safety-critical, context-dependent, and evidence-evolving character of live clinical decisions.

What would settle it

A controlled replication in which models match or exceed their standard medical-exam scores on MR-Bench tasks would directly undermine the reported performance gap.

read the original abstract

Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys medical reasoning with LLMs, grounding the discussion in cognitive theories of abduction, deduction, and induction to organize existing methods into seven technical routes (training-based and training-free). It performs a unified cross-benchmark evaluation of representative models under a consistent setting and introduces MR-Bench, a new benchmark derived from real-world hospital data. Evaluations on MR-Bench are used to claim a pronounced performance gap between exam-level tasks and authentic clinical decision-making.

Significance. If the central gap claim holds after validation, the work supplies a needed unified taxonomy and evaluation framework for the field while highlighting deployment risks for LLMs in safety-critical settings that require context-dependent, evolving-evidence reasoning. The MR-Bench contribution could serve as a more realistic testbed than existing exam-style benchmarks.

major comments (2)
  1. [MR-Bench construction] MR-Bench section: the claim that the benchmark is 'derived from real-world hospital data' and captures 'authentic clinical decision tasks' is load-bearing for the headline gap result, yet the manuscript supplies no protocol for case selection, temporal context encoding, or anchoring ground-truth labels to actual clinical endpoints or longitudinal outcomes. Without these details it remains possible that MR-Bench reduces to single-encounter snapshots whose performance drop reflects format shift rather than reasoning failure.
  2. [Unified cross-benchmark evaluation] Unified evaluation section: the abstract and evaluation describe a 'consistent experimental setting' and 'clear performance gap,' but omit full details on data selection criteria and statistical controls (e.g., confidence intervals, multiple-run variance). This leaves the gap claim plausible yet not fully verified, directly affecting the strength of the cross-benchmark comparison.
minor comments (2)
  1. [Survey organization] The seven technical routes are introduced in the abstract but would benefit from an explicit enumeration or table early in the survey section to improve readability.
  2. [Evaluation tables] Ensure all cited benchmarks in the unified evaluation are accompanied by brief descriptions of their task formats so readers can interpret the reported gaps without external lookup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and plan to incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [MR-Bench construction] MR-Bench section: the claim that the benchmark is 'derived from real-world hospital data' and captures 'authentic clinical decision tasks' is load-bearing for the headline gap result, yet the manuscript supplies no protocol for case selection, temporal context encoding, or anchoring ground-truth labels to actual clinical endpoints or longitudinal outcomes. Without these details it remains possible that MR-Bench reduces to single-encounter snapshots whose performance drop reflects format shift rather than reasoning failure.

    Authors: We agree that additional details on the MR-Bench construction are necessary to support the claims. In the revised manuscript, we will expand the MR-Bench section to include a full protocol: case selection criteria from the hospital dataset (e.g., inclusion of multi-turn interactions and evolving evidence), how temporal context is encoded in the prompts, and the process for anchoring ground-truth labels to verified clinical outcomes and longitudinal patient records. This will clarify that the benchmark goes beyond single-encounter snapshots and better isolate reasoning failures. revision: yes

  2. Referee: [Unified cross-benchmark evaluation] Unified evaluation section: the abstract and evaluation describe a 'consistent experimental setting' and 'clear performance gap,' but omit full details on data selection criteria and statistical controls (e.g., confidence intervals, multiple-run variance). This leaves the gap claim plausible yet not fully verified, directly affecting the strength of the cross-benchmark comparison.

    Authors: We appreciate this observation. The unified evaluation was conducted under a fixed setting with the same prompts and decoding parameters across benchmarks. In the revision, we will add explicit data selection criteria (e.g., how subsets were chosen for comparability) and report statistical controls including 95% confidence intervals and variance across multiple runs (e.g., 3-5 seeds). This will provide stronger verification for the observed performance gap. revision: yes
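The statistical controls the rebuttal promises can be sketched as follows. The per-seed accuracy values are hypothetical placeholders, and the normal-approximation interval is one reasonable choice, not necessarily what the authors will report.

```python
# Hedged sketch: mean accuracy with a 95% confidence interval across
# independent seeded runs, as promised in the rebuttal. All accuracy
# values below are hypothetical placeholders, not the paper's results.
import statistics

def accuracy_ci(accuracies, z=1.96):
    """Normal-approximation 95% CI over per-seed accuracies."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / n ** 0.5  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

exam_runs = [0.86, 0.84, 0.87, 0.85, 0.86]     # hypothetical exam-style scores
mrbench_runs = [0.61, 0.58, 0.63, 0.60, 0.59]  # hypothetical MR-Bench scores

exam = accuracy_ci(exam_runs)
clinical = accuracy_ci(mrbench_runs)
# Non-overlapping intervals would support a genuine gap rather than seed noise.
gap_is_robust = clinical[2] < exam[1]
```

With only 3-5 seeds a t-distribution critical value would be more conservative than z = 1.96; either way, reporting the interval rather than a single number is what lets readers judge whether the exam-versus-MR-Bench gap exceeds run-to-run variance.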

Circularity Check

0 steps flagged

No significant circularity in survey organization or MR-Bench construction

full rationale

The paper is a survey that reviews existing LLM medical reasoning methods, grounds its conceptualization of reasoning (abduction/deduction/induction) in external cognitive theories, organizes methods into seven routes drawn from literature, performs cross-benchmark evaluations on independent datasets, and introduces MR-Bench as a new construction from real-world hospital data. No equations, fitted parameters, predictions, or self-citations reduce any central claim to its own inputs by construction. The performance gap claim rests on empirical results from the newly introduced benchmark rather than self-referential definitions or load-bearing self-citations. This is a standard non-circular survey structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on standard cognitive theories of clinical reasoning and existing LLM techniques; MR-Bench is constructed from hospital data without introducing new fitted parameters or postulated entities.

axioms (1)
  • domain assumption: Clinical reasoning can be usefully modeled as an iterative process of abduction, deduction, and induction.
    Invoked to organize existing methods into seven technical routes.

pith-pipeline@v0.9.0 · 5531 in / 1219 out tokens · 59132 ms · 2026-05-15T10:17:25.982649+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

148 extracted references · 148 canonical work pages · 17 internal anchors
