pith. machine review for the scientific record.

arxiv: 2604.08559 · v1 · submitted 2026-03-17 · 💻 cs.CL · cs.AI

Recognition: no theorem link

Medical Reasoning with Large Language Models: A Survey and MR-Bench

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · medical reasoning · clinical decision making · benchmark · survey · abduction deduction induction
0 comments

The pith

A benchmark from real hospital data reveals that large language models perform markedly worse on genuine clinical decisions than on medical exams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models have shown strong results on medical exam questions, yet clinical decision-making demands handling safety-critical uncertainty, patient-specific context, and constantly updating evidence. The paper frames medical reasoning as an iterative cycle of abduction to generate hypotheses, deduction to test them, and induction to refine knowledge. It reviews methods across seven technical routes that combine training-based and training-free strategies, then runs a unified evaluation of representative models. The authors introduce MR-Bench, built directly from hospital records, and find a clear drop in accuracy on these realistic tasks compared with exam-style tests, showing that factual recall alone does not suffice for reliable clinical use.
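The abduction-deduction-induction cycle described above can be sketched as a toy loop. This is illustrative only: the function names and the miniature knowledge base are hypothetical, not the paper's implementation.

```python
# Illustrative sketch (not the paper's method) of one pass through the
# abduction-deduction-induction cycle the survey uses to frame medical
# reasoning. The knowledge base and findings below are toy placeholders.

def abduce(findings, knowledge):
    """Abduction: propose hypotheses that could explain the findings."""
    return [dx for dx, expected in knowledge.items() if findings & expected]

def deduce(hypothesis, knowledge, findings):
    """Deduction: score a hypothesis by how many predicted findings hold."""
    expected = knowledge[hypothesis]
    return len(findings & expected) / len(expected)

def induce(knowledge, hypothesis, new_finding):
    """Induction: refine the knowledge base with newly observed evidence."""
    knowledge[hypothesis] = knowledge[hypothesis] | {new_finding}
    return knowledge

def reasoning_cycle(findings, knowledge):
    """One iteration: abduce candidates, deduce support, keep the best."""
    candidates = abduce(findings, knowledge)
    scored = {h: deduce(h, knowledge, findings) for h in candidates}
    best = max(scored, key=scored.get) if scored else None
    return best, scored

# Toy knowledge base: diagnosis -> expected findings (hypothetical).
knowledge = {
    "pneumonia": {"fever", "cough", "infiltrate"},
    "pulmonary embolism": {"dyspnea", "tachycardia", "hypoxia"},
}
best, scores = reasoning_cycle({"fever", "cough"}, knowledge)
```

In a realistic system each step would be carried by an LLM call rather than set intersection; the point is only the iterative structure the survey attributes to clinical reasoning.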

Core claim

Medical reasoning is defined as an iterative process of abduction, deduction, and induction; existing LLM approaches are grouped into seven major technical routes; and consistent cross-benchmark testing on MR-Bench, constructed from real-world hospital data, demonstrates a pronounced performance gap between exam-level accuracy and results on authentic clinical decision tasks.

What carries the argument

MR-Bench, a dataset drawn from real hospital records that tests iterative medical reasoning under realistic clinical conditions rather than exam-style questions.

If this is right

  • Future LLM development must target robust handling of evolving evidence and patient context rather than isolated factual recall.
  • Unified evaluation settings across benchmarks allow clearer comparison of training-based versus training-free reasoning methods.
  • Deployment in clinical environments requires new techniques that close the observed accuracy drop on authentic decision tasks.
  • Existing methods grouped in the seven routes need targeted adaptation for safety-critical settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The iterative abduction-deduction-induction framing could guide development of hybrid systems that combine LLMs with external knowledge bases updated in real time.
  • MR-Bench-style construction from local records might be replicated in other high-stakes domains such as legal case analysis or scientific hypothesis testing.
  • Persistent gaps would imply that full clinical autonomy for LLMs remains distant and that human oversight remains essential for safety.

Load-bearing premise

That a benchmark assembled from existing hospital records adequately represents the full safety-critical, context-dependent, and evidence-evolving character of live clinical decisions.

What would settle it

A controlled replication in which models match or exceed their standard medical-exam scores on MR-Bench tasks would directly undermine the reported performance gap.

read the original abstract

Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical, context-dependent, and conducted under evolving evidence. In such situations, reliable LLM performance depends not on factual recall alone, but on robust medical reasoning. In this work, we present a comprehensive review of medical reasoning with LLMs. Grounded in cognitive theories of clinical reasoning, we conceptualize medical reasoning as an iterative process of abduction, deduction, and induction, and organize existing methods into seven major technical routes spanning training-based and training-free approaches. We further conduct a unified cross-benchmark evaluation of representative medical reasoning models under a consistent experimental setting, enabling a more systematic and comparable assessment of the empirical impact of existing methods. To better assess clinically grounded reasoning, we introduce MR-Bench, a benchmark derived from real-world hospital data. Evaluations on MR-Bench expose a pronounced gap between exam-level performance and accuracy on authentic clinical decision tasks. Overall, this survey provides a unified view of existing medical reasoning methods, benchmarks, and evaluation practices, and highlights key gaps between current model performance and the requirements of real-world clinical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys medical reasoning with LLMs, grounding the discussion in cognitive theories of abduction, deduction, and induction to organize existing methods into seven technical routes (training-based and training-free). It performs a unified cross-benchmark evaluation of representative models under a consistent setting and introduces MR-Bench, a new benchmark derived from real-world hospital data. Evaluations on MR-Bench are used to claim a pronounced performance gap between exam-level tasks and authentic clinical decision-making.

Significance. If the central gap claim holds after validation, the work supplies a needed unified taxonomy and evaluation framework for the field while highlighting deployment risks for LLMs in safety-critical settings that require context-dependent, evolving-evidence reasoning. The MR-Bench contribution could serve as a more realistic testbed than existing exam-style benchmarks.

major comments (2)
  1. [MR-Bench construction] MR-Bench section: the claim that the benchmark is 'derived from real-world hospital data' and captures 'authentic clinical decision tasks' is load-bearing for the headline gap result, yet the manuscript supplies no protocol for case selection, temporal context encoding, or anchoring ground-truth labels to actual clinical endpoints or longitudinal outcomes. Without these details it remains possible that MR-Bench reduces to single-encounter snapshots whose performance drop reflects format shift rather than reasoning failure.
  2. [Unified cross-benchmark evaluation] Unified evaluation section: the abstract and evaluation describe a 'consistent experimental setting' and 'clear performance gap,' but omit full details on data selection criteria and statistical controls (e.g., confidence intervals, multiple-run variance). This leaves the gap claim plausible yet not fully verified, directly affecting the strength of the cross-benchmark comparison.
minor comments (2)
  1. [Survey organization] The seven technical routes are introduced in the abstract but would benefit from an explicit enumeration or table early in the survey section to improve readability.
  2. [Evaluation tables] Ensure all cited benchmarks in the unified evaluation are accompanied by brief descriptions of their task formats so readers can interpret the reported gaps without external lookup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and plan to incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [MR-Bench construction] MR-Bench section: the claim that the benchmark is 'derived from real-world hospital data' and captures 'authentic clinical decision tasks' is load-bearing for the headline gap result, yet the manuscript supplies no protocol for case selection, temporal context encoding, or anchoring ground-truth labels to actual clinical endpoints or longitudinal outcomes. Without these details it remains possible that MR-Bench reduces to single-encounter snapshots whose performance drop reflects format shift rather than reasoning failure.

    Authors: We agree that additional details on the MR-Bench construction are necessary to support the claims. In the revised manuscript, we will expand the MR-Bench section to include a full protocol: case selection criteria from the hospital dataset (e.g., inclusion of multi-turn interactions and evolving evidence), how temporal context is encoded in the prompts, and the process for anchoring ground-truth labels to verified clinical outcomes and longitudinal patient records. This will clarify that the benchmark goes beyond single-encounter snapshots and better isolate reasoning failures. revision: yes

  2. Referee: [Unified cross-benchmark evaluation] Unified evaluation section: the abstract and evaluation describe a 'consistent experimental setting' and 'clear performance gap,' but omit full details on data selection criteria and statistical controls (e.g., confidence intervals, multiple-run variance). This leaves the gap claim plausible yet not fully verified, directly affecting the strength of the cross-benchmark comparison.

    Authors: We appreciate this observation. The unified evaluation was conducted under a fixed setting with the same prompts and decoding parameters across benchmarks. In the revision, we will add explicit data selection criteria (e.g., how subsets were chosen for comparability) and report statistical controls including 95% confidence intervals and variance across multiple runs (e.g., 3-5 seeds). This will provide stronger verification for the observed performance gap. revision: yes
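The statistical controls the rebuttal promises can be sketched as follows. The per-seed accuracy values are hypothetical placeholders, and the normal-approximation interval is one reasonable choice, not necessarily what the authors will report.

```python
# Hedged sketch: mean accuracy with a 95% confidence interval across
# independent seeded runs, as promised in the rebuttal. All accuracy
# values below are hypothetical placeholders, not the paper's results.
import statistics

def accuracy_ci(accuracies, z=1.96):
    """Normal-approximation 95% CI over per-seed accuracies."""
    n = len(accuracies)
    mean = statistics.mean(accuracies)
    sem = statistics.stdev(accuracies) / n ** 0.5  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

exam_runs = [0.86, 0.84, 0.87, 0.85, 0.86]     # hypothetical exam-style scores
mrbench_runs = [0.61, 0.58, 0.63, 0.60, 0.59]  # hypothetical MR-Bench scores

exam = accuracy_ci(exam_runs)
clinical = accuracy_ci(mrbench_runs)
# Non-overlapping intervals would support a genuine gap rather than seed noise.
gap_is_robust = clinical[2] < exam[1]
```

With only 3-5 seeds a t-distribution critical value would be more conservative than z = 1.96; either way, reporting the interval rather than a single number is what lets readers judge whether the exam-versus-MR-Bench gap exceeds run-to-run variance.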

Circularity Check

0 steps flagged

No significant circularity in survey organization or MR-Bench construction

full rationale

The paper is a survey that reviews existing LLM medical reasoning methods, grounds its conceptualization of reasoning (abduction/deduction/induction) in external cognitive theories, organizes methods into seven routes drawn from literature, performs cross-benchmark evaluations on independent datasets, and introduces MR-Bench as a new construction from real-world hospital data. No equations, fitted parameters, predictions, or self-citations reduce any central claim to its own inputs by construction. The performance gap claim rests on empirical results from the newly introduced benchmark rather than self-referential definitions or load-bearing self-citations. This is a standard non-circular survey structure.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The survey rests on standard cognitive theories of clinical reasoning and existing LLM techniques; MR-Bench is constructed from hospital data without introducing new fitted parameters or postulated entities.

axioms (1)
  • domain assumption: Clinical reasoning can be usefully modeled as an iterative process of abduction, deduction, and induction.
    Invoked to organize existing methods into seven technical routes.

pith-pipeline@v0.9.0 · 5531 in / 1219 out tokens · 59132 ms · 2026-05-15T10:17:25.982649+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

148 extracted references · 148 canonical work pages · 17 internal anchors
