
arxiv: 2605.07305 · v1 · submitted 2026-05-08 · 💻 cs.CL · cs.AI

MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs

Pith reviewed 2026-05-11 02:13 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multi-turn diagnosis · active clinical reasoning · synthetic trajectory generation · LLM fine-tuning · medical decision support · trajectory filtering · partial evidence reasoning

The pith

Training LLMs on synthesized multi-turn diagnostic trajectories lets them order tests and update hypotheses under partial evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing medical LLMs fail at active diagnosis because they are trained on complete patient information given all at once rather than on the sequential process of gathering evidence through tests and revising a differential diagnosis. The paper addresses this by creating a tree-structured pipeline that generates synthetic multi-turn trajectories from LLM interactions with a simulated environment and filters them using two metrics grounded in disease knowledge graphs. Fine-tuning an 8B model on the resulting 32,681 trajectories produces the strongest open-source results on both an existing benchmark and a new hard set of 300 cases. This matters because real clinical work consists of repeated cycles of observation, action, and belief update rather than one-shot reasoning from full facts.
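
The cycle of observation, action, and belief update described above can be sketched as a toy loop. This is a minimal illustration under stated assumptions, not the paper's environment: `ToyEnvironment`, `toy_policy`, `run_episode`, the case contents, and the turn limit are all hypothetical names and values invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ToyEnvironment:
    """Hypothetical simulated case: reveals a result only when a test is ordered."""
    results: dict      # test name -> finding
    diagnosis: str     # ground-truth label, never shown to the policy

    def order(self, test: str) -> str:
        # Tests missing from the case record are reported as unavailable.
        return self.results.get(test, "UNAVAILABLE")


def toy_policy(evidence: dict):
    """Hypothetical rule-based policy: returns (differential, next_test or None)."""
    differential = ["pneumonia", "pulmonary embolism", "heart failure"]
    if "chest x-ray" not in evidence:
        return differential, "chest x-ray"
    if "consolidation" in evidence["chest x-ray"]:
        differential = ["pneumonia"]
    elif "d-dimer" not in evidence:
        return differential, "d-dimer"
    return differential, None  # None means the policy is ready to conclude


def run_episode(env, policy, max_turns=5):
    """Alternate belief update and action until the policy concludes."""
    evidence, trajectory = {}, []
    for turn in range(max_turns):
        differential, next_test = policy(evidence)       # belief update
        trajectory.append({"turn": turn, "ddx": differential, "action": next_test})
        if next_test is None:                             # policy concludes
            break
        evidence[next_test] = env.order(next_test)        # act, then observe
    return trajectory


env = ToyEnvironment(
    results={"chest x-ray": "right lower lobe consolidation"},
    diagnosis="pneumonia",
)
trajectory = run_episode(env, toy_policy)
final = trajectory[-1]["ddx"][0]
print(len(trajectory), final, final == env.diagnosis)
```

The point of the sketch is only the control flow: the policy never sees the full record, and each turn's differential depends on what has been gathered so far.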

Core claim

The paper presents MedAction, a distillation pipeline that synthesizes diverse multi-turn diagnostic trajectories via LLM-environment interaction and applies Disease Trajectory Consistency (DTC) and Reasoning-Action Consistency (RAC) filters to retain only high-quality paths. From 2,896 PMC cases the pipeline produces the MedAction-32K dataset; fine-tuning an 8B model on this data yields state-of-the-art performance among open-source models on MedR-Bench and the curated MedAction-300-Hard benchmark.
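
The summary above does not give formulas for DTC or RAC, so the following is only a sketch of what knowledge-graph-grounded checks of this kind could look like. Everything here is an assumption made for illustration: the toy graph, the use of shortest-path distance as a DTC proxy (lower meaning closer to the correct diagnosis), the convergence rule, and the evidence-citation rule used as a RAC proxy.

```python
from collections import deque

# Hypothetical toy disease graph (undirected adjacency); not taken from the paper.
GRAPH = {
    "chest pain": ["cardiac disease", "pulmonary disease"],
    "cardiac disease": ["chest pain", "myocardial infarction"],
    "pulmonary disease": ["chest pain", "pneumonia", "pulmonary embolism"],
    "myocardial infarction": ["cardiac disease"],
    "pneumonia": ["pulmonary disease"],
    "pulmonary embolism": ["pulmonary disease"],
}


def graph_distance(start, goal):
    """Breadth-first shortest-path length; infinity if the goal is unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in GRAPH.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def dtc_profile(top_hypotheses, truth):
    """Assumed DTC proxy: per-turn distance from the top hypothesis to the truth."""
    return [graph_distance(h, truth) for h in top_hypotheses]


def converges(profile):
    """Assumed pass rule: distance never increases and ends at zero."""
    return all(a >= b for a, b in zip(profile, profile[1:])) and profile[-1] == 0


def rac_pass_rate(turns):
    """Assumed RAC proxy: share of turns citing only evidence already gathered."""
    gathered, passed = set(), 0
    for turn in turns:
        if set(turn["cites"]) <= gathered:
            passed += 1
        gathered.update(turn["new_evidence"])
    return passed / len(turns)


profile = dtc_profile(
    ["myocardial infarction", "pulmonary embolism", "pneumonia"], "pneumonia"
)
turns = [
    {"cites": [], "new_evidence": ["troponin normal"]},
    {"cites": ["troponin normal"], "new_evidence": ["chest x-ray consolidation"]},
    # The last turn cites a sputum culture that was never ordered.
    {"cites": ["chest x-ray consolidation", "sputum culture"], "new_evidence": []},
]
print(profile, converges(profile), rac_pass_rate(turns))
```

In this toy case the hypothesis sequence moves steadily toward the correct node, while the last turn fails the evidence check because it cites a result that was never gathered.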

What carries the argument

The tree-structured distillation pipeline that generates trajectories and filters them with the DTC and RAC consistency metrics derived from knowledge graphs.
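
One way to picture a tree-structured pipeline is as a tree whose root-to-leaf paths are candidate trajectories. The sketch below is hypothetical: the `Node` fields, the `on_track` flag standing in for a per-turn quality check, and the rule of keeping on-track prefixes of paths that end in a wrong conclusion are assumptions for illustration, not the paper's procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One turn in a hypothetical trajectory tree."""
    label: str
    on_track: bool = True                   # assumed per-turn quality flag
    final_diagnosis: Optional[str] = None   # set only on leaves
    children: List["Node"] = field(default_factory=list)


def leaf_paths(node, prefix=()):
    """Yield every root-to-leaf path as a tuple of nodes."""
    path = prefix + (node,)
    if not node.children:
        yield path
    for child in node.children:
        yield from leaf_paths(child, path)


def select(root, truth):
    """Keep correct complete paths, and on-track prefixes of the others."""
    kept = set()
    for path in leaf_paths(root):
        correct = path[-1].final_diagnosis == truth
        turns = path if correct else path[:-1]   # drop a wrong conclusion
        prefix = []
        for node in turns:
            if not node.on_track:
                break
            prefix.append(node.label)
        if len(prefix) > 1:                      # a lone root teaches no action
            kept.add(tuple(prefix))
    return sorted(kept)


root = Node("CC + exam", children=[
    Node("order chest x-ray", children=[
        Node("conclude pneumonia", final_diagnosis="pneumonia"),
        Node("conclude heart failure", final_diagnosis="heart failure"),
    ]),
    Node("order MRI brain", on_track=False, children=[
        Node("conclude migraine", final_diagnosis="migraine"),
    ]),
])
kept = select(root, "pneumonia")
for labels in kept:
    print(" -> ".join(labels))
```

Branching at each turn is what gives the tree its diversity; the selection step then decides which paths, or parts of paths, become training examples.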

If this is right

  • Open-source medical LLMs can be trained to manage the sequential gathering and interpretation of evidence instead of assuming complete information is available upfront.
  • The same synthesis method scales to produce tens of thousands of training examples without requiring human annotation of each turn.
  • New benchmarks focused on hard multi-turn cases can serve as standard tests for active diagnostic capability.
  • Performance improvements arise specifically from teaching models to maintain coherence and ground test orders in evolving partial evidence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The filtering metrics could be reused to evaluate or improve other agentic systems that must act under uncertainty.
  • Deployment in live hospital workflows would require additional checks for safety and alignment with institutional protocols.
  • The approach might transfer to other domains that involve iterative hypothesis refinement, such as technical troubleshooting or scientific hypothesis testing.

Load-bearing premise

The trajectories produced by the pipeline and retained by the DTC and RAC filters are representative of the reasoning patterns that occur in actual multi-turn clinical practice.

What would settle it

Testing the fine-tuned model on a set of trajectories drawn from real recorded doctor-patient encounters or from prospective clinical observations would show whether the reported gains persist outside the synthetic environment.

Figures

Figures reproduced from arXiv: 2605.07305 by Chenwei Wu, Chia-Hsuan Hsu, Donghua Zhang, Fang-Ming Hung, Feng Liu, Guoan Wang, Hsin-Ling Hsu, Jerry Wang, Jun-En Ding, Liyue Shen, Nai-Chia Chen, Zizheng Wang.

Figure 1: Static vs. Active Diagnosis. Left: Existing benchmarks provide a complete patient record and ask for a one-shot diagnosis. Right: In active diagnosis, the model begins with only the chief complaint (CC) and physical examination, then iteratively orders new tests to resolve uncertainty and progressively narrows the differential diagnosis (DDx) from six candidate conditions down to a definitive diagnosis thr…
Figure 2: Three representative failure modes (MedGemma) in active diagnosis. Each panel illustrates one failure mode with a concrete example: (A) ungrounded test ordering, (B) unreliable diagnostic update, and (C) degraded multi-turn coherence. The Venn diagram shows failure mode co-occurrence across 155 erroneous cases out of 222 examined cases in our training set.
Figure 3: Overview of the MedAction pipeline. Stage 1 transforms retrospective case reports into interactive clinical environments. Stage 2 generates a trajectory tree through multi-turn LLM–environment interaction. Stage 3 applies knowledge-graph-grounded metrics to retain high-quality trajectories: complete trajectories reaching the ground-truth diagnosis (green, ✓), truncated prefixes from partially correct paths…
Figure 4: Ablation studies on trajectory data curation. Lighter bars: test recommendation …
Figure 5: Effect of trajectory filtering on diagnostic performance. (a) Average DTC for differently finetuned models at each turn. Lower values indicate closer alignment. (b) Turn-level pass rate under RAC filtering. Higher is better. (c) Scaling performance for each filtering strategy.
Figure 6: Dataset composition across primary body systems and disease types. The outer …
Figure 7: Distribution of diagnostic tests in the dataset. The inner ring shows major test …
Original abstract

Most existing LLM diagnoses are evaluated on static, single-turn settings where complete patient information is provided upfront, an oversimplification of real clinical practice. We study active diagnosis: the real-life clinical process of starting from initial observation, ordering tests, interpreting results, and updating a differential diagnosis across multiple turns. Through systematic analysis, we identify three recurring failure modes in current LLMs: ungrounded test ordering, unreliable diagnostic update, and degraded multi-turn coherence. Together, these failures reveal a core deficit: existing medical training data teaches models to reason from complete information but not to act under evolving, partial evidence. To address this gap, we introduce MedAction, a tree-structured distillation pipeline that synthesizes diverse and high-quality multi-turn diagnostic trajectories via LLM-environment interaction. We propose two knowledge-graph-grounded metrics to filter trajectory quality: Disease Trajectory Consistency (DTC), which tracks whether the model's hypothesis converges toward the correct diagnosis, and Reasoning-Action Consistency (RAC), which verifies that belief updates are driven by gathered evidence. Using this pipeline, we construct MedAction-32K, a dataset of 32,681 trajectories from 2,896 PMC cases. Fine-tuning an 8B model on MedAction-32K achieves state-of-the-art performance among open-source models on both MedR-Bench and our curated MedAction-300-Hard benchmark, pushing the edge for open-source medical LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper identifies three failure modes in current LLMs for active multi-turn clinical diagnosis (ungrounded test ordering, unreliable diagnostic updates, and degraded coherence) and introduces the MedAction tree-structured distillation pipeline. This pipeline uses LLM-environment interactions to synthesize trajectories from 2,896 PMC cases, filtered by two knowledge-graph-grounded metrics—Disease Trajectory Consistency (DTC) and Reasoning-Action Consistency (RAC)—to produce the MedAction-32K dataset. Fine-tuning an 8B model on this dataset is reported to achieve state-of-the-art results among open-source models on MedR-Bench and the curated MedAction-300-Hard benchmark.

Significance. If the filtered trajectories prove to be faithful proxies for real clinical multi-turn reasoning, the work would meaningfully advance open-source medical LLMs by shifting focus from static single-turn settings to active diagnosis under partial evidence. The release of a 32K-trajectory dataset and concrete filtering metrics constitutes a concrete, reusable contribution that could support further research on evidence-driven belief updating.

major comments (2)
  1. [Abstract / pipeline description] Abstract and pipeline description: DTC and RAC are defined from the same LLM-environment loop and knowledge-graph construction used to generate the trajectories, yet no external validation (e.g., correlation with human clinician ratings of trajectory quality or comparison to recorded multi-turn clinical encounters) is reported. This circularity is load-bearing for the central SOTA claim, as benchmark gains could reflect distillation of the pipeline's own biases rather than improved real-world active reasoning.
  2. [Results] Results section (implied by abstract claims): The abstract states SOTA performance on MedR-Bench and MedAction-300-Hard but supplies no quantitative scores, baseline comparisons, error bars, or statistical tests. Without these details it is impossible to assess whether the reported gains are robust or merely incremental.
minor comments (1)
  1. [Abstract] The curation process for the MedAction-300-Hard benchmark and the exact selection criteria for the 2,896 PMC cases are not described in the provided abstract; adding these details would improve reproducibility.
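
The external validation requested in the first major comment could take the form of a rank correlation between an automatic trajectory score and clinician ratings. The sketch below implements Spearman's rank correlation in plain Python. The numbers are made up for illustration only, and the assumption that lower automatic scores should pair with higher clinician ratings is a choice made for this example.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: automatic scores (lower = better, by assumption) and
# clinician ratings on a 1-5 scale (higher = better) for six trajectories.
metric_scores = [0.9, 0.2, 0.5, 0.7, 0.1, 0.4]
clinician_ratings = [1, 5, 3, 2, 5, 4]
rho = spearman(metric_scores, clinician_ratings)
print(round(rho, 3))
```

A strongly negative value on real data would indicate that the automatic score orders trajectories much as clinicians do; a value near zero would support the referee's concern.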

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and indicating revisions where the manuscript will be updated to strengthen the presentation.

point-by-point responses
  1. Referee: [Abstract / pipeline description] Abstract and pipeline description: DTC and RAC are defined from the same LLM-environment loop and knowledge-graph construction used to generate the trajectories, yet no external validation (e.g., correlation with human clinician ratings of trajectory quality or comparison to recorded multi-turn clinical encounters) is reported. This circularity is load-bearing for the central SOTA claim, as benchmark gains could reflect distillation of the pipeline's own biases rather than improved real-world active reasoning.

    Authors: We appreciate the referee's concern regarding potential circularity. DTC and RAC are computed using knowledge graphs extracted directly from the source PMC case reports, which contain real clinical narratives and documented diagnoses independent of the LLM generator. DTC specifically quantifies convergence of the trajectory's evolving hypothesis to the case's ground-truth diagnosis, while RAC verifies that each action is supported by evidence present in the case text. This design grounds the filters in external literature rather than purely model-internal signals. That said, we acknowledge that direct correlation with human clinician ratings of trajectory quality or comparisons to recorded real-world multi-turn encounters was not included in the original submission. We have revised the manuscript to add an explicit limitations subsection discussing this point and outlining it as important future work, while retaining the objective KG-based rationale for the current metrics.

    revision: partial

  2. Referee: [Results] Results section (implied by abstract claims): The abstract states SOTA performance on MedR-Bench and MedAction-300-Hard but supplies no quantitative scores, baseline comparisons, error bars, or statistical tests. Without these details it is impossible to assess whether the reported gains are robust or merely incremental.

    Authors: We agree that the abstract should provide key quantitative details to support the SOTA claim. The full results section already includes tables with exact scores, baseline comparisons (including both open-source and closed models), standard deviations across runs, and statistical significance tests. In the revised manuscript we have updated the abstract to explicitly state the primary performance numbers (e.g., the 8B model's scores on MedR-Bench and MedAction-300-Hard relative to prior open-source baselines) and direct readers to the detailed tables and error analysis in the results section.

    revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses independent synthesis and held-out evaluation

full rationale

The paper describes synthesizing trajectories via LLM-environment interaction, proposing DTC and RAC as knowledge-graph-grounded filters for quality, constructing MedAction-32K from PMC cases, and reporting SOTA results on separate held-out sets (MedR-Bench and curated MedAction-300-Hard). No equations, definitions, or self-citations are presented that reduce the performance claim to the generation pipeline by construction. The filtering metrics operate on the synthesized trajectories to select for convergence and evidence-driven updates, but this does not make the benchmark gains equivalent to the inputs; the evaluation benchmarks are distinct and the chain remains self-contained without load-bearing self-references or fitted predictions renamed as results.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLM-simulated patient responses and knowledge-graph-derived consistency checks produce trajectories that transfer to real clinical settings. No new physical entities are introduced. The main free parameters are implicit in the choice of which LLM acts as the environment and the exact thresholds for DTC and RAC filtering.

free parameters (1)
  • DTC and RAC filtering thresholds
    The paper does not state explicit numerical thresholds; these are chosen to retain 32k trajectories and therefore function as fitted selection parameters.
axioms (2)
  • domain assumption: LLM-generated patient responses in the environment simulation are sufficiently realistic to train diagnostic behavior
    Invoked when the pipeline uses LLM-environment interaction to synthesize trajectories.
  • domain assumption: Knowledge-graph grounding provides an objective proxy for clinical correctness
    Used to define DTC and RAC without external clinician validation mentioned in the abstract.
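
To see why the filtering thresholds behave like fitted selection parameters, consider how the retained set changes as they move. The sketch below uses made-up scores and hypothetical field names (`dtc`, `rac`) and threshold names (`dtc_max`, `rac_min`); it assumes lower DTC and higher RAC are better, and it is not the paper's filtering code.

```python
def retained(candidates, dtc_max, rac_min):
    """Keep candidates whose assumed DTC is low enough and RAC is high enough."""
    return [c for c in candidates if c["dtc"] <= dtc_max and c["rac"] >= rac_min]


# Made-up scores for five hypothetical candidate trajectories.
candidates = [
    {"id": "t1", "dtc": 0.10, "rac": 0.95},
    {"id": "t2", "dtc": 0.30, "rac": 0.80},
    {"id": "t3", "dtc": 0.55, "rac": 0.90},
    {"id": "t4", "dtc": 0.20, "rac": 0.60},
    {"id": "t5", "dtc": 0.70, "rac": 0.40},
]

# Sweep from strict to loose thresholds and record which trajectories survive.
sweep = {}
for dtc_max, rac_min in [(0.25, 0.9), (0.5, 0.75), (0.6, 0.5)]:
    ids = [c["id"] for c in retained(candidates, dtc_max, rac_min)]
    sweep[(dtc_max, rac_min)] = ids
    print(dtc_max, rac_min, ids)
```

The same pool yields one, two, or four survivors depending on the pair chosen, so any reported dataset size implicitly fixes these values.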



