
arXiv: 2606.29876 · v1 · pith:PR4XOU2Snew · submitted 2026-06-29 · 💻 cs.CL · cs.AI · q-bio.QM

Clinical Reasoning Graphs: Structured Evaluation of LLM Diagnostic Reasoning Reveals Competence Without Consistency

Pith reviewed 2026-06-30 05:54 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · q-bio.QM
keywords clinical reasoning graphs · LLM diagnostic reasoning · reasoning consistency · graph similarity · medical AI evaluation · diagnostic schemas · process-level evaluation · ontology extraction

The pith

LLMs reach 60-70% diagnostic accuracy on complex clinical cases, but their reasoning graphs are no more similar within clusters of clinically similar cases than between clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces clinical reasoning graphs to test whether LLMs apply stable diagnostic schemas to clinically similar cases rather than pattern matching. It extracts these graphs from 750 diagnostic traces produced by five models across 50 NEJM cases under three prompt conditions, using a fixed ontology with 5 node types and 7 edge types. The central test compares composite graph similarity for pairs of cases in the same clinical cluster against pairs in different clusters. Across 15 model-condition pairs, within-cluster and between-cluster similarities are nearly identical, and no difference survives multiple-testing correction; similarity is also nearly the same for model pairs that are both correct and for pairs that are both incorrect. The paper concludes that final-answer accuracy must be supplemented by process-level measures of reasoning consistency.

Core claim

Across 15 model-condition comparisons, within-cluster and between-cluster composite similarity are nearly equal, and no comparison survives multiple-testing correction; graph similarity is also nearly identical for pairs of models that are both correct (0.488) and both incorrect (0.484). Structured reflection prompting increases explicit discriminating-feature analysis within traces (+33%) but does not increase cross-case consistency.
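The within-cluster versus between-cluster comparison can be illustrated with a minimal sketch. It assumes a precomputed table of pairwise similarities and a case-to-cluster mapping; the case ids, cluster labels, scores, and the function name `within_between_means` are hypothetical and are not taken from the paper or its released artifacts.

```python
from itertools import combinations
from statistics import mean


def within_between_means(sim, clusters):
    """Return (mean within-cluster similarity, mean between-cluster similarity).

    sim: dict mapping frozenset({case_a, case_b}) -> similarity score
    clusters: dict mapping case id -> cluster label
    """
    within, between = [], []
    for a, b in combinations(sorted(clusters), 2):
        score = sim[frozenset((a, b))]
        # A pair is "within" when both cases share a cluster label.
        (within if clusters[a] == clusters[b] else between).append(score)
    return mean(within), mean(between)


# Toy data (hypothetical): four cases in two clinical clusters.
clusters = {"case1": "pulm", "case2": "pulm", "case3": "neuro", "case4": "neuro"}
sim = {
    frozenset(("case1", "case2")): 0.50,
    frozenset(("case3", "case4")): 0.48,
    frozenset(("case1", "case3")): 0.49,
    frozenset(("case1", "case4")): 0.47,
    frozenset(("case2", "case3")): 0.51,
    frozenset(("case2", "case4")): 0.49,
}
w, b = within_between_means(sim, clusters)
gap = w - b  # a gap near zero means no schema-like clustering in this toy data
```

In this toy table both means come out at about 0.49, which mirrors the shape of the reported null: clinically similar pairs are no closer than dissimilar ones.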

What carries the argument

Clinical reasoning graphs, structured representations with 5 node types and 7 edge types extracted from free-text LLM diagnostic traces via a domain ontology, whose composite similarity serves as the measure of stable reasoning patterns.
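As a rough illustration of what a typed reasoning graph and a composite similarity could look like, the sketch below stores nodes and edges as typed tuples and averages two Jaccard overlaps. The paper's actual node types, edge types, and composite formula are not given here, so the type names (`feature`, `diagnosis`, `supports`, `argues_against`), the equal weighting, and the use of Jaccard overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningGraph:
    # Nodes are (node_type, label); edges are (source_label, edge_type, target_label).
    nodes: frozenset
    edges: frozenset


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def composite_similarity(g1, g2, node_weight=0.5):
    """Weighted mean of node-set and edge-set Jaccard (assumed form)."""
    return (node_weight * jaccard(g1.nodes, g2.nodes)
            + (1 - node_weight) * jaccard(g1.edges, g2.edges))


# Hypothetical graphs loosely modeled on the Figure 1 example.
g1 = ReasoningGraph(
    nodes=frozenset({("feature", "nodules"), ("feature", "lymphadenopathy"),
                     ("diagnosis", "IPF"), ("diagnosis", "sarcoidosis")}),
    edges=frozenset({("nodules", "argues_against", "IPF"),
                     ("nodules", "supports", "sarcoidosis"),
                     ("lymphadenopathy", "supports", "sarcoidosis")}),
)
g2 = ReasoningGraph(
    nodes=frozenset({("feature", "nodules"),
                     ("diagnosis", "IPF"), ("diagnosis", "sarcoidosis")}),
    edges=frozenset({("nodules", "argues_against", "IPF"),
                     ("nodules", "supports", "sarcoidosis")}),
)
score = composite_similarity(g1, g2)
```

Here the node overlap is 3/4 and the edge overlap is 2/3, so the equal-weight composite is their average.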

If this is right

  • Diagnostic accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching.
  • Graph structure captures a dimension of reasoning independent of whether the final diagnosis is correct.
  • Structured reflection prompting boosts explicit feature analysis but leaves cross-case consistency unchanged.
  • Process-level evaluation with graph similarity should complement final-answer accuracy in LLM clinical assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmarks that score only final diagnoses may overestimate reliability when models are deployed on new cases.
  • The same graph extraction approach could be adapted to test reasoning consistency in non-medical domains such as legal analysis or scientific hypothesis generation.
  • Low consistency may contribute to brittle performance when case distributions shift even if average accuracy remains high.

Load-bearing premise

Composite graph similarity, computed on graphs extracted with the ontology of 5 node types and 7 edge types, validly operationalizes stable structured reasoning patterns (diagnostic schemas).

What would settle it

A replication on the same 50 NEJM cases that finds significantly higher within-cluster than between-cluster composite similarity after multiple-testing correction would falsify the central result.
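One way such a replication could test for a within-cluster advantage is a label-permutation test, sketched below. This is a generic technique chosen for illustration; the paper's own test statistic is not specified in this summary. The similarity matrices, cluster labels, and function names are hypothetical.

```python
import random
from itertools import combinations


def mean_gap(sim, labels):
    """Mean within-cluster minus mean between-cluster similarity."""
    within, between = [], []
    for i, j in combinations(range(len(labels)), 2):
        (within if labels[i] == labels[j] else between).append(sim[i][j])
    return sum(within) / len(within) - sum(between) / len(between)


def permutation_p_value(sim, labels, n_perm=2000, seed=0):
    """One-sided p-value for 'within > between' under shuffled cluster labels."""
    rng = random.Random(seed)
    observed = mean_gap(sim, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # cluster sizes are preserved by shuffling
        if mean_gap(sim, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # nine toy cases in three clusters

# Positive control: similarity is high only inside a cluster.
structured = [[0.9 if labels[i] == labels[j] else 0.1 for j in range(9)]
              for i in range(9)]
# Null-like case: every pair is equally similar.
flat = [[0.5] * 9 for _ in range(9)]

p_structured = permutation_p_value(structured, labels)
p_flat = permutation_p_value(flat, labels)
```

The structured matrix yields a small p-value, while the flat matrix yields a p-value of 1 because every shuffled labeling ties the observed gap of zero.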

Figures

Figures reproduced from arXiv: 2606.29876 by Nisarg A. Patel (University of California, San Francisco).

Figure 1: A clinical reasoning graph extracted from a single LLM trace, read top-down through phases. Features (boxes) support or argue against diagnoses (circles; numbers are model confidence). Idiopathic pulmonary fibrosis (IPF) leads in Phase 1 (60%); in Phase 2, discriminating features (nodules, lymphadenopathy, lack of basal predominance) argue against IPF, support sarcoidosis, and trigger a reflection that pro…
Figure 2: Empirical distributions of pairwise graph similarity. Within-cluster and between-cluster distributions overlap almost completely (no detectable diagnostic-schema-like clustering); their shared bimodality reflects the qualifier component (Appendix E), not clinical structure. Inter-extractor (same trace, different extraction model) and test-retest (same trace, same model) distributions are clearly separated…
original abstract

Modern large language models (LLMs) reach 60-70% diagnostic accuracy on complex clinical case benchmarks, but accuracy alone cannot distinguish stable clinically-grounded reasoning from pattern matching. We introduce clinical reasoning graphs, structured graph representations extracted from free-text LLM diagnostic traces using a domain-grounded ontology with 5 node types and 7 edge types. We apply this pipeline to 750 traces from five LLMs across 50 New England Journal of Medicine Clinicopathological Conference cases and three prompt conditions, and test whether diagnostic traces show stable structured reasoning patterns, or diagnostic schemas, for clinically similar cases. We operationalize this as higher graph similarity among clinically similar cases than among clinically dissimilar ones. Across 15 model-condition comparisons, within-cluster and between-cluster composite similarity are nearly equal, and no comparison survives multiple-testing correction; a component-level analysis finds any residual content signal far below schema scale. Graph similarity is also nearly identical for pairs of models that are both correct (0.488) and both incorrect (0.484), suggesting that graph structure captures a dimension not reflected in diagnostic accuracy. Structured reflection prompting increases explicit discriminating-feature analysis within traces (+33%) but does not increase cross-case consistency. These results show diagnostic competence without schema-scale reasoning consistency, and indicate that final-answer accuracy should be complemented by process-level evaluation. We release the ontology, extraction pipeline, validation protocol, and the extracted reasoning graphs and similarity artifacts as resources for structured evaluation of LLM clinical reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces clinical reasoning graphs extracted from free-text LLM diagnostic traces via a fixed domain ontology (5 node types, 7 edge types). It applies the pipeline to 750 traces from five LLMs on 50 NEJM CPC cases under three prompt conditions and tests for stable diagnostic schemas by comparing composite graph similarity within clinically similar case clusters versus between dissimilar ones. Across 15 model-condition comparisons, within- and between-cluster similarities are statistically indistinguishable (none survive multiple-testing correction); component-level signals are weak, and similarity is nearly identical for correct-correct (0.488) versus incorrect-incorrect (0.484) pairs. Structured reflection prompting boosts explicit feature analysis (+33%) but not cross-case consistency. The central claim is diagnostic competence without schema-scale reasoning consistency, advocating process-level evaluation beyond accuracy. The ontology, pipeline, and artifacts are released.

Significance. If the null result on schema consistency is robust, the work supplies a concrete, reproducible method for structured process evaluation of LLM clinical reasoning and demonstrates that accuracy metrics alone are insufficient. The large scale (750 traces, multiple models/conditions, correction for multiple tests) and public release of the extraction pipeline and graphs are concrete strengths that enable follow-on work. The finding that graph structure is orthogonal to correctness also suggests a useful separation of concerns for future benchmarks.

major comments (2)
  1. [Methods (ontology extraction and similarity computation)] The central claim (competence without schema-scale consistency) rests on the composite similarity metric detecting schemas if present. The Methods section on the ontology extraction pipeline defines a fixed 5-node/7-edge representation; however, no positive-control experiment or sensitivity analysis is reported showing that this granularity distinguishes known distinct reasoning structures (e.g., alternative causal chains or feature-weighting patterns) on the same cases. Without such evidence, the observed null (within ≈ between) could arise from representational collapse rather than absence of schemas, directly weakening the interpretation of the 15 comparisons and the correct/incorrect pair result (0.488 vs 0.484).
  2. [Results (component-level analysis)] Results section on component-level analysis: the claim that residual content signal is “far below schema scale” inherits the same 5/7 ontology; if the representation lacks sensitivity, this analysis cannot rule out that finer-grained or alternative graph encodings would reveal consistency. A direct test (e.g., comparison against a higher-resolution extraction or human-annotated schemas) is needed to support the load-bearing conclusion.
minor comments (2)
  1. [Methods] Clarify in the Methods how the 50 cases were clustered into “clinically similar” groups and whether cluster definitions were pre-registered or derived post-hoc from the same traces.
  2. [Results] The abstract states “no comparison survives multiple-testing correction”; report the exact correction method and the raw p-values in a supplementary table for transparency.
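To make the second minor comment concrete, the sketch below shows one common family-wise correction, the Holm step-down procedure, applied to a list of raw p-values. The paper's actual correction method is not stated in this summary, so the choice of Holm is an assumption, and the p-values are invented for illustration.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons (the paper has 15).
raw = [0.01, 0.04, 0.03]
adj = holm_adjust(raw)
survivors = [p <= 0.05 for p in adj]
```

With these toy inputs only the first comparison survives at the 0.05 level, even though all three raw p-values are below 0.05, which is why reporting both raw and adjusted values aids transparency.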

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on validating the sensitivity of our ontology. We respond point-by-point to the major comments below.

point-by-point responses
  1. Referee: The central claim (competence without schema-scale consistency) rests on the composite similarity metric detecting schemas if present. The Methods section on the ontology extraction pipeline defines a fixed 5-node/7-edge representation; however, no positive-control experiment or sensitivity analysis is reported showing that this granularity distinguishes known distinct reasoning structures (e.g., alternative causal chains or feature-weighting patterns) on the same cases. Without such evidence, the observed null (within ≈ between) could arise from representational collapse rather than absence of schemas, directly weakening the interpretation of the 15 comparisons and the correct/incorrect pair result (0.488 vs 0.484).

    Authors: We agree that an explicit positive-control or sensitivity analysis would strengthen claims about the ontology's ability to detect schemas if present. The 5/7 ontology is derived from standard clinical reasoning models, and our results show detectable sensitivity (e.g., +33% explicit feature nodes under structured reflection). To address the concern directly, the revised manuscript will add a sensitivity analysis comparing the current ontology against a coarsened version (merged node types) and report effects on within- vs. between-cluster similarities. revision: yes

  2. Referee: Results section on component-level analysis: the claim that residual content signal is “far below schema scale” inherits the same 5/7 ontology; if the representation lacks sensitivity, this analysis cannot rule out that finer-grained or alternative graph encodings would reveal consistency. A direct test (e.g., comparison against a higher-resolution extraction or human-annotated schemas) is needed to support the load-bearing conclusion.

    Authors: We acknowledge that the component-level claims depend on the chosen representation and that finer-grained encodings or human-annotated schemas could reveal additional consistency. The component results do show condition-specific signals (e.g., prompting effects on features), indicating the graphs are not fully collapsed. A full human-annotation comparison would require new expert effort on the 50 cases and is outside current scope. In revision we will expand the limitations section to discuss this explicitly and identify it as future work. revision: partial
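The coarsening sensitivity analysis proposed in the first response can be sketched as follows: merge some node types, then recompute overlap and see how much similarity changes. The node type names and the merge map are hypothetical, and Jaccard overlap on node sets stands in for the paper's composite metric, which is not specified here.

```python
def coarsen(nodes, merge_map):
    """Relabel node types via merge_map; unmapped types are kept unchanged."""
    return frozenset((merge_map.get(node_type, node_type), label)
                     for node_type, label in nodes)


def jaccard(a, b):
    """Jaccard overlap of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Two hypothetical traces that mention the same findings but type one differently.
trace_a = frozenset({("feature", "nodules"),
                     ("discriminating_feature", "lymphadenopathy")})
trace_b = frozenset({("feature", "nodules"),
                     ("feature", "lymphadenopathy")})

merge_map = {"discriminating_feature": "feature"}  # assumed coarsening

fine = jaccard(trace_a, trace_b)
coarse = jaccard(coarsen(trace_a, merge_map), coarsen(trace_b, merge_map))
```

In this toy case the fine-grained overlap is 1/3 and the coarsened overlap is 1.0. A large shift of this kind under coarsening would indicate that the representation's granularity drives the similarity scores, which is the concern behind the referee's point about representational collapse.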

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of extracted graph similarities

full rationale

The paper defines an extraction pipeline with a fixed ontology (5 node types, 7 edge types), applies it to 750 LLM traces, computes composite graph similarities, and performs direct statistical comparisons (within- vs between-cluster, correct vs incorrect pairs). No equations, fitted parameters, predictions, or derivations are present. The operationalization of 'diagnostic schemas' as higher within-cluster similarity is a testable hypothesis, not a self-definition. No self-citations are load-bearing for any result. The analysis is self-contained against external benchmarks (NEJM CPC cases) with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the chosen ontology captures clinically meaningful reasoning elements and that graph similarity is a meaningful proxy for schema consistency; no free parameters or invented entities with independent evidence are described.

axioms (1)
  • domain assumption: The domain-grounded ontology with 5 node types and 7 edge types adequately represents the clinically relevant elements in free-text LLM diagnostic traces.
    This ontology is the foundation for all graph extraction and subsequent similarity measurements.
invented entities (1)
  • clinical reasoning graphs (no independent evidence)
    purpose: Structured graph representations of LLM diagnostic reasoning traces for measuring cross-case consistency
    New construct introduced to enable the similarity analysis; no independent falsifiable evidence outside the paper is provided.

pith-pipeline@v0.9.1-grok · 5802 in / 1331 out tokens · 30898 ms · 2026-06-30T05:54:17.395371+00:00 · methodology

