arXiv: 2606.10279 · v1 · pith:LKEMJU6Qnew · submitted 2026-06-09 · 💻 cs.AI · cs.CL · cs.LG

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

Pith reviewed 2026-06-27 13:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords supervised fine-tuning · synthetic rationales · disease prediction · Alzheimer's disease · clinical prediction · language models · explanation supervision · prediction accuracy

The pith

Training language models on synthetic rationales for disease prediction reduces accuracy compared to label-only fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding synthetic explanations during supervised fine-tuning improves clinical prediction by teaching models both the outcome and the supporting logic. On five-year prediction of Alzheimer's disease and related dementias from longitudinal patient records, rationale-based training lowered performance relative to fine-tuning on labels alone. The drop held across model families, data volumes, and even when starting from reasoning-oriented base models. Expert review confirmed the generated rationales were medically accurate and tied to patient evidence, and the same rationales raised accuracy when used only as few-shot examples at inference time rather than as training targets. The authors locate the problem in a mismatch between the goals of producing coherent narratives and optimizing for correct discrimination on the prediction task.
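
To make the three uses of a rationale concrete, here is a minimal sketch of how one patient record could become a label-only training target, a rationale training target, or a few-shot demonstration. The `Record` schema, the prompt wording, and the `build_*` helpers are hypothetical illustrations, not the paper's actual templates.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One participant record (hypothetical schema, for illustration only)."""
    history: str    # serialized longitudinal health history
    rationale: str  # synthetic explanation generated for this record
    label: str      # "yes" / "no" for five-year ADRD onset


# Hypothetical question wording; the paper's prompt is not shown on this page.
QUESTION = "Will this patient develop ADRD within five years? Answer yes or no."


def build_label_only_example(rec: Record) -> dict:
    """SFT pair whose target is the label alone."""
    return {"prompt": f"{rec.history}\n{QUESTION}", "target": rec.label}


def build_rationale_example(rec: Record) -> dict:
    """SFT pair whose target is the rationale followed by the label."""
    target = f"{rec.rationale}\nAnswer: {rec.label}"
    return {"prompt": f"{rec.history}\n{QUESTION}", "target": target}


def build_few_shot_prompt(demos: list[Record], query: Record) -> str:
    """Inference-time prompt: rationales appear only inside demonstrations."""
    blocks = [
        f"{d.history}\n{QUESTION}\n{d.rationale}\nAnswer: {d.label}"
        for d in demos
    ]
    # The query carries no rationale and no label; the model must supply them.
    blocks.append(f"{query.history}\n{QUESTION}")
    return "\n\n".join(blocks)
```

In the first two helpers the rationale either is or is not part of what the training loss is computed on; in the third it never touches the model's parameters, which is the contrast the paper's few-shot experiment relies on.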

Core claim

Across a controlled experiment spanning 504 configurations on ADRD prediction from health histories, rationale-based supervised fine-tuning consistently and substantially reduced prediction performance relative to label-only fine-tuning. The degradation persisted across model families and data scales and was not fixed by selecting a reasoning-oriented base model. Human experts verified that the generated rationales were medically accurate and faithfully grounded in patient-specific evidence. The same rationales improved results when supplied as inference-time demonstrations but not when used as training targets. The root cause is identified as a structural conflict between narrative plausibility and discriminative optimization.

What carries the argument

The structural conflict between narrative plausibility and discriminative optimization

If this is right

  • Label-only fine-tuning should be preferred over rationale-augmented training when the goal is maximum accuracy on clinical prediction tasks.
  • Few-shot presentation of rationales at inference time remains a viable way to leverage explanations without incurring the training penalty.
  • The negative effect on performance is not mitigated by switching to reasoning-oriented base models.
  • The pattern holds across multiple model families and across different scales of training data.
  • Rationale supervision introduces a training dynamic that gains narrative coherence at the cost of discriminative power.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same tension may appear in other prediction settings that require precise pattern detection rather than story-like justification.
  • Training pipelines could separate rationale generation into a distinct stage that does not back-propagate into the main prediction parameters.
  • The finding suggests testing whether the conflict reverses on tasks where narrative structure itself carries predictive value.
  • Results on this specific health dataset motivate direct replication on non-health longitudinal prediction problems to isolate domain effects.

Load-bearing premise

That the observed performance drop arises from an inherent tension between narrative and discriminative objectives rather than from particulars of this dataset or training procedure.

What would settle it

An experiment on a different longitudinal clinical prediction task, such as onset of a non-neurological condition from electronic health records, in which rationale-based fine-tuning matches or exceeds label-only performance would falsify the claimed generality of the conflict.

Figures

Figures reproduced from arXiv: 2606.10279 by Bingxin Zhao, Bingxuan Li, Buxin Su, Cheng Qian, Jin Jin, Yiwei Wang.

Figure 1: Example training records constructed from one participant record. The three columns share the same …
Figure 2: SFT ROC-AUC performance by rationale format and base model. no-rationale is clearly strongest (Figure 2A). Mean ROC-AUC is 0.734 for no-rationale, compared with 0.604 for free-rationale and 0.592 for stepwise-rationale. Both rationale conditions are substantially worse than no-rationale (paired t-test, P = 7.26 × 10^−52 for free-rationale and P = 6.51 × 10^−57 for stepwise-rationale). The same pattern appea…
Figure 3: Parameter-level SFT diagnostics for the additional summary insights. All panels use ROC-AUC as …
Figure 4: Feature-level error analysis for the best no-rationale and free-rationale SFT configurations, using the same …
Figure 5: Few-shot ablation in the training-free setting. Bars show metric means over matched decoding settings …
Figure 6: Feature and generated-rationale analysis for deterministic Qwen3-8B few-shot and Zero-shot with CoT …
Figure 7: Divergent validation examples from the best …
Figure 8: SFT performance by rationale format and by rationale format crossed with training sample size. Panel A …
Figure 9: Additional SFT PR-AUC, F1 score, and Recall by base model and by base model crossed with training …
Figure 10: Zero-shot baseline performance by prompt format and by base model. Panels A–B provide the ROC-AUC …
Figure 11: Zero-shot base-model-by-prompt-format interaction panels for ROC-AUC, PR-AUC, F1 score, and …
Original abstract

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reports a large-scale empirical study (504 configurations) on five-year ADRD prediction from longitudinal health records. It finds that supervised fine-tuning on synthetic rationales consistently and substantially degrades predictive performance relative to label-only fine-tuning, across model families and data scales. Human expert review confirms the rationales are medically accurate and evidence-grounded; the same rationales improve performance in few-shot inference. The authors attribute the degradation to an inherent structural conflict between narrative plausibility and discriminative optimization, and conclude that rationale-based SFT should be used with caution in high-stakes clinical prediction.

Significance. If the core empirical result holds, the work provides a clear counter-example to the common assumption that rationale supervision improves clinical prediction models. The scale of the controlled experiment, the human validation of rationale quality, and the explicit contrast with few-shot inference are notable strengths that make the negative finding credible within the tested setting. The result would usefully constrain expectations for chain-of-thought-style supervision in discriminative medical tasks and motivate more targeted investigation of when rationale data helps versus harms.

major comments (3)
  1. [§5 and §4.2] §5 (Discussion) and §4.2 (Ablation results): The root-cause claim of a 'structural conflict between narrative plausibility and discriminative optimization' is presented as the explanation, yet the manuscript does not report ablations that isolate this mechanism from plausible alternatives such as increased sequence length altering gradient weighting, shifted next-token loss distribution, or format-induced output bias. Without such controls, the interpretive attribution remains under-supported relative to the strength of the empirical degradation result.
  2. [§3 and Table 1] §3 (Experimental setup) and Table 1: The study is restricted to a single longitudinal ADRD dataset. While the 504-configuration sweep is extensive within this domain, the manuscript does not discuss or test whether the observed degradation pattern transfers to other clinical prediction tasks (e.g., shorter-horizon outcomes or tasks where narrative and discriminative objectives may align more closely). This limits the scope of the central claim.
  3. [§4.1] §4.1 (Main results): The performance degradation is reported as 'consistent and substantial,' but the text does not include per-configuration variance, confidence intervals, or statistical significance tests across the 504 runs. Adding these would strengthen the claim that the effect is robust rather than driven by a subset of configurations.
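
The first alternative named in major comment 1, longer target sequences altering gradient weighting, comes down to simple arithmetic. The sketch below assumes the SFT loss is an unweighted mean of token-level cross-entropy over target tokens, which is a common default but is not confirmed as the paper's setup; the token counts are made up and `label_loss_share` is a hypothetical helper.

```python
def label_loss_share(num_rationale_tokens: int, num_label_tokens: int = 1) -> float:
    """Fraction of an equally weighted token-level loss that falls on label tokens.

    Assumes every target token gets the same weight in the mean. That is an
    assumption about the training setup, not something the paper states.
    """
    if num_rationale_tokens < 0 or num_label_tokens <= 0:
        raise ValueError("need a non-negative rationale length and at least one label token")
    total = num_rationale_tokens + num_label_tokens
    return num_label_tokens / total


# Hypothetical target lengths, for illustration only.
for n in (0, 50, 200):
    print(f"{n:>3} rationale tokens -> label share {label_loss_share(n):.4f}")
```

Under this assumption a label-only target puts all of the loss on the label, while a long rationale leaves the label with a small fraction of it. A length-matched control would help separate this dilution effect from the narrative-versus-discriminative conflict the authors propose.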
minor comments (2)
  1. [Figures 2 and 3] Figures 2 and 3: Axis labels and legend entries use inconsistent abbreviations for model families; standardize notation to match the text in §4.
  2. [§2] §2 (Related work): The citation to prior rationale-SFT studies in medicine could be expanded to include recent negative or null results on rationale supervision outside the clinical domain for balance.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important areas for strengthening the interpretation and scope of our findings. We address each major comment below and will incorporate revisions as noted.

Point-by-point responses
  1. Referee: [§5 and §4.2] §5 (Discussion) and §4.2 (Ablation results): The root-cause claim of a 'structural conflict between narrative plausibility and discriminative optimization' is presented as the explanation, yet the manuscript does not report ablations that isolate this mechanism from plausible alternatives such as increased sequence length altering gradient weighting, shifted next-token loss distribution, or format-induced output bias. Without such controls, the interpretive attribution remains under-supported relative to the strength of the empirical degradation result.

    Authors: We agree that isolating the proposed mechanism from alternatives would strengthen the interpretation. The existing few-shot experiments already provide evidence against format-induced bias, as the identical rationales improve performance when used as inference-time demonstrations rather than SFT targets. To further address sequence length and loss distribution effects, we will add controlled ablations in a revised §4.2 (e.g., padding label-only sequences to match rationale lengths and comparing loss distributions), along with expanded discussion in §5. These additions will better support the attribution while acknowledging remaining alternatives. revision: yes

  2. Referee: [§3 and Table 1] §3 (Experimental setup) and Table 1: The study is restricted to a single longitudinal ADRD dataset. While the 504-configuration sweep is extensive within this domain, the manuscript does not discuss or test whether the observed degradation pattern transfers to other clinical prediction tasks (e.g., shorter-horizon outcomes or tasks where narrative and discriminative objectives may align more closely). This limits the scope of the central claim.

    Authors: We acknowledge this limitation on generalizability. In the revision, we will expand the discussion in §3 to justify the choice of five-year ADRD prediction as a canonical high-stakes task where narrative plausibility can conflict with discriminative needs, and add an explicit limitations paragraph noting that transfer to other tasks (such as shorter-horizon outcomes) is an important direction for future work. This will appropriately scope the central claim without overgeneralization. revision: yes

  3. Referee: [§4.1] §4.1 (Main results): The performance degradation is reported as 'consistent and substantial,' but the text does not include per-configuration variance, confidence intervals, or statistical significance tests across the 504 runs. Adding these would strengthen the claim that the effect is robust rather than driven by a subset of configurations.

    Authors: We agree that statistical characterization would strengthen the robustness claim. In the revised manuscript, we will augment §4.1 (and associated tables/figures) with per-configuration standard deviations, bootstrap confidence intervals on the performance deltas, and paired statistical tests (e.g., Wilcoxon signed-rank) across the 504 configurations to demonstrate that the degradation is consistent and statistically significant rather than driven by outliers. revision: yes
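
The statistics promised in the third response could look roughly like the sketch below: a percentile bootstrap interval on the mean paired difference and a Wilcoxon signed-rank test using the normal approximation. It uses only the Python standard library, the function names are hypothetical, and the ROC-AUC values are synthetic numbers generated for the example, not results from the paper.

```python
import math
import random
from statistics import mean


def bootstrap_ci(deltas, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(deltas)
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def wilcoxon_signed_rank(deltas):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped and tied magnitudes share average ranks.
    No tie or continuity correction is applied, so the p-value is rough.
    """
    d = [x for x in deltas if x != 0]
    n = len(d)
    if n == 0:
        raise ValueError("all differences are zero")
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


# Synthetic per-configuration ROC-AUC values (NOT the paper's data).
rng = random.Random(1)
label_only = [0.70 + rng.uniform(-0.05, 0.05) for _ in range(30)]
rationale = [a - 0.10 + rng.uniform(-0.03, 0.03) for a in label_only]
deltas = [r - a for r, a in zip(rationale, label_only)]

print("95% CI for mean delta:", bootstrap_ci(deltas))
print("Wilcoxon W+, p:", wilcoxon_signed_rank(deltas))
```

Pairing matters here: each delta compares the two training conditions within the same configuration, so variation between base models or sample sizes does not inflate the interval. For a real analysis, an established implementation such as `scipy.stats.wilcoxon` would be preferable to this hand-rolled approximation.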

Circularity Check

0 steps flagged

No circularity: purely empirical experimental study

Full rationale

The paper is a controlled empirical study reporting performance comparisons across 504 SFT configurations on longitudinal health data. No derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. Claims rest directly on measured degradation, human rationale validation, and few-shot controls rather than any chain that reduces outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical comparison relying on experimental measurement rather than mathematical axioms or new postulated entities. No free parameters are introduced; the only background assumptions are standard ones about the representativeness of the health dataset and the validity of human expert judgment on rationale quality.

pith-pipeline@v0.9.1-grok · 5741 in / 1114 out tokens · 18915 ms · 2026-06-27T13:47:11.161587+00:00 · methodology

