arxiv: 2512.20074 · v2 · submitted 2025-12-23 · 💻 cs.AI · cs.CL

Reason2Decide: Rationale-Driven Multi-Task Learning

Pith reviewed 2026-05-16 20:48 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multi-task learning, self-rationalization, clinical decision support, scheduled sampling, exposure bias, language models, explainable AI, medical reasoning

The pith

Reason2Decide's two-stage training lets smaller models generate clinical predictions together with aligned rationales by pretraining on rationale generation then applying scheduled sampling for joint label and explanation learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Reason2Decide as a two-stage framework to fix exposure bias in self-rationalizing models for clinical decisions. In the first stage the model learns only to generate rationales; in the second stage it trains jointly on label prediction and rationale generation while scheduled sampling gradually shifts from gold labels to the model's own predictions. Experiments on three medical datasets show gains over standard fine-tuning and some zero-shot large models in both predictive F1 and rationale quality metrics. The gains hold across model sizes, remain stable whether rationales come from LLMs or nurses, and work when only LLM-generated rationales are used in stage one.

Core claim

Reason2Decide trains a model first on rationale generation alone and then on the combined task of label prediction plus rationale generation, using scheduled sampling to transition from conditioning on gold labels to conditioning on model predictions; this produces higher F1 scores and higher-fidelity rationales than ordinary fine-tuning while remaining effective with LLM-generated rationales and with models far smaller than contemporary foundation models.
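
The Stage-2 switch from gold labels to model predictions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `choose_conditioning_label`, the linear ramp, and the placeholder label strings are all introduced here. The source states only that conditioning moves gradually from gold labels to model predictions.

```python
import random

def choose_conditioning_label(gold: str, predicted: str,
                              gold_prob: float, rng: random.Random) -> str:
    """Pick the label that rationale generation is conditioned on.

    With probability `gold_prob` the gold label is used (teacher forcing);
    otherwise the model's own prediction is used.
    """
    if not 0.0 <= gold_prob <= 1.0:
        raise ValueError("gold_prob must lie in [0, 1]")
    return gold if rng.random() < gold_prob else predicted

# Assumption: a linear ramp from 1.0 down to 0.0 over Stage-2.
# The actual schedule used in the paper is not given in the source.
rng = random.Random(0)
total_steps = 5
picks = []
for step in range(total_steps):
    gold_prob = 1.0 - step / (total_steps - 1)
    picks.append(choose_conditioning_label("gold", "pred", gold_prob, rng))
```

Early steps always see the gold label and the final step always sees the prediction; the steps in between mix the two, which is the exposure the model would otherwise only meet at inference time.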

What carries the argument

The Reason2Decide two-stage framework that separates rationale pretraining from joint label-and-rationale training and inserts scheduled sampling in the second stage to reduce exposure bias.

If this is right

  • Models achieve higher F1 on medical prediction tasks than standard fine-tuning baselines.
  • Generated rationales show improved fidelity according to BERTScore, BLEU, and LLM-as-a-Judge evaluations.
  • Performance advantages persist when models are 40 times smaller than typical foundation models.
  • Training remains effective when only LLM-generated rationales are supplied in the first stage.
  • Results are robust across LLM-generated, nurse-authored, and nurse-post-processed rationales on triage data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could lower the cost of building explainable clinical systems by substituting LLM rationales for human ones during pretraining.
  • Smaller models trained this way might fit into resource-limited hospital or mobile settings that cannot host large foundation models.
  • The same separation of rationale learning followed by scheduled joint training could be tested on non-medical reasoning tasks.
  • If the method generalizes, it would reduce dependence on scarce human-annotated rationales for training explainable decision models.

Load-bearing premise

Scheduled sampling in the joint-training stage reliably reduces exposure bias without destabilizing learning, and LLM-generated rationales alone supply enough quality to replace human annotations during the initial rationale pretraining stage.

What would settle it

The performance advantage would be refuted if, on the same medical datasets, a standard single-stage fine-tuned model matched or exceeded Reason2Decide in both F1 score and rationale fidelity metrics.

Figures

Figures reproduced from arXiv: 2512.20074 by H M Quamran Hasan, Housam Khalifa Bashier, Jiayi Dai, Mi-Young Kim, Randy Goebel.

Figure 1: Overview of Reason2Decide. Stage-1 trains rationale generation. Stage-2 jointly predicts labels ...

Original abstract

Despite the wide adoption of Large Language Models (LLMs), clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Reason2Decide, a two-stage training framework for self-rationalization in clinical decision support. Stage-1 pretrains the model on rationale generation using LLM-generated rationales; Stage-2 jointly optimizes label prediction and rationale generation while applying scheduled sampling to transition from gold labels to model predictions, aiming to mitigate exposure bias. Experiments on three medical datasets (including a proprietary triage set and public biomedical QA sets) report that Reason2Decide outperforms fine-tuning baselines and some zero-shot LLMs on prediction F1 and rationale metrics (BERTScore, BLEU, LLM-as-a-Judge), achieves source-robustness across rationale types, and delivers these gains with models 40x smaller than contemporary foundation models while using only LLM rationales in Stage-1.

Significance. If the performance and robustness claims are substantiated by isolating controls and statistical tests, the work would demonstrate a practical route to explainable clinical AI on modest hardware, reducing reliance on human rationales and large foundation models. The source-robustness result on triage data would be especially valuable for deployment settings where rationale provenance varies.

major comments (3)
  1. [§4 (Experiments)] No ablation isolates the scheduled-sampling schedule in Stage-2. The manuscript reports gains over fine-tuning baselines but does not compare against (i) fixed teacher-forcing, (ii) joint multi-task training without any sampling schedule, or (iii) standard fine-tuning that already uses a combined objective. Without these controls it is impossible to attribute improvements to exposure-bias mitigation rather than multi-task regularization or LLM-rationale quality.
  2. [§4.3–4.4 (Results)] The reported F1, BERTScore, and BLEU improvements lack statistical significance tests, confidence intervals, or details on baseline hyperparameter matching and implementation. The central claim that Reason2Decide “outperforms other fine-tuning baselines … across model sizes” therefore rests on point estimates whose reliability cannot be assessed.
  3. [§3.2 (Stage-2 training)] The scheduled-sampling transition schedule is described only at a high level; no analysis is given of its stability, sensitivity to the transition rate, or interaction with rationale source (LLM-generated vs. nurse-authored). This is load-bearing for the source-robustness claim on the triage dataset.
minor comments (2)
  1. [Abstract and §1] The phrase “rationale source-robust” is used without a concise operational definition or pointer to the exact metric (e.g., delta in F1 or BERTScore across sources) that establishes robustness.
  2. [§4.1 (Datasets)] The proprietary triage dataset is described only qualitatively; adding basic statistics (class balance, rationale length distribution) would aid reproducibility assessment.
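
One way to make "rationale source-robust" operational, following the first minor comment, is to report the largest gap in a metric across rationale sources. The sketch below is only an illustration: the function name `robustness_spread`, the tolerance, and the example scores are hypothetical and are not taken from the paper.

```python
def robustness_spread(scores_by_source: dict[str, float]) -> float:
    """Largest gap in one metric across rationale sources.

    A smaller spread means the metric depends less on where the
    rationales came from.
    """
    if not scores_by_source:
        raise ValueError("need at least one rationale source")
    values = list(scores_by_source.values())
    return max(values) - min(values)

# Placeholder F1 values for illustration only; NOT results from the paper.
example_f1 = {
    "llm_generated": 0.80,
    "nurse_authored": 0.78,
    "nurse_post_processed": 0.79,
}
spread = robustness_spread(example_f1)

# Hypothetical tolerance: call the model source-robust if the spread
# stays within three F1 points.
is_robust = spread <= 0.03
```

The same function applies unchanged to BERTScore or any other scalar metric, so a paper could state one spread per metric alongside the chosen tolerance.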

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from additional ablations, statistical analysis, and expanded discussion of the scheduled sampling procedure. We will incorporate these elements in the revised version to strengthen the attribution of gains and the robustness claims.

Point-by-point responses
  1. Referee: [§4 (Experiments)] No ablation isolates the scheduled-sampling schedule in Stage-2. The manuscript reports gains over fine-tuning baselines but does not compare against (i) fixed teacher-forcing, (ii) joint multi-task training without any sampling schedule, or (iii) standard fine-tuning that already uses a combined objective. Without these controls it is impossible to attribute improvements to exposure-bias mitigation rather than multi-task regularization or LLM-rationale quality.

    Authors: We agree that isolating the contribution of scheduled sampling is essential. In the revised manuscript we will add the requested controls to §4: (i) fixed teacher-forcing throughout Stage-2, (ii) joint multi-task training without any sampling schedule, and (iii) standard fine-tuning using the combined objective. These ablations will allow clearer attribution of performance gains to exposure-bias mitigation.
    Revision: yes

  2. Referee: [§4.3–4.4 (Results)] The reported F1, BERTScore, and BLEU improvements lack statistical significance tests, confidence intervals, or details on baseline hyperparameter matching and implementation. The central claim that Reason2Decide “outperforms other fine-tuning baselines … across model sizes” therefore rests on point estimates whose reliability cannot be assessed.

    Authors: We acknowledge the absence of statistical tests and implementation details in the current version. The revision will include bootstrap confidence intervals and appropriate significance tests (e.g., paired t-tests or McNemar’s test) for all reported metrics, together with full hyperparameter search ranges and implementation specifications for every baseline to support fair comparison.
    Revision: yes

  3. Referee: [§3.2 (Stage-2 training)] The scheduled-sampling transition schedule is described only at a high level; no analysis is given of its stability, sensitivity to the transition rate, or interaction with rationale source (LLM-generated vs. nurse-authored). This is load-bearing for the source-robustness claim on the triage dataset.

    Authors: We will expand §3.2 and add supporting analysis (new table or appendix) that examines schedule stability across random seeds, sensitivity to different transition rates (linear and exponential decay), and performance stratified by rationale source on the triage dataset. This will directly address the interaction with rationale provenance and bolster the source-robustness result.
    Revision: yes
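
The statistical checks named in the second response can be sketched with the standard library alone. This is a hedged illustration, not the authors' analysis: it works on per-example correctness flags (so it compares accuracy rather than F1), the function names are hypothetical, and the example flags are synthetic. A bootstrap interval for F1 would follow the same resampling loop but recompute F1 on each resample.

```python
import math
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B).

    Both inputs are 0/1 flags over the SAME test examples, so each
    resample draws example indices once and applies them to both models.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

def mcnemar_exact_p(correct_a, correct_b):
    """Two-sided exact McNemar test computed from the discordant pairs."""
    only_a = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    only_b = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Synthetic correctness flags for illustration only; NOT data from the paper.
model_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
model_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
low, high = paired_bootstrap_ci(model_a, model_b)
p_value = mcnemar_exact_p(model_a, model_b)
```

Pairing matters here: both models are scored on the same examples, so resampling indices jointly keeps the per-example correlation that an unpaired comparison would discard.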

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external baselines

The paper proposes a two-stage training procedure (Stage-1 rationale generation followed by Stage-2 joint training with scheduled sampling) and supports its claims solely through empirical comparisons on three medical datasets against independent fine-tuning baselines and zero-shot LLMs. No equations, derivations, or self-referential definitions appear in the provided text. Performance metrics (F1, BERTScore, BLEU, LLM-as-a-Judge) are standard external measures, not quantities defined by the method's own fitted parameters. LLM-generated rationales are used as input data rather than being tautologically redefined as output. No self-citations are invoked as load-bearing for any uniqueness theorem or ansatz. The method is therefore self-contained against external benchmarks, yielding a normal non-circular finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework depends on the standard assumption that multi-task learning with scheduled sampling improves alignment, plus the domain-specific assumption that LLM-generated rationales suffice for Stage-1 pretraining.

free parameters (1)
  • scheduled sampling transition schedule
    The rate and timing of switching from gold labels to model predictions is a tunable hyperparameter whose specific values are not detailed in the abstract.
axioms (1)
  • domain assumption LLM-generated rationales are suitable substitutes for human annotations in Stage-1 pretraining
    The abstract states that using only LLM-generated rationales still yields outperformance, relying on this unverified quality assumption.
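
The free parameter in the ledger above, the transition schedule, is usually expressed as a decay function for the probability of feeding the gold label. The sketch below shows the three decay families proposed in the original scheduled-sampling paper (Bengio et al., 2015): linear, exponential, and inverse sigmoid. Which family and which constants Reason2Decide uses is not stated in the source, so the default values here are placeholders.

```python
import math

def linear_decay(step: int, k: float = 1.0, c: float = 0.01,
                 floor: float = 0.0) -> float:
    """eps_i = max(floor, k - c * i)."""
    return max(floor, k - c * step)

def exponential_decay(step: int, k: float = 0.99) -> float:
    """eps_i = k ** i, with 0 < k < 1."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must lie strictly between 0 and 1")
    return k ** step

def inverse_sigmoid_decay(step: int, k: float = 10.0) -> float:
    """eps_i = k / (k + exp(i / k)), with k >= 1."""
    if k < 1.0:
        raise ValueError("k must be at least 1")
    # Clamp the exponent so very large step counts cannot overflow.
    exponent = min(step / k, 700.0)
    return k / (k + math.exp(exponent))

# Probability of using the gold label at a few steps, per schedule.
# The constants are placeholders, not values from the paper.
steps = [0, 10, 50, 100]
table = {
    "linear": [linear_decay(s) for s in steps],
    "exponential": [exponential_decay(s) for s in steps],
    "inverse_sigmoid": [inverse_sigmoid_decay(s) for s in steps],
}
```

A sensitivity analysis of the kind the referee requests would sweep the constants of one family and compare F1 and rationale metrics across the sweep.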

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

  1. [1]

    Reason2Decide: Rationale-Driven Multi-Task Learning

    Introduction The integration of reasoning capabilities with prediction tasks has been a critical research problem in natural language processing (NLP). Existing state-of-the-art language models struggle to balance high predictive accuracy while also generating human-interpretable explanations (Niu et al., 2025). Although LLMs have demonstrated strong perf...

  2. [2]

    Related Work Our work builds upon research in rationale generation, multi-task learning, and existing methods to mitigate exposure bias. 2.1. Rationale Generation and Explainable AI A primary goal of Explainable AI (XAI) is making model decisions transparent and interpretable. The development of early XAI methods focused on post-hoc explanation generation by anal...

  3. [3]

    Similarly, (Lampinen et al., 2022) demonstrate that incorporating natural language explanations during training can improve reasoning and generalization

    fine-tuned T5 (Raffel et al., 2023) models to generate explanations by treating them as a separate task. Similarly, (Lampinen et al., 2022) demonstrate that incorporating natural language explanations during training can improve reasoning and generalization. However, these approaches often treat the explanation as a secondary output, which creates a...

  4. [4]

    To do so, we employ a single encoder-decoder architecture f_θ (T5 variants) that handles both tasks

    Methodology Our method addresses the problem of combined prediction and rationale generation, where given an input x (clinical note or biomedical question), a model must predict a discrete label y ∈ Y and generate a free-text rationale r which justifies the prediction. To do so, we employ a single encoder-decoder architecture f_θ (T5 variants) that handles both ta...

  5. [5]

    Experiments In this section we introduce the datasets, provide implementation details, followed by the experimental results. 4.1. Tasks and Datasets We evaluate Reason2Decide on one proprietary clinical decision-making task (triage notes), and two public biomedical QA benchmarks (PubMedQA, BioASQ). As models we use T5-Small/Base/Large and zero-shot LLMs as no...

  6. [6]

    Persistent severe ear pain without fever or infection signs requires timely medical evaluation to prevent complications and ensure appropriate treatment

    Earache AND [2] MODERATE pain OR SEVERE pain inadequately treated per guideline advice - yes The patient has an earache with moderate or severe pain inadequately treated according to guideline advice. Persistent severe ear pain without fever or infection signs requires timely medical evaluation to prevent complications and ensure appropriate treatment

  7. [7]

    MILD- MODERATE pain AND [2] constant AND

  8. [8]

    If the predicted disposition is Go to L&D now, does the generated rationale justify the decision - not home care?

    present > 2 hours The patient has mild to moderate pain that is constant and has been present for more than two hours. Urgent evaluation needed due to persistent abdominal pain, bowel changes, and recent confusion, to rule out serious conditions and ensure appropriate treatment. Table 1: Sample rationale variants. For our experiments, we focus on yesno questio...

  9. [9]

    Through this framework, the model learns to generate rationales that align with its predictions

    Conclusion and Future Work We have introduced Reason2Decide, a two-stage training framework for LLMs designed to enhance decision quality and interpretability in clinical NLP tasks. Through this framework, the model learns to generate rationales that align with its predictions. Experiments on nurse triage and biomedical QA datasets show that Reason2Decide...

  10. [10]

    Firstly, the datasets used in this work fall under nurse triage and biomedical QA

    Limitations While Reason2Decide demonstrates strong performance over other fine-tuning variants, certain limitations remain. Firstly, the datasets used in this work fall under nurse triage and biomedical QA. To effectively assess Reason2Decide, it should be extended to other subdomains in clinical NLP. Secondly, the LLM-generated rationales were not r...

  11. [11]

    The proprietary dataset used was de-identified

    Ethics Statement During this research, we ensured to follow ethical guidelines for clinical NLP. The proprietary dataset used was de-identified. No personally identifiable information was accessible to the models or researchers during training/evaluation. Only open-source models were used, and were run on local machines. All clinical predictions and ra...

  12. [12]

    Bibliographical References Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 1171–1179, Cambridge, MA, USA. MIT Press. Yoshua Bengio, Jérôme Lou...

  13. [13]

    Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang

    Tell me why! Explanations support learning relational and causal structure. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240. Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov....

  14. [14]

    why should i trust you?

    Wt5?! training text-to-text models to explain their predictions. Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodríguez-Ortega, Eduard Rodriguez-López, Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Dimitris Dimitriadis, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Sam...

  15. [15]

    Appendix 9.1. Prompts Used Prompt for Post-Processing Nurse-Authored Rationales: You are a helpful assistant who expands brief medical notes into full, grammatically correct sentences using fewer than 20 words. Do not add new information. Convert this to a sentence without adding new information: [RATIONALE] This created the nurse post-processed rationale varian...