Reason2Decide: Rationale-Driven Multi-Task Learning
Pith reviewed 2026-05-16 20:48 UTC · model grok-4.3
The pith
Reason2Decide's two-stage training lets smaller models generate clinical predictions together with aligned rationales by pretraining on rationale generation and then applying scheduled sampling for joint label and explanation learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Reason2Decide trains a model first on rationale generation alone and then on the combined task of label prediction plus rationale generation, using scheduled sampling to transition from conditioning on gold labels to conditioning on model predictions; this produces higher F1 scores and higher-fidelity rationales than ordinary fine-tuning while remaining effective with LLM-generated rationales and with models far smaller than contemporary foundation models.
What carries the argument
The Reason2Decide two-stage framework that separates rationale pretraining from joint label-and-rationale training and inserts scheduled sampling in the second stage to reduce exposure bias.
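To make the scheduled-sampling step concrete, here is a minimal Python sketch. It assumes a linear decay of the probability of conditioning on the gold label; the text reviewed here does not state which schedule the authors use, and the names `gold_label_probability` and `choose_conditioning_label` are hypothetical, not taken from the paper's code.

```python
import random


def gold_label_probability(step: int, total_steps: int) -> float:
    """Probability of conditioning on the gold label at a given training step.

    Assumed linear decay from 1.0 (always gold) to 0.0 (always predicted).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def choose_conditioning_label(
    gold: str,
    predicted: str,
    step: int,
    total_steps: int,
    rng: random.Random,
) -> str:
    """Pick the label the rationale decoder is conditioned on in Stage-2."""
    p_gold = gold_label_probability(step, total_steps)
    # rng.random() is in [0, 1), so p_gold == 1.0 always yields the gold label
    # and p_gold == 0.0 always yields the model's own prediction.
    return gold if rng.random() < p_gold else predicted
```

Early in training the rationale is almost always conditioned on the gold label; by the end it is conditioned on the model's own prediction, which is the condition it will face at inference time.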
If this is right
- Models achieve higher F1 on medical prediction tasks than standard fine-tuning baselines.
- Generated rationales show improved fidelity according to BERTScore, BLEU, and LLM-as-a-Judge evaluations.
- Performance advantages persist when models are 40 times smaller than typical foundation models.
- Training remains effective when only LLM-generated rationales are supplied in the first stage.
- Results are robust across LLM-generated, nurse-authored, and nurse-post-processed rationales on triage data.
Where Pith is reading between the lines
- The approach could lower the cost of building explainable clinical systems by substituting LLM rationales for human ones during pretraining.
- Smaller models trained this way might fit into resource-limited hospital or mobile settings that cannot host large foundation models.
- The same separation of rationale learning followed by scheduled joint training could be tested on non-medical reasoning tasks.
- If the method generalizes, it would reduce dependence on scarce human-annotated rationales for training explainable decision models.
Load-bearing premise
Scheduled sampling in the joint-training stage reliably reduces exposure bias without destabilizing learning, and LLM-generated rationales alone supply enough quality to replace human annotations during the initial rationale pretraining stage.
What would settle it
The performance advantage would be refuted if, on the same medical datasets, a standard single-stage fine-tuned model matched or exceeded Reason2Decide in both F1 score and rationale-fidelity metrics.
Original abstract
Despite the wide adoption of Large Language Models (LLMs), clinical decision support systems face a critical challenge: achieving high predictive accuracy while generating explanations aligned with the predictions. Current approaches suffer from exposure bias leading to misaligned explanations. We propose Reason2Decide, a two-stage training framework that addresses key challenges in self-rationalization, including exposure bias and task separation. In Stage-1, our model is trained on rationale generation, while in Stage-2, we jointly train on label prediction and rationale generation, applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions. We evaluate Reason2Decide on three medical datasets, including a proprietary triage dataset and public biomedical QA datasets. Across model sizes, Reason2Decide outperforms other fine-tuning baselines and some zero-shot LLMs in prediction (F1) and rationale fidelity (BERTScore, BLEU, LLM-as-a-Judge). In triage, Reason2Decide is rationale source-robust across LLM-generated, nurse-authored, and nurse-post-processed rationales. In our experiments, while using only LLM-generated rationales in Stage-1, Reason2Decide outperforms other fine-tuning variants. This indicates that LLM-generated rationales are suitable for pretraining models, reducing reliance on human annotations. Remarkably, Reason2Decide achieves these gains with models 40x smaller than contemporary foundation models, making clinical reasoning more accessible for resource-constrained deployments while still providing explainable decision support.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Reason2Decide, a two-stage training framework for self-rationalization in clinical decision support. Stage-1 pretrains the model on rationale generation using LLM-generated rationales; Stage-2 jointly optimizes label prediction and rationale generation while applying scheduled sampling to transition from gold labels to model predictions, aiming to mitigate exposure bias. Experiments on three medical datasets (including a proprietary triage set and public biomedical QA sets) report that Reason2Decide outperforms fine-tuning baselines and some zero-shot LLMs on prediction F1 and rationale metrics (BERTScore, BLEU, LLM-as-a-Judge), achieves source-robustness across rationale types, and delivers these gains with models 40x smaller than contemporary foundation models while using only LLM rationales in Stage-1.
Significance. If the performance and robustness claims are substantiated by isolating controls and statistical tests, the work would demonstrate a practical route to explainable clinical AI on modest hardware, reducing reliance on human rationales and large foundation models. The source-robustness result on triage data would be especially valuable for deployment settings where rationale provenance varies.
major comments (3)
- [§4 (Experiments)] No ablation isolates the scheduled-sampling schedule in Stage-2. The manuscript reports gains over fine-tuning baselines but does not compare against (i) fixed teacher-forcing, (ii) joint multi-task training without any sampling schedule, or (iii) standard fine-tuning that already uses a combined objective. Without these controls it is impossible to attribute improvements to exposure-bias mitigation rather than multi-task regularization or LLM-rationale quality.
- [§4.3–4.4 (Results)] The reported F1, BERTScore, and BLEU improvements lack statistical significance tests, confidence intervals, or details on baseline hyperparameter matching and implementation. The central claim that Reason2Decide “outperforms other fine-tuning baselines … across model sizes” therefore rests on point estimates whose reliability cannot be assessed.
- [§3.2 (Stage-2 training)] The scheduled-sampling transition schedule is described only at a high level; no analysis is given of its stability, sensitivity to the transition rate, or interaction with rationale source (LLM-generated vs. nurse-authored). This is load-bearing for the source-robustness claim on the triage dataset.
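The sensitivity analysis requested in the last comment could start from a small sweep like the Python sketch below. It tabulates the probability of conditioning on the gold label under a linear schedule and under exponential schedules with different decay rates. The function names, the rates, and the checkpoints are illustrative assumptions; none of them come from the paper.

```python
from typing import Dict, List, Sequence


def linear_decay(step: int, total_steps: int) -> float:
    """Gold-label probability falling linearly from 1.0 to 0.0."""
    return max(0.0, 1.0 - step / total_steps)


def exponential_decay(step: int, rate: float) -> float:
    """Gold-label probability rate**step, with 0 < rate < 1."""
    if not 0.0 < rate < 1.0:
        raise ValueError("rate must lie strictly between 0 and 1")
    return rate ** step


def sweep_schedules(
    total_steps: int,
    rates: Sequence[float],
    checkpoints: Sequence[int],
) -> Dict[str, List[float]]:
    """Tabulate each schedule at the given training steps for comparison."""
    table = {"linear": [linear_decay(s, total_steps) for s in checkpoints]}
    for rate in rates:
        # One row per exponential decay rate (hypothetical values).
        table[f"exp(rate={rate})"] = [exponential_decay(s, rate) for s in checkpoints]
    return table
```

Training one model per row of such a table, and repeating across random seeds, would show how quickly the hand-off from gold to predicted labels can happen before F1 or rationale fidelity degrades.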
minor comments (2)
- [Abstract and §1] The phrase “rationale source-robust” is used without a concise operational definition or pointer to the exact metric (e.g., delta in F1 or BERTScore across sources) that establishes robustness.
- [§4.1 (Datasets)] The proprietary triage dataset is described only qualitatively; adding basic statistics (class balance, rationale length distribution) would aid reproducibility assessment.
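One possible operational definition of source-robustness, in the spirit of the first minor comment, is the largest gap in a metric across rationale sources. The Python sketch below is only an illustration of that idea: the function names and the tolerance threshold are assumptions, and the paper may define robustness differently.

```python
from typing import Mapping


def robustness_gap(scores_by_source: Mapping[str, float]) -> float:
    """Largest difference in a metric (e.g. F1) across rationale sources."""
    if len(scores_by_source) < 2:
        raise ValueError("need scores for at least two rationale sources")
    values = list(scores_by_source.values())
    return max(values) - min(values)


def is_source_robust(scores_by_source: Mapping[str, float], tolerance: float) -> bool:
    """Treat the model as source-robust if the gap stays within a tolerance.

    The tolerance is a hypothetical threshold that an author would have to justify.
    """
    return robustness_gap(scores_by_source) <= tolerance
```

A usage example would pass one score per source (LLM-generated, nurse-authored, nurse-post-processed) for a fixed model size, then report the gap alongside the per-source scores.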
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We agree that the current manuscript would benefit from additional ablations, statistical analysis, and expanded discussion of the scheduled sampling procedure. We will incorporate these elements in the revised version to strengthen the attribution of gains and the robustness claims.
Point-by-point responses
- Referee: [§4 (Experiments)] No ablation isolates the scheduled-sampling schedule in Stage-2. The manuscript reports gains over fine-tuning baselines but does not compare against (i) fixed teacher-forcing, (ii) joint multi-task training without any sampling schedule, or (iii) standard fine-tuning that already uses a combined objective. Without these controls it is impossible to attribute improvements to exposure-bias mitigation rather than multi-task regularization or LLM-rationale quality.
  Authors: We agree that isolating the contribution of scheduled sampling is essential. In the revised manuscript we will add the requested controls to §4: (i) fixed teacher-forcing throughout Stage-2, (ii) joint multi-task training without any sampling schedule, and (iii) standard fine-tuning using the combined objective. These ablations will allow clearer attribution of performance gains to exposure-bias mitigation.
  Revision: yes
- Referee: [§4.3–4.4 (Results)] The reported F1, BERTScore, and BLEU improvements lack statistical significance tests, confidence intervals, or details on baseline hyperparameter matching and implementation. The central claim that Reason2Decide “outperforms other fine-tuning baselines … across model sizes” therefore rests on point estimates whose reliability cannot be assessed.
  Authors: We acknowledge the absence of statistical tests and implementation details in the current version. The revision will include bootstrap confidence intervals and appropriate significance tests (e.g., paired t-tests or McNemar’s test) for all reported metrics, together with full hyperparameter search ranges and implementation specifications for every baseline to support fair comparison.
  Revision: yes
- Referee: [§3.2 (Stage-2 training)] The scheduled-sampling transition schedule is described only at a high level; no analysis is given of its stability, sensitivity to the transition rate, or interaction with rationale source (LLM-generated vs. nurse-authored). This is load-bearing for the source-robustness claim on the triage dataset.
  Authors: We will expand §3.2 and add supporting analysis (new table or appendix) that examines schedule stability across random seeds, sensitivity to different transition rates (linear and exponential decay), and performance stratified by rationale source on the triage dataset. This will directly address the interaction with rationale provenance and bolster the source-robustness result.
  Revision: yes
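The bootstrap confidence intervals promised in the second response could be computed along the lines of the Python sketch below. It is a percentile paired bootstrap over test examples for the F1 difference between two systems. The binary F1 with a "yes" positive class, the function names, and the defaults are all assumptions made for illustration; the paper's exact averaging and test procedure are not given in the text reviewed here.

```python
import random
from typing import List, Sequence, Tuple


def f1_score(gold: Sequence[str], pred: Sequence[str], positive: str = "yes") -> float:
    """Binary F1 for one positive class (assumed here to be 'yes')."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(
    gold: Sequence[str],
    pred_a: Sequence[str],
    pred_b: Sequence[str],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for F1(a) - F1(b), resampling the same examples for both."""
    rng = random.Random(seed)
    n = len(gold)
    diffs: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        g = [gold[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(f1_score(g, a) - f1_score(g, b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the resulting interval excludes zero, the F1 gain of system `a` over system `b` is unlikely to be an artifact of the particular test sample; resampling the same indices for both systems is what makes the comparison paired.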
Circularity Check
No significant circularity; empirical claims rest on external baselines
The paper proposes a two-stage training procedure (Stage-1 rationale generation followed by Stage-2 joint training with scheduled sampling) and supports its claims solely through empirical comparisons on three medical datasets against independent fine-tuning baselines and zero-shot LLMs. No equations, derivations, or self-referential definitions appear in the provided text. Performance metrics (F1, BERTScore, BLEU, LLM-as-a-Judge) are standard external measures, not quantities defined by the method's own fitted parameters. LLM-generated rationales are used as input data rather than being tautologically redefined as output. No self-citations are invoked as load-bearing for any uniqueness theorem or ansatz. The method is therefore self-contained against external benchmarks, yielding a normal non-circular finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- scheduled sampling transition schedule
axioms (1)
- domain assumption: LLM-generated rationales are suitable substitutes for human annotations in Stage-1 pretraining
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "We propose Reason2Decide, a two-stage training framework... applying scheduled sampling to gradually transition from conditioning on gold labels to model predictions."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "scheduled sampling mechanism that gradually transitions from gold labels to predicted label conditioning, mitigating exposure bias"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] (arXiv, 2025) Reason2Decide: Rationale-Driven Multi-Task Learning
  Introduction. The integration of reasoning capabilities with prediction tasks has been a critical research problem in natural language processing (NLP). Existing state-of-the-art language models struggle to balance high predictive accuracy while also generating human-interpretable explanations (Niu et al., 2025). Although LLMs have demonstrated strong perf...
- [2] (2016)
  Related Work. Our work builds upon research in rationale generation, multi-task learning, and existing methods to mitigate exposure bias. 2.1. Rationale Generation and Explainable AI. A primary goal of Explainable AI (XAI) is making model decisions transparent and interpretable. The development of early XAI methods focused on post-hoc explanation generation by anal...
- [3] (2023)
  fine-tuned T5 (Raffel et al., 2023) models to generate explanations by treating them as a separate task. Similarly, (Lampinen et al., 2022) demonstrate that incorporating natural language explanations during training can improve reasoning and generalization. However, these approaches often treat the explanation as a secondary output, which creates a...
- [4] (2020)
  Methodology. Our method addresses the problem of combined prediction and rationale generation, where given an input x (clinical note or biomedical question), a model must predict a discrete label y ∈ Y and generate a free-text rationale r which justifies the prediction. To do so, we employ a single encoder-decoder architecture fθ (T5 variants) that handles both ta...
- [5] (2025)
  Experiments. In this section we introduce the datasets, provide implementation details, followed by the experimental results. 4.1. Tasks and Datasets. We evaluate Reason2Decide on one proprietary clinical decision-making task (triage notes), and two public biomedical QA benchmarks (PubMedQA, BioASQ). As models we use T5-Small/Base/Large and zero-shot LLMs as no...
- [6]
  Earache AND [2] MODERATE pain OR SEVERE pain inadequately treated per guideline advice - yes. The patient has an earache with moderate or severe pain inadequately treated according to guideline advice. Persistent severe ear pain without fever or infection signs requires timely medical evaluation to prevent complications and ensure appropriate treatment.
- [7]
  MILD-MODERATE pain AND [2] constant AND
- [8]
  present > 2 hours. The patient has mild to moderate pain that is constant and has been present for more than two hours. Urgent evaluation needed due to persistent abdominal pain, bowel changes, and recent confusion, to rule out serious conditions and ensure appropriate treatment. Table 1: Sample rationale variants. For our experiments, we focus on yesno questio...
- [9]
  Conclusion and Future Work. We have introduced Reason2Decide, a two-stage training framework for LLMs designed to enhance decision quality and interpretability in clinical NLP tasks. Through this framework, the model learns to generate rationales that align with its predictions. Experiments on nurse triage and biomedical QA datasets show that Reason2Decide...
- [10]
  Limitations. While Reason2Decide demonstrates strong performance over other fine-tuning variants, certain limitations remain. Firstly, the datasets used in this work fall under nurse triage and biomedical QA. To effectively assess Reason2Decide, it should be extended to other subdomains in clinical NLP. Secondly, the LLM-generated rationales were not r...
- [11]
  Ethics Statement. During this research, we ensured to follow ethical guidelines for clinical NLP. The proprietary dataset used was de-identified. No personally identifiable information was accessible to the models or researchers during training/evaluation. Only open-source models were used, and were run on local machines. All clinical predictions and ra...
- [12] (2015)
  Bibliographical References. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, page 1171–1179, Cambridge, MA, USA. MIT Press. Yoshua Bengio, Jérôme Lou...
- [13] (2019)
  Tell me why! Explanations support learning relational and causal structure. Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240. Patrick Lewis, Myle Ott, Jingfei Du, and Veselin Stoyanov....
- [14] (2025)
  Wt5?! training text-to-text models to explain their predictions. Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Martin Krallinger, Miguel Rodríguez-Ortega, Eduard Rodriguez-López, Natalia Loukachevitch, Andrey Sakhovskiy, Elena Tutubalina, Dimitris Dimitriadis, Grigorios Tsoumakas, George Giannakoulas, Alexandra Bekiaridou, Athanasios Sam...
- [15]
  Appendix. 9.1. Prompts Used. Prompt for Post-Processing Nurse-Authored Rationales: You are a helpful assistant who expands brief medical notes into full, grammatically correct sentences using fewer than 20 words. Do not add new information. Convert this to a sentence without adding new information: [RATIONALE] This created the nurse post-processed rationale varian...