Resource-Conscious Modeling for Next-Day Discharge Prediction Using Clinical Notes

Alexander Lopez; Ha Na Cho; Hansen Bow; Kai Zheng; Sairam Sutari

arxiv: 2604.03498 · v1 · submitted 2026-04-03 · 💻 cs.AI

Pith reviewed 2026-05-13 19:25 UTC · model grok-4.3

classification 💻 cs.AI

keywords: next-day discharge prediction · clinical notes · TF-IDF · LGBM · LoRA · compact LLMs · imbalanced classification · spine surgery

The pith

TF-IDF paired with LGBM outperforms fine-tuned compact LLMs at predicting next-day discharge from postoperative notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the feasibility of using postoperative clinical notes to predict next-day discharge in elective spine surgery units. It compares traditional TF-IDF vectorization with gradient boosting models against compact LLMs fine-tuned via LoRA. The TF-IDF and LGBM combination delivers the strongest results, with the highest AUC-ROC and a usable F1-score on the discharge class despite class imbalance. This matters because accurate early discharge forecasts can improve bed turnover and resource use in hospitals without requiring heavy computational resources.

Core claim

TF-IDF with LGBM achieved the best balance with an F1-score of 0.47 for the discharge class, recall of 0.51, and the highest AUC-ROC of 0.80, while LoRA improved recall in DistilGPT-2 but overall transformer-based and generative models underperformed the simpler text-based approach.
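The core claim reports F1 and recall for the discharge class but not precision. Since F1 is the harmonic mean of precision and recall, precision can be backed out from the two reported numbers. The sketch below does that arithmetic; because the reported values are rounded to two decimals, the result is only approximate, and it assumes both metrics refer to the same class and decision threshold.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("inconsistent inputs: F1 must be smaller than 2 * recall")
    return f1 * recall / denom


# Values reported for TF-IDF + LGBM on the discharge class.
precision = implied_precision(f1=0.47, recall=0.51)
print(f"implied precision ~ {precision:.2f}")
```

Read this way, roughly half of true next-day discharges are caught, and somewhat fewer than half of the flagged patients are actually discharged.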

What carries the argument

TF-IDF vectorization of clinical notes fed into a LightGBM (LGBM) gradient-boosted tree classifier for binary classification of next-day discharge.
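To make the vectorization step concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It is not the paper's pipeline: the tokenizer, the smoothed IDF variant, and the three toy notes are all assumptions made for illustration, and the classifier stage is left out. In a real setup the resulting sparse vectors would be passed to a gradient-boosted tree model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def transform(doc, idf):
    """Return an L2-normalized sparse TF-IDF vector as a dict."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Invented toy notes, not taken from the study data.
notes = [
    "pain controlled ambulating with pt plan discharge home tomorrow",
    "pain uncontrolled drain in place continue iv antibiotics",
    "ambulating independently tolerating diet pain controlled",
]
idf = fit_idf(notes)
vec = transform(notes[0], idf)
print(sorted(vec.items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in every note (here "pain") get the lowest weight, while terms confined to one note (here "discharge") get the highest, which is the property that lets a tree model latch onto rare but telling phrases.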

Load-bearing premise

Postoperative clinical notes contain sufficient and representative signals for next-day discharge without major documentation inconsistencies or selection bias.

What would settle it

Applying the same models to notes from a different hospital or surgical specialty and observing a clear drop in AUC-ROC below 0.70 would show the signals are not general enough.
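The check described above could be run along these lines. The sketch computes AUC-ROC directly from its pairwise definition (the probability that a randomly chosen positive scores above a randomly chosen negative, with ties counted as half) and compares it to the 0.70 floor named in the text. The labels and scores are made-up placeholders, not external validation results.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def clears_floor(external_auc, floor=0.70):
    """True if the external-site AUC stays at or above the stated floor."""
    return external_auc >= floor


# Placeholder external-site predictions (1 = discharged next day).
labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
auc = auc_roc(labels, scores)
print(auc, clears_floor(auc))
```

The pairwise loop is quadratic in the number of notes, which is fine for a sanity check; a rank-based formulation would be the usual choice at scale.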

Figures

Figures reproduced from arXiv: 2604.03498 by Alexander Lopez, Ha Na Cho, Hansen Bow, Kai Zheng, Sairam Sutari.

**Figure 1.** Pipeline for lightweight clinical NLP model evaluation. The adjacent Results text (Section 3) reads: "We evaluated modeling performance across three modeling strategies: TF-IDF with ML classifiers, embedding-based models using MiniLM and BCB, and fine-tuned generative LLMs."

Original abstract

Timely discharge prediction is essential for optimizing bed turnover and resource allocation in elective spine surgery units. This study evaluates the feasibility of lightweight, fine-tuned large language models (LLMs) and traditional text-based models for predicting next-day discharge using postoperative clinical notes. We compared 13 models, including TF-IDF with XGBoost and LGBM, and compact LLMs (DistilGPT-2, Bio_ClinicalBERT) fine-tuned via LoRA. TF-IDF with LGBM achieved the best balance, with an F1-score of 0.47 for the discharge class, a recall of 0.51, and the highest AUC-ROC (0.80). While LoRA improved recall in DistilGPT2, overall transformer-based and generative models underperformed. These findings suggest interpretable, resource-efficient models may outperform compact LLMs in real-world, imbalanced clinical prediction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TF-IDF with LGBM beats the LoRA-tuned small LLMs on next-day discharge from spine surgery notes, but the edge may come from direct label leakage in templated phrases rather than clinical signals.

The main result is that TF-IDF with LGBM beats the fine-tuned small LLMs for next-day discharge prediction from postoperative notes of spine surgery patients, reaching 0.80 AUC-ROC and 0.47 F1 on the positive class. It's a new comparison on this specific dataset and task.

The paper does well by testing a range of models with an eye toward resource use, including XGBoost, LGBM, and LoRA on DistilGPT-2 and Bio_ClinicalBERT. The finding that transformers underperform despite fine-tuning is worth noting for anyone working with limited data or compute in clinical settings. The authors also report some recall gains from LoRA, which keeps the comparison balanced.

The soft spots are the potential for label leakage and missing methodological details. Clinical notes in this context often contain direct references to discharge plans, which a TF-IDF model could exploit without learning deeper patterns. Since there's no feature importance or ablation to remove those signals, the superiority might not reflect better modeling of clinical factors. Being single-center adds to the generalization concern, and the abstract doesn't specify dataset size, cross-validation, or imbalance handling, so the metrics are hard to fully trust. The F1 score remains low, limiting immediate practical impact.

This paper suits readers in health informatics or operations research focused on practical discharge tools. Someone evaluating lightweight NLP options for hospitals would get useful insights from the model rankings. The core idea and results are clear enough to go to peer review, so I'd recommend sending it for review rather than rejecting it outright.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates 13 models, including TF-IDF paired with XGBoost and LGBM as well as compact LLMs (DistilGPT-2, Bio_ClinicalBERT) fine-tuned via LoRA, for next-day discharge prediction from postoperative clinical notes in elective spine surgery. TF-IDF with LGBM is reported as the strongest performer, achieving an F1-score of 0.47, recall of 0.51, and AUC-ROC of 0.80 on the discharge class, while transformer-based models underperformed overall. The authors conclude that resource-efficient traditional models may be preferable to compact LLMs for imbalanced clinical text tasks.

Significance. If the reported metrics prove robust, the work illustrates that interpretable bag-of-words models can deliver competitive performance with lower computational cost than fine-tuned LLMs in real-world clinical note prediction. This supports the value of resource-conscious baselines in healthcare AI and highlights trade-offs between model complexity and practical utility on imbalanced data.

major comments (2)

[Results] Results section: The headline metrics (F1 0.47, recall 0.51, AUC 0.80 for TF-IDF+LGBM) are presented without dataset size, class imbalance ratio, cross-validation procedure, or error bars. These omissions make it impossible to assess whether the model ranking is statistically reliable or reproducible.
[Methods] Methods/Results: No ablation, feature-importance table, or top-token list is provided to test for label leakage from explicit discharge-planning phrases routinely present in postoperative notes (e.g., templated statements such as 'ready for discharge' or 'discharge to home tomorrow'). Without such controls, the reported superiority of TF-IDF+LGBM cannot be confidently attributed to clinical reasoning rather than direct encoding of the label.
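One way to run the leakage control requested in the second major comment is to mask explicit discharge-planning phrases before vectorization and then compare metrics with and without masking. The sketch below shows only the masking step. The pattern list is hypothetical: it covers the two example phrases quoted by the referee plus one guessed variant, and a real ablation would need a list curated from the site's own note templates.

```python
import re

# Hypothetical phrase list; only the first two come from the referee's examples.
LEAKY_PATTERNS = [
    r"ready\s+for\s+discharge",
    r"discharge\s+to\s+home(\s+tomorrow)?",
    r"plan\s+(for\s+)?discharge",
]
LEAK_RE = re.compile(
    "|".join(f"(?:{p})" for p in LEAKY_PATTERNS), flags=re.IGNORECASE
)


def mask_discharge_phrases(note: str) -> str:
    """Remove explicit discharge-planning phrases and tidy the whitespace."""
    masked = LEAK_RE.sub(" ", note)
    return re.sub(r"\s+", " ", masked).strip()


print(mask_discharge_phrases("Discharge to home tomorrow if PT clears"))
print(mask_discharge_phrases("Pain controlled. Ready for discharge."))
```

If the TF-IDF model's AUC-ROC holds up on masked notes, the ranking is less likely to be an artifact of templated text; a sharp drop would point the other way.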

minor comments (2)

[Abstract] Abstract: The claim that '13 models' were compared is not accompanied by an explicit list or reference to a supplementary table; adding this would aid reproducibility.
[Discussion] Discussion: The single-center, single-unit data source should be stated more explicitly as a limitation on generalizability, together with any steps taken to mitigate documentation variability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below and have revised the manuscript to incorporate the suggested improvements for greater transparency and robustness.

Referee: [Results] Results section: The headline metrics (F1 0.47, recall 0.51, AUC 0.80 for TF-IDF+LGBM) are presented without dataset size, class imbalance ratio, cross-validation procedure, or error bars. These omissions make it impossible to assess whether the model ranking is statistically reliable or reproducible.

Authors: We agree that these details are essential for assessing statistical reliability and reproducibility. The revised manuscript now includes the total number of postoperative clinical notes in the dataset, the class imbalance ratio, a full description of the stratified cross-validation procedure, and error bars (standard deviations across folds) for all reported metrics.
revision: yes
Referee: [Methods] Methods/Results: No ablation, feature-importance table, or top-token list is provided to test for label leakage from explicit discharge-planning phrases routinely present in postoperative notes (e.g., templated statements such as 'ready for discharge' or 'discharge to home tomorrow'). Without such controls, the reported superiority of TF-IDF+LGBM cannot be confidently attributed to clinical reasoning rather than direct encoding of the label.

Authors: This is a valid concern regarding potential label leakage. In the revised manuscript we have added an ablation study that removes common explicit discharge-planning phrases from the notes before re-training and evaluating the models. We also include a feature-importance table and top-token list for the TF-IDF+LGBM model to show the broader clinical features contributing to predictions beyond direct label encoding.
revision: yes
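The stratified cross-validation and fold-wise error bars promised in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the number of folds (5), the seed, and the 20/80 class split are all invented here, since the page gives neither the fold count nor the imbalance ratio.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Assign indices to k folds so each fold keeps the overall class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal this class's indices round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def summarize(fold_scores):
    """Mean and sample standard deviation of a per-fold metric."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Invented imbalance: 20 discharge notes (1) and 80 non-discharge notes (0).
labels = [1] * 20 + [0] * 80
folds = stratified_folds(labels, k=5)
for n, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {n}: {len(fold)} notes, {positives} positive")
```

Each fold serves once as the held-out set; passing the per-fold F1 or AUC-ROC values to `summarize` gives the mean and standard deviation that the referee asked to see next to the headline numbers.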

Circularity Check

0 steps flagged

No circularity: purely empirical model comparison on held-out data

The paper reports an empirical benchmark of 13 models (TF-IDF+LGBM, XGBoost, LoRA-tuned DistilGPT-2, Bio_ClinicalBERT, etc.) for next-day discharge prediction from postoperative notes. All reported metrics (F1 0.47, recall 0.51, AUC-ROC 0.80) are obtained by training on one split and evaluating on a held-out test set; no equations, derivations, or first-principles claims appear. No parameter is fitted to a subset and then re-labeled as a prediction, no self-citation chain is load-bearing for any result, and no ansatz or uniqueness theorem is invoked. The derivation chain is therefore self-contained: performance numbers are direct outputs of standard train/test evaluation rather than reductions to the inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard supervised learning assumptions plus domain-specific premises about clinical text; no new entities or heavy free parameters beyond routine hyperparameter tuning.

free parameters (2)

LoRA rank and alpha
Low-rank adaptation parameters chosen for fine-tuning DistilGPT-2 and Bio_ClinicalBERT
LGBM/XGBoost hyperparameters
Tree depth, learning rate, and regularization parameters tuned on the clinical notes task
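For readers unfamiliar with why LoRA rank and alpha count as free parameters: LoRA replaces a full update of a weight matrix with a low-rank product scaled by alpha / rank, so rank sets the number of trainable parameters and alpha sets the update's magnitude. The arithmetic below is a sketch; the rank of 8, alpha of 16, and the 768 x 768 projection size are assumptions chosen for illustration, since the page does not report the values the authors used.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def lora_scale(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / rank


# Assumed sizes: one 768 x 768 projection, rank 8, alpha 16.
d_out, d_in, rank, alpha = 768, 768, 8, 16
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full matrix: {full} params, LoRA factors: {lora} params "
      f"({100 * lora / full:.2f}%), scale = {lora_scale(alpha, rank)}")
```

Under these assumed sizes the adapter trains about 2% of the parameters of the matrix it modifies, which is the resource saving that makes LoRA a natural candidate for the paper's compact-model comparison.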

axioms (2)

domain assumption: Clinical notes contain predictive textual signals for next-day discharge
Invoked by framing the task as text classification from postoperative notes
domain assumption: Standard train/test split yields unbiased performance estimates
Assumed in reporting F1, recall, and AUC-ROC without further validation details

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

Development and validation of a machine learning model integrated with the clinical workflow for inpatient discharge date prediction

Mahyoub MA, Dougherty K, Yadav RR, Berio-Dorta R and Shukla A. Development and validation of a machine learning model integrated with the clinical workflow for inpatient discharge date prediction. Front. Digit. Health. 2024 Sept;6(1455446), doi: 10.3389/fdgth.2024.1455446

work page doi:10.3389/fdgth.2024.1455446 2024
[2]

Perioperative protocol for elective spine surgery is associated with reduced length of stay and complications

Sivaganesan A, Wick JB, Chotai S, Cherkesky C, Stephens BF, Devin CJ. Perioperative protocol for elective spine surgery is associated with reduced length of stay and complications. J Am Acad Orthop Surg. 2019 Mar 1;27(5):183-189, doi: 10.5435/JAAOS-D-17-00274

work page doi:10.5435/jaaos-d-17-00274 2019
[3]

Predicting individual patient and hospital-level discharge using machine learning

Wei J, Zhou J, Zhang Z, et al. Predicting individual patient and hospital-level discharge using machine learning. Commun Med. 2024;4(1):236. 2024 Nov 18, doi:10.1038/s43856-024-00673-x

work page doi:10.1038/s43856-024-00673-x 2024
[4]

Jung H, Kim Y, Seo J, et al. Clinical assessment of fine-tuned open-source LLMs in cardiology: from progress notes to discharge summary. J Healthc Inform Res. 2025, doi: 10.1007/s41666-025-00203-x

work page doi:10.1007/s41666-025-00203-x 2025
[5]

Publicly available clinical BERT embeddings

Alsentzer E, Murphy J, Boag W, et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop; 2019; Minneapolis, MN: Association for Computational Linguistics. p.72-78, doi: 10.18653/v1/W19-1909

work page doi:10.18653/v1/w19-1909 2019
[6]

The foundational capabilities of large language models in predicting postoperative risks using clinical notes

Alba C, Xue B, Abraham J, Kannampallil T, Lu C. The foundational capabilities of large language models in predicting postoperative risks using clinical notes. NPJ Digit Med. 2025 Feb;8(1):95, doi:10.1038/s41746-025-01489-2

work page doi:10.1038/s41746-025-01489-2 2025
[7]

Lightweight transformers for clinical natural language processing

Rohanian O, Nouriborji M, Jauncey H, et al. Lightweight transformers for clinical natural language processing. Natural Language Engineering. 2024;30(5):887-914, doi:10.1017/S1351324923000542

work page doi:10.1017/s1351324923000542 2024
[8]

The rise of small language models in healthcare: a comprehensive survey

Garg M, Raza S, Rayana S, Liu X, Sohn S. The rise of small language models in healthcare: a comprehensive survey. arXiv. 2025 Apr. Accessed August 28, 2025. https://arxiv.org/abs/2504.17119

work page arXiv 2025
[9]

Examining imbalance effects on performance and demographic fairness of clinical language models

Jones P, Liu W, Huang I-C, et al. Examining imbalance effects on performance and demographic fairness of clinical language models. arXiv. Published December 2024. arXiv:2412.17803

work page arXiv 2024
[10]

Building large-scale registries from unstructured clinical notes using a low-resource natural language processing pipeline

Tavabi N, Pruneski J, Golchin S, et al. Building large-scale registries from unstructured clinical notes using a low-resource natural language processing pipeline. Artif Intell Med. 2024;151:102847, doi:10.1016/j.artmed.2024.102847

work page doi:10.1016/j.artmed.2024.102847 2024
