arXiv: 2409.02136 · v2 · submitted 2024-09-02 · cs.LG · cs.AI · cs.CL

Large Language Models versus Classical Machine Learning: Performance in COVID-19 Mortality Prediction Using High-Dimensional Tabular Data

Mohammadreza Ghaffarzadeh-Esfahani, Mahdi Ghaffarzadeh-Esfahani, Arian Salahi-Niri, Hossein Toreyhi, Zahra Atf, Amirali Mohsenzadeh-Kermani, Mahshad Sarikhani, Zohreh Tajabadi,

Fatemeh Shojaeian, Mohammad Hassan Bagheri, Aydin Feyzi, Mohammadamin Tarighatpayma, Narges Gazmeh, Fateme Heydari, Hossein Afshar, Amirreza Allahgholipour, Farid Alimardani, Ameneh Salehi, Naghmeh Asadimanesh, Mohammad Amin Khalafi, Hadis Shabanipour, Ali Moradi, Sajjad Hossein Zadeh, Omid Yazdani, Romina Esbati, Moozhan Maleki, Danial Samiei Nasr, Amirali Soheili, Hossein Majlesi, Saba Shahsavan, Alireza Soheilipour, Nooshin Goudarzi, Erfan Taherifard, Hamidreza Hatamabadi, Jamil S Samaan, Thomas Savage, Ankit Sakhuja, Ali Soroush, Girish Nadkarni, Ilad Alavi Darazam, Mohamad Amin Pourhoseingholi, Seyed Amir Ahmad Safavi-Naini

classification: cs.LG, cs.AI, cs.CL

keywords: cmls, data, llms, models, performance, high-dimensional, tabular, classical

Abstract

This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral-7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b significantly improved its recall from 1% to 79%, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.
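
As a rough illustration of two ideas the abstract relies on, the sketch below shows one way a tabular record might be converted to text for zero-shot classification, and how F1 follows from precision and recall. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the feature names, and the prompt wording are all hypothetical.

```python
def row_to_prompt(row: dict) -> str:
    """Serialize one tabular record into a plain-text zero-shot prompt.

    Hypothetical helper: the prompt wording is an assumption, not the
    template used in the paper.
    """
    # Skip missing values so the prompt lists only observed features.
    facts = [f"{name}: {value}" for name, value in row.items() if value is not None]
    return (
        "Patient record:\n"
        + "\n".join(facts)
        + "\nQuestion: Did this patient die? Answer 'yes' or 'no'."
    )


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Feature names below are invented for illustration only.
example = {"age": 71, "oxygen_saturation": 88, "d_dimer": None}
prompt = row_to_prompt(example)

# With recall near 1%, F1 stays close to zero even at perfect precision.
low_recall_f1 = f1_score(tp=1, fp=0, fn=99)
```

Because F1 is a harmonic mean, it is pulled toward the smaller of precision and recall, which is why a large gain in recall can move F1 substantially.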
