arxiv: 2604.22534 · v1 · submitted 2026-04-24 · 💻 cs.LG · cs.AI

FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

Pith reviewed 2026-05-08 12:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords electronic health records · feature engineering · large language models · clinical prediction · irregular time series · ICU data · automated feature generation · privacy-preserving machine learning

The pith

LLMs generate executable feature code from EHR schemas alone to handle irregular clinical data and boost prediction accuracy

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FeatEHR-LLM, a framework that directs large language models to create tabular features from electronic health records by supplying only dataset schemas and task descriptions rather than raw patient records. The model is given specialized tool routines that let it write code suited to the uneven observation times and missing values common in clinical time series. This matters because traditional automated feature methods often fail on real EHR data, while manual engineering demands scarce clinical expertise and risks privacy breaches. The approach runs an iterative loop that validates the generated code before use. Across eight prediction tasks on four ICU datasets, the generated features achieve the highest mean AUROC on seven tasks.
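
As an illustration of the schema-only idea, the sketch below assembles a prompt from variable metadata and a task description, with no patient values involved. The `VariableSpec` structure, the `build_schema_prompt` helper, and the prompt wording are all assumptions made for this example; the paper's actual prompts are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    """Schema entry for one clinical variable (hypothetical structure)."""
    name: str
    unit: str
    description: str


def build_schema_prompt(task: str, variables: list[VariableSpec]) -> str:
    """Assemble a prompt from metadata only; no patient values are passed in."""
    lines = [f"Task: {task}", "Available variables:"]
    for v in variables:
        lines.append(f"- {v.name} ({v.unit}): {v.description}")
    lines.append("Write a Python function that computes one feature per patient.")
    return "\n".join(lines)


# Illustrative schema; the variable names are made up for this sketch.
schema = [
    VariableSpec("heart_rate", "bpm", "bedside monitor heart rate"),
    VariableSpec("lactate", "mmol/L", "serum lactate, drawn irregularly"),
]
prompt = build_schema_prompt("predict in-hospital mortality", schema)
print(prompt)
```

The point of the design is that the function signature leaves no path for raw records to reach the model: only names, units, and descriptions go into the prompt.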

Core claim

FeatEHR-LLM leverages large language models to generate clinically meaningful tabular features from irregularly sampled EHR time series. The LLM operates exclusively on dataset schemas and task descriptions, equipped with tool routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. The framework supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline.

What carries the argument

Tool-augmented LLM generation pipeline that supplies specialized routines for irregular temporal queries so the model can output executable code for feature extraction from sparse, unevenly timed clinical records.
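
To show what a routine for irregular temporal queries might look like, here is a minimal sketch of a time-weighted mean that uses the actual gaps between observations, plus two simple sparsity descriptors. The function names and the trapezoidal weighting are assumptions chosen for illustration, not the tool functions shipped with FeatEHR-LLM.

```python
import math


def time_weighted_mean(times, values):
    """Trapezoidal time-weighted mean over irregular timestamps.

    `times` must be strictly increasing (e.g. hours since admission).
    Returns NaN when there are no observations, so sparsity stays visible.
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if not times:
        return math.nan
    if len(times) == 1:
        return float(values[0])
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        # Each interval contributes in proportion to its real duration.
        area += 0.5 * (values[i] + values[i - 1]) * dt
    return area / (times[-1] - times[0])


def sparsity_features(times, window_end):
    """Missingness descriptors: observation count and time since last sample."""
    if not times:
        return {"n_obs": 0, "hours_since_last": math.nan}
    return {"n_obs": len(times), "hours_since_last": window_end - times[-1]}


# Three heart-rate readings at 0 h, 1 h, and 10 h after admission (toy values).
t = [0.0, 1.0, 10.0]
hr = [100.0, 160.0, 80.0]
twm = time_weighted_mean(t, hr)   # weights the long 1 h to 10 h gap heavily
plain = sum(hr) / len(hr)         # treats the samples as evenly spaced
sparse = sparsity_features(t, window_end=24.0)
```

On these toy readings the two averages differ because the unweighted mean implicitly assumes regular sampling, which is the assumption the paper says its tools avoid.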

Load-bearing premise

An LLM given only schemas and task descriptions plus tool routines will reliably output correct, clinically useful code that properly manages irregular sampling and sparsity without hallucinations or invalid syntax.
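
One way to probe this premise is to run each generated candidate through an automatic check before it is used. The sketch below is an assumed, simplified harness: it classifies a candidate as a syntax failure, a runtime failure, a logic failure (wrong output type or infinite value), or acceptable. The category names and the `validate_feature_code` helper are hypothetical and are not taken from the paper.

```python
import ast
import math


def validate_feature_code(source, func_name, sample_records):
    """Classify one candidate as 'ok', 'syntax', 'runtime', or 'logic'.

    `sample_records` are synthetic, schema-conforming inputs, not patient data.
    """
    try:
        ast.parse(source)
    except SyntaxError:
        return "syntax"
    namespace = {"math": math}
    try:
        # exec runs arbitrary code; a real pipeline would need a sandbox.
        exec(compile(source, "<candidate>", "exec"), namespace)
        func = namespace[func_name]
        outputs = [func(rec) for rec in sample_records]
    except Exception:
        return "runtime"
    for out in outputs:
        # A usable tabular feature is a real number; NaN is allowed for "no data".
        if isinstance(out, bool) or not isinstance(out, (int, float)):
            return "logic"
        if math.isinf(out):
            return "logic"
    return "ok"


samples = [
    {"times": [0.0, 2.5], "values": [1.1, 2.3]},
    {"times": [], "values": []},  # patient with no measurements
]

good = "def feat(rec):\n    v = rec['values']\n    return max(v) if v else math.nan\n"
crashes_on_empty = "def feat(rec):\n    return max(rec['values'])\n"
broken = "def feat(rec) return 1\n"
wrong_type = "def feat(rec):\n    return rec['values']\n"

results = {
    "good": validate_feature_code(good, "feat", samples),
    "crashes_on_empty": validate_feature_code(crashes_on_empty, "feat", samples),
    "broken": validate_feature_code(broken, "feat", samples),
    "wrong_type": validate_feature_code(wrong_type, "feat", samples),
}
```

Including an empty record among the samples matters: a candidate that works on dense data but fails on a patient with no measurements is exactly the kind of sparsity error such a check should surface. A check like this cannot confirm clinical usefulness, only executability and basic output sanity.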

What would settle it

Apply the framework to a fresh collection of ICU datasets and observe either no AUROC gain over baselines or frequent generation of non-executable or semantically wrong feature code.

Figures

Figures reproduced from arXiv: 2604.22534 by Anisoara Ionescu, David Atienza, Hojjat Karami, Jean-Philippe Thiran.

Figure 1: Overview of FeatEHR-LLM.
Figure 2: Performance gain over baselines across different dataset sizes. The x-axis represents the …
Figure 3: Performance gain over baselines across different dataset sizes. The x-axis represents the …
Figure 4: Univariate feature engineering step. Top: prompt used to generate candidate univariate feature …
Figure 5: Multivariate feature engineering step. Top: prompt used to generate clinically relevant questions.
Figure 6: Multivariate feature engineering example. Top: generated question and required variables. Bottom: …
Figure 7: Tool functions available to the LLM. Univariate feature engineering uses only …

Original abstract

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FeatEHR-LLM, a framework that uses LLMs operating solely on dataset schemas and task descriptions (no raw patient data) to generate executable Python code for tabular features from irregularly sampled EHR time series. It equips the LLM with specialized temporal-query tools and an iterative validation-in-the-loop pipeline to handle uneven observation patterns and sparsity. The central empirical claim is that this yields the highest mean AUROC on 7 out of 8 clinical prediction tasks across four ICU datasets, with gains of up to 6 percentage points over strong baselines.

Significance. If the performance claims hold under rigorous verification, the work would demonstrate a practical way to inject clinical domain knowledge into automated feature engineering for real-world EHR without privacy leakage, addressing limitations of prior methods that assume regular sampling. The schema-only + tool-augmented design and public code release are notable strengths that could enable reproducibility and extension to other clinical tasks.

major comments (3)
  1. [§3.2] §3.2 (Validation-in-the-Loop Pipeline): The description of the iterative pipeline does not report any quantitative success rate for generated code (e.g., fraction of LLM outputs that pass both syntax/runtime checks and clinical-logic validation), nor a taxonomy of caught errors. This is load-bearing for the central claim because the reported AUROC gains presuppose that the LLM reliably produces correct temporal aggregations and missingness handling rather than artifacts from flawed feature logic.
  2. [§4] §4 (Experimental Evaluation): The headline result (highest mean AUROC on 7/8 tasks, up to 6pp improvement) is presented without details on baseline implementations, number of random seeds or runs, statistical significance testing (p-values or confidence intervals), or exact train/validation/test splits. Without these, it is impossible to determine whether the observed edges are robust or could arise from implementation differences or lucky generations.
  3. [§3.1] §3.1 (Tool-Augmented Generation): The specialized routines for querying irregular temporal data are described at a high level, but no concrete examples or pseudocode are given showing how they enforce time-aware weighting or avoid assuming regular sampling. This matters because any residual assumption of regularity in the generated features would undermine the claim of handling real EHR sparsity.
minor comments (2)
  1. [Abstract and §4] The abstract and §4 refer to 'strong baselines' without naming them or citing their original papers in the main text; adding an explicit comparison table with references would improve clarity.
  2. [§4] Notation for feature types (univariate vs. multivariate) is introduced but not consistently used when reporting per-task results; a small table mapping generated feature categories to AUROC deltas would help readers trace which LLM-generated features drive the gains.
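
To make major comment 2 concrete, the sketch below shows one way the requested uncertainty could be reported: a paired bootstrap confidence interval for the AUROC difference between two feature sets scored on the same test patients. The toy labels and scores are invented for illustration and are not results from the paper; the helper names and the percentile-interval choice are assumptions.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Percentile 95% CI for AUROC(a) - AUROC(b), resampling patients jointly.

    Assumes `labels` contains both classes; otherwise no resample is usable.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one class
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        deltas.append(a - b)
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    return lo, hi


# Toy data: eight patients, two competing feature sets scored by the same model.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores_a = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
scores_b = [0.1, 0.6, 0.3, 0.7, 0.4, 0.2, 0.8, 0.9]
lo, hi = paired_bootstrap_delta(labels, scores_a, scores_b)
```

Resampling patients jointly keeps the comparison paired, so the interval reflects the difference between methods rather than the variability of each method separately. Repeating the whole procedure across random seeds would address the separate question of run-to-run variance.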

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper's rigor and reproducibility without altering its core contributions.

  1. Referee: [§3.2] §3.2 (Validation-in-the-Loop Pipeline): The description of the iterative pipeline does not report any quantitative success rate for generated code (e.g., fraction of LLM outputs that pass both syntax/runtime checks and clinical-logic validation), nor a taxonomy of caught errors. This is load-bearing for the central claim because the reported AUROC gains presuppose that the LLM reliably produces correct temporal aggregations and missingness handling rather than artifacts from flawed feature logic.

    Authors: We agree that quantitative success metrics and an error taxonomy would better substantiate the pipeline's reliability. In the revised manuscript, we will expand §3.2 to include these details drawn from our experimental logs: the overall fraction of LLM outputs that ultimately pass all validation stages, the distribution of iterations required, and a categorized breakdown of errors (syntax errors, runtime errors from sparsity handling, and logical inconsistencies in temporal queries). This addition will directly address the concern that the AUROC gains might stem from flawed code rather than correct feature logic. revision: yes

  2. Referee: [§4] §4 (Experimental Evaluation): The headline result (highest mean AUROC on 7/8 tasks, up to 6pp improvement) is presented without details on baseline implementations, number of random seeds or runs, statistical significance testing (p-values or confidence intervals), or exact train/validation/test splits. Without these, it is impossible to determine whether the observed edges are robust or could arise from implementation differences or lucky generations.

    Authors: We acknowledge that these experimental details are necessary for assessing robustness. We will revise §4 (and add supporting material in an appendix) to specify: the exact implementations and adaptations of all baselines for irregular sampling; the number of random seeds and runs performed; results of statistical significance tests (including p-values and confidence intervals); and the precise train/validation/test splits employed for each of the four ICU datasets. These additions will allow readers to verify that the reported gains are not artifacts of implementation variance or single-run luck. revision: yes

  3. Referee: [§3.1] §3.1 (Tool-Augmented Generation): The specialized routines for querying irregular temporal data are described at a high level, but no concrete examples or pseudocode are given showing how they enforce time-aware weighting or avoid assuming regular sampling. This matters because any residual assumption of regularity in the generated features would undermine the claim of handling real EHR sparsity.

    Authors: We agree that explicit examples would clarify how the tools avoid regularity assumptions. In the revised §3.1, we will add pseudocode for the core temporal-query routines (e.g., the time-aware aggregation and missingness-handling functions) together with a concrete worked example on sparse, irregularly sampled vital-sign data. The example will demonstrate explicit use of actual time deltas for weighting, without any fixed-interval assumptions, thereby reinforcing the framework's suitability for real EHR sparsity patterns. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

The paper advances an LLM-augmented feature engineering pipeline for irregular EHR data and supports its claims solely through empirical AUROC comparisons on eight clinical prediction tasks across four public ICU datasets. No mathematical derivation, uniqueness theorem, or first-principles result is presented that reduces to fitted parameters, self-definitions, or prior self-citations; the performance numbers are obtained by running the generated code on held-out data and comparing against independent baselines. The framework's internal validation loop operates on syntax/runtime checks rather than re-using the target AUROC metric, so the reported gains are not forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the unverified assumption that LLMs can produce reliable code for irregular time series when restricted to schemas; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • [domain assumption] Large language models can generate correct and clinically useful executable feature-extraction code when supplied with dataset schemas, task descriptions, and specialized query tools.
    This assumption underpins the entire tool-augmented generation pipeline described in the abstract.
  • [ad hoc to paper] The generated features will be clinically meaningful and will improve downstream prediction performance on real EHR tasks.
    This is the core empirical claim but is not derived from first principles.

pith-pipeline@v0.9.0 · 5524 in / 1405 out tokens · 35404 ms · 2026-05-08T12:21:47.977770+00:00 · methodology
