FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records

Anisoara Ionescu; David Atienza; Hojjat Karami; Jean-Philippe Thiran

arxiv: 2604.22534 · v1 · submitted 2026-04-24 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-08 12:21 UTC · model grok-4.3

keywords: electronic health records, feature engineering, large language models, clinical prediction, irregular time series, ICU data, automated feature generation, privacy-preserving machine learning

The pith

LLMs generate executable feature code from EHR schemas alone to handle irregular clinical data and boost prediction accuracy

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FeatEHR-LLM, a framework that directs large language models to create tabular features from electronic health records while supplying only dataset schemas and task descriptions, never raw patient records. The model is given specialized tool routines so that the code it writes explicitly handles the uneven observation times and missing values common in clinical time series. This matters because existing automated feature methods either lack clinical domain awareness or assume clean, regularly sampled inputs, while manual engineering demands scarce clinical expertise and risks privacy breaches. The approach runs an iterative loop that validates the generated code before use. Across eight prediction tasks on four ICU datasets, the generated features achieve the highest mean AUROC on seven tasks.

Core claim

FeatEHR-LLM leverages large language models to generate clinically meaningful tabular features from irregularly sampled EHR time series. The LLM operates exclusively on dataset schemas and task descriptions, equipped with tool routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. The framework supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline.
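To make the schema-only constraint concrete, here is a minimal sketch of how a prompt could be assembled from metadata alone, so that no patient measurement ever reaches the model. The `VariableSpec` fields, the `build_prompt` helper, the tool names, and the prompt wording are all hypothetical illustrations; they are not the paper's actual prompt templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableSpec:
    """Schema entry for one EHR variable (hypothetical fields, metadata only)."""
    name: str
    unit: str
    description: str


def build_prompt(task: str, schema: list[VariableSpec], tools: list[str]) -> str:
    """Assemble an LLM prompt from metadata only; no patient values are included."""
    lines = [f"Task: {task}", "Variables:"]
    for var in schema:
        lines.append(f"- {var.name} [{var.unit}]: {var.description}")
    lines.append("Available tool functions: " + ", ".join(tools))
    lines.append("Write a Python function that returns one numeric feature per patient.")
    return "\n".join(lines)


# Illustrative schema; variable names and descriptions are assumptions.
schema = [
    VariableSpec("heart_rate", "bpm", "Heart rate, charted at irregular intervals"),
    VariableSpec("lactate", "mmol/L", "Serum lactate, measured only when ordered"),
]
prompt = build_prompt(
    task="Predict in-hospital mortality from the first 48 hours of an ICU stay",
    schema=schema,
    tools=["time_weighted_mean", "hours_since_last"],
)
print(prompt)
```

The point of the design is that everything in `prompt` could be published alongside the dataset documentation; the generated code only meets real records later, when it is executed locally.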

What carries the argument

A tool-augmented LLM generation pipeline supplies specialized routines for irregular temporal queries, so the model can output executable feature-extraction code for sparse, unevenly timed clinical records.
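As a rough illustration of what time-aware routines of this kind might look like, the sketch below weights each measurement by how long it stayed current and exposes the gap since the last measurement as a sparsity signal. The names `time_weighted_mean` and `hours_since_last` and the `(time, value)` record layout are assumptions made for this example; the paper's own tool functions are not reproduced on this page.

```python
from typing import Optional, Sequence, Tuple

Observation = Tuple[float, float]  # (hours since admission, measured value)


def time_weighted_mean(obs: Sequence[Observation], t_end: float) -> Optional[float]:
    """Mean of a last-value-carried-forward signal from the first observation to t_end.

    Each value is weighted by the time it remained the most recent measurement,
    so a burst of closely spaced readings does not dominate the average.
    Returns None when nothing was observed at or before t_end.
    """
    pts = sorted((t, v) for t, v in obs if t <= t_end)
    if not pts:
        return None
    total, duration = 0.0, 0.0
    for (t, v), (t_next, _) in zip(pts, pts[1:] + [(t_end, 0.0)]):
        dt = t_next - t  # actual elapsed time, no fixed grid assumed
        total += v * dt
        duration += dt
    if duration == 0.0:
        return pts[-1][1]  # only an instant is covered; fall back to that value
    return total / duration


def hours_since_last(obs: Sequence[Observation], t_end: float) -> Optional[float]:
    """Time elapsed since the most recent measurement (None if never measured)."""
    times = [t for t, _ in obs if t <= t_end]
    return t_end - max(times) if times else None


# Heart rate charted once, then three times in quick succession near hour 10.
heart_rate = [(0.0, 60.0), (8.0, 100.0), (9.0, 110.0), (10.0, 120.0)]
print(time_weighted_mean(heart_rate, 10.0))    # 69.0 (the unweighted mean would be 97.5)
print(hours_since_last([(2.0, 3.1)], 10.0))    # 8.0
```

Returning `None` rather than a default value keeps "never measured" distinguishable from "measured and normal", which is one way to preserve informative sparsity for the downstream model.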

Load-bearing premise

An LLM given only schemas and task descriptions plus tool routines will reliably output correct, clinically useful code that properly manages irregular sampling and sparsity without hallucinations or invalid syntax.
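A validation-in-the-loop check is what stands between this premise and the downstream results. The following is a minimal sketch of such a loop under stated assumptions: candidates must define a function called `extract_feature`, checks are limited to syntax, runtime behaviour on synthetic records, and output type, and `stub_llm` stands in for a real model call. None of these names come from the paper, and a real system would run candidates in a sandbox rather than with a bare `exec`.

```python
import ast
import math
from typing import Callable, List, Optional, Tuple

Record = List[Tuple[float, float]]  # synthetic (time, value) pairs, never real patient data

SYNTHETIC_RECORDS: List[Record] = [
    [(0.0, 80.0), (0.5, 82.0), (7.0, 95.0)],  # irregular spacing
    [(3.0, 70.0)],                            # single observation
    [],                                       # variable never measured
]


def validate(source: str) -> Optional[str]:
    """Return None if the candidate passes, else a coarse error category."""
    try:
        ast.parse(source)
    except SyntaxError:
        return "syntax"
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: unsandboxed; acceptable only in a toy sketch
        func = namespace["extract_feature"]
        outputs = [func(record) for record in SYNTHETIC_RECORDS]
    except Exception:
        return "runtime"
    for out in outputs:
        if out is not None and not (isinstance(out, (int, float)) and math.isfinite(out)):
            return "invalid_output"
    return None


def generate_until_valid(propose: Callable[[Optional[str]], str], max_iters: int = 3):
    """Request code from `propose`, feeding back the last error, until one candidate passes."""
    error: Optional[str] = None
    for attempt in range(1, max_iters + 1):
        source = propose(error)
        error = validate(source)
        if error is None:
            return source, attempt
    return None, max_iters


# Hypothetical candidates: the first ignores empty records, the second handles them.
BAD = "def extract_feature(record):\n    return sum(v for _, v in record) / len(record)\n"
GOOD = (
    "def extract_feature(record):\n"
    "    if not record:\n"
    "        return None\n"
    "    return sum(v for _, v in record) / len(record)\n"
)


def stub_llm(last_error: Optional[str]) -> str:
    """Stand-in for an LLM call: returns the fixed version once it sees an error."""
    return BAD if last_error is None else GOOD


source, attempts = generate_until_valid(stub_llm)
print(attempts)  # 2
```

Note what this sketch cannot catch: a candidate that runs cleanly but encodes the wrong clinical logic passes every check here, which is exactly the failure mode the premise above leaves open.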

What would settle it

Apply the framework to a fresh collection of ICU datasets and observe either no AUROC gain over baselines or frequent generation of non-executable or semantically wrong feature code.

Figures

Figures reproduced from arXiv: 2604.22534 by Anisoara Ionescu, David Atienza, Hojjat Karami, Jean-Philippe Thiran.

**Figure 1.** Overview of FeatEHR-LLM.

… observation times $m_i$ and the set of observed variables $O_{ik}$ can vary across patients and across timestamps. Let $\mathcal{X}$ denote the space of patient records $x_i = (c_i, T_i)$. For any subset of variables $S$, let $T_i^{(S)}$ denote the restriction of $T_i$ to measurements from variables in $S$. Our goal is to learn a feature map $\phi : \mathcal{X} \to \mathbb{R}^d$ that converts each patient record into a fixed-length represent…

**Figure 2.** Performance gain over baselines across different dataset sizes. The x-axis represents the …

**Figure 3.** Performance gain over baselines across different dataset sizes. The x-axis represents the …

**Figure 4.** Univariate feature engineering step. Top: prompt used to generate candidate univariate feature …

**Figure 5.** Multivariate feature engineering step. Top: prompt used to generate clinically relevant questions.

**Figure 6.** Multivariate feature engineering example. Top: generated question and required variables. Bottom: …

**Figure 7.** Tool functions available to the LLM. Univariate feature engineering uses only …

Original abstract

Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces schema-only LLM code generation for irregular EHR features with a validation loop, reports gains on most tasks, but supplies almost no evidence on how reliably the generation step works.

The core idea is using an LLM to write Python feature-extraction code for messy clinical time series while seeing only the dataset schema and task description. This keeps patient data private and tries to handle irregular sampling and sparsity through tool routines and an iterative check loop. They test it on eight prediction tasks from four ICU datasets and claim the best average AUROC on seven of them, with lifts up to six points over baselines. That setup is new relative to prior automated feature work that either needs raw records or assumes regular grids.

Referee Report

3 major / 2 minor

Summary. The paper introduces FeatEHR-LLM, a framework that uses LLMs operating solely on dataset schemas and task descriptions (no raw patient data) to generate executable Python code for tabular features from irregularly sampled EHR time series. It equips the LLM with specialized temporal-query tools and an iterative validation-in-the-loop pipeline to handle uneven observation patterns and sparsity. The central empirical claim is that this yields the highest mean AUROC on 7 out of 8 clinical prediction tasks across four ICU datasets, with gains of up to 6 percentage points over strong baselines.

Significance. If the performance claims hold under rigorous verification, the work would demonstrate a practical way to inject clinical domain knowledge into automated feature engineering for real-world EHR without privacy leakage, addressing limitations of prior methods that assume regular sampling. The schema-only + tool-augmented design and public code release are notable strengths that could enable reproducibility and extension to other clinical tasks.

major comments (3)

§3.2 (Validation-in-the-Loop Pipeline): The description of the iterative pipeline does not report any quantitative success rate for generated code (e.g., fraction of LLM outputs that pass both syntax/runtime checks and clinical-logic validation), nor a taxonomy of caught errors. This is load-bearing for the central claim because the reported AUROC gains presuppose that the LLM reliably produces correct temporal aggregations and missingness handling rather than artifacts from flawed feature logic.
§4 (Experimental Evaluation): The headline result (highest mean AUROC on 7/8 tasks, up to 6pp improvement) is presented without details on baseline implementations, number of random seeds or runs, statistical significance testing (p-values or confidence intervals), or exact train/validation/test splits. Without these, it is impossible to determine whether the observed edges are robust or could arise from implementation differences or lucky generations.
§3.1 (Tool-Augmented Generation): The specialized routines for querying irregular temporal data are described at a high level, but no concrete examples or pseudocode are given showing how they enforce time-aware weighting or avoid assuming regular sampling. This matters because any residual assumption of regularity in the generated features would undermine the claim of handling real EHR sparsity.
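The statistical reporting requested in the §4 comment can be sketched with nothing beyond the standard library. The example below computes AUROC from its rank definition and a paired percentile bootstrap interval for the AUROC difference between two models on the same test set. The helper names and the toy labels and scores are invented for illustration; they are not results from the paper, and the paper's own evaluation protocol is not described on this page.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and 95% percentile interval for AUROC(a) - AUROC(b)."""
    rng = random.Random(seed)
    n = len(labels)
    deltas: List[float] = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # same resample for both models
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacks one class, so AUROC is undefined; draw again
        a = auroc(ys, [scores_a[i] for i in idx])
        b = auroc(ys, [scores_b[i] for i in idx])
        deltas.append(a - b)
    deltas.sort()
    point = auroc(labels, scores_a) - auroc(labels, scores_b)
    return point, deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


# Toy data, invented purely to exercise the functions.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores_new = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
scores_base = [0.1, 0.6, 0.3, 0.7, 0.2, 0.8, 0.5, 0.9]
point, low, high = paired_bootstrap_delta(labels, scores_new, scores_base)
print(f"delta AUROC = {point:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
```

Resampling patients jointly for both models keeps the comparison paired, so the interval reflects the difference on identical cases rather than two independent noise sources. Variation across random seeds and across repeated LLM generations would need to be reported separately, since a bootstrap over one test set does not capture it.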

minor comments (2)

[Abstract and §4] The abstract and §4 refer to 'strong baselines' without naming them or citing their original papers in the main text; adding an explicit comparison table with references would improve clarity.
[§4] Notation for feature types (univariate vs. multivariate) is introduced but not consistently used when reporting per-task results; a small table mapping generated feature categories to AUROC deltas would help readers trace which LLM-generated features drive the gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper's rigor and reproducibility without altering its core contributions.

Referee: §3.2 (Validation-in-the-Loop Pipeline): The description of the iterative pipeline does not report any quantitative success rate for generated code (e.g., fraction of LLM outputs that pass both syntax/runtime checks and clinical-logic validation), nor a taxonomy of caught errors. This is load-bearing for the central claim because the reported AUROC gains presuppose that the LLM reliably produces correct temporal aggregations and missingness handling rather than artifacts from flawed feature logic.

Authors: We agree that quantitative success metrics and an error taxonomy would better substantiate the pipeline's reliability. In the revised manuscript, we will expand §3.2 to include these details drawn from our experimental logs: the overall fraction of LLM outputs that ultimately pass all validation stages, the distribution of iterations required, and a categorized breakdown of errors (syntax errors, runtime errors from sparsity handling, and logical inconsistencies in temporal queries). This addition will directly address the concern that the AUROC gains might stem from flawed code rather than correct feature logic. revision: yes
Referee: §4 (Experimental Evaluation): The headline result (highest mean AUROC on 7/8 tasks, up to 6pp improvement) is presented without details on baseline implementations, number of random seeds or runs, statistical significance testing (p-values or confidence intervals), or exact train/validation/test splits. Without these, it is impossible to determine whether the observed edges are robust or could arise from implementation differences or lucky generations.

Authors: We acknowledge that these experimental details are necessary for assessing robustness. We will revise §4 (and add supporting material in an appendix) to specify: the exact implementations and adaptations of all baselines for irregular sampling; the number of random seeds and runs performed; results of statistical significance tests (including p-values and confidence intervals); and the precise train/validation/test splits employed for each of the four ICU datasets. These additions will allow readers to verify that the reported gains are not artifacts of implementation variance or single-run luck. revision: yes
Referee: §3.1 (Tool-Augmented Generation): The specialized routines for querying irregular temporal data are described at a high level, but no concrete examples or pseudocode are given showing how they enforce time-aware weighting or avoid assuming regular sampling. This matters because any residual assumption of regularity in the generated features would undermine the claim of handling real EHR sparsity.

Authors: We agree that explicit examples would clarify how the tools avoid regularity assumptions. In the revised §3.1, we will add pseudocode for the core temporal-query routines (e.g., the time-aware aggregation and missingness-handling functions) together with a concrete worked example on sparse, irregularly sampled vital-sign data. The example will demonstrate explicit use of actual time deltas for weighting, without any fixed-interval assumptions, thereby reinforcing the framework's suitability for real EHR sparsity patterns. revision: yes
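The reliability statistics promised in the first response (overall pass fraction, distribution of iterations, categorized error breakdown) amount to a small aggregation over generation logs. A minimal sketch follows; the `GenerationLog` layout, the error category labels, and the log entries themselves are hypothetical placeholders, not data from the paper.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GenerationLog:
    """One feature-generation episode (hypothetical log layout)."""
    feature_id: str
    errors: List[str]  # error category of each failed attempt, in order
    passed: bool       # whether some attempt eventually passed validation


def summarize(logs: List[GenerationLog]) -> Dict[str, object]:
    """Aggregate pass rates, iterations needed, and an error taxonomy."""
    if not logs:
        raise ValueError("no logs to summarize")
    n = len(logs)
    passed = [log for log in logs if log.passed]
    return {
        "pass_rate": len(passed) / n,
        "first_try_rate": sum(1 for log in passed if not log.errors) / n,
        # attempts used by successful episodes: failed attempts plus the passing one
        "iterations": Counter(len(log.errors) + 1 for log in passed),
        "error_taxonomy": Counter(e for log in logs for e in log.errors),
    }


# Made-up log entries for illustration only.
logs = [
    GenerationLog("hr_time_weighted_mean", [], True),
    GenerationLog("lactate_hours_since_last", ["runtime"], True),
    GenerationLog("map_min_last_6h", ["syntax", "runtime"], True),
    GenerationLog("gcs_trend", ["runtime", "runtime", "runtime"], False),
]
summary = summarize(logs)
print(summary["pass_rate"], summary["error_taxonomy"].most_common())
```

Counting errors from failed episodes as well as successful ones matters here: features that never validate are silently dropped from the model, so a taxonomy built only from eventual successes would understate how often the generation step breaks.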

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on external benchmarks

The paper advances an LLM-augmented feature engineering pipeline for irregular EHR data and supports its claims solely through empirical AUROC comparisons on eight clinical prediction tasks across four public ICU datasets. No mathematical derivation, uniqueness theorem, or first-principles result is presented that reduces to fitted parameters, self-definitions, or prior self-citations; the performance numbers are obtained by running the generated code on held-out data and comparing against independent baselines. The framework's internal validation loop operates on syntax/runtime checks rather than re-using the target AUROC metric, so the reported gains are not forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach depends on the unverified assumption that LLMs can produce reliable code for irregular time series when restricted to schemas; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption: Large language models can generate correct and clinically useful executable feature-extraction code when supplied with dataset schemas, task descriptions, and specialized query tools.
This assumption underpins the entire tool-augmented generation pipeline described in the abstract.
ad hoc to paper: The generated features will be clinically meaningful and will improve downstream prediction performance on real EHR tasks.
This is the core empirical claim but is not derived from first principles.



work page doi:10.32604/cmc.2025.068024 2025

[11] [11]

OpenFE: Automated Feature Generation with Expert-level Performance, June 2023

Tianping Zhang, Zheyu Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Jian Li. OpenFE: Automated Feature Generation with Expert-level Performance, June 2023

work page 2023

[12] [12]

The autofeat Python Library for Automated Feature Engineering and Selection, February 2020

Franziska Horn, Robert Pack, and Michael Rieger. The autofeat Python Library for Automated Feature Engineering and Selection, February 2020

work page 2020

[13] [13]

Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering, September 2023

Noah Hollmann, Samuel Müller, and Frank Hutter. Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering, September 2023

work page 2023

[14] [14]

Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning, November 2024

Jaehyun Nam, Kyuyoung Kim, Seunghyuk Oh, Jihoon Tack, Jaehyung Kim, and Jinwoo Shin. Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning, November 2024

work page 2024

[15] [15]

Nikhil Abhyankar, Parshin Shojaee, and Chandan K. Reddy. LLM-FE: Automated Feature Engineering for Tabular Data with LLMs as Evolutionary Optimizers, March 2025

work page 2025

[16] [16]

Learning a Data-Driven Policy Network for Pre-Training Automated Feature Engineering

Liyao Li, Haobo Wang, Liangyu Zha, Qingyi Huang, Sai Wu, Gang Chen, and Junbo Zhao. Learning a Data-Driven Policy Network for Pre-Training Automated Feature Engineering. In The Eleventh International Conference on Learning Representations, September 2022. 13

work page 2022

[17] [17]

Time series feature extraction on basis of scalable hypothesis tests (tsfresh – a python package),

Maximilian Christ, Nils Braun, Julius Neuffer, and Andreas W. Kempa-Liehr. Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package).Neuro- computing, 307:72–77, September 2018. ISSN 0925-2312. doi: 10.1016/j.neucom.2018.03.067

work page doi:10.1016/j.neucom.2018.03.067 2018

[18] [18]

Tsflex: Flexible time series processing & feature extraction, December 2021

Jonas Van Der Donckt, Jeroen Van Der Donckt, Emiel Deprost, and Sofie Van Hoecke. Tsflex: Flexible time series processing & feature extraction, December 2021

work page 2021

[19] [19]

Kats, March 2022

Xiaodong Jiang, Sudeep Srivastava, Sourav Chatterjee, Yang Yu, Jeffrey Handler, Peiyi Zhang, Rohan Bopardikar, Dawei Li, Yanjun Lin, Uttam Thakore, Michael Brundage, Ginger Holt, Caner Komurlu, Rakshita Nagalla, Zhichao Wang, Hechao Sun, Peng Gao, Wei Cheung, Jun Gao, Qi Wang, Marius Guerard, Morteza Kazemi, Yulin Chen, Chong Zhou, Sean Lee, Nikolay Lapte...

work page 2022

[20] [20]

Lubba, Sarab S

Carl H. Lubba, Sarab S. Sethi, Philip Knaute, Simon R. Schultz, Ben D. Fulcher, and Nick S. Jones. Catch22: CAnonical Time-series CHaracteristics, January 2019

work page 2019

[21] [21]

Arik, and Tomas Pfister

Sungwon Han, Jinsung Yoon, Sercan O. Arik, and Tomas Pfister. Large Language Models Can Automatically Engineer Features for Few-Shot Tabular Learning, May 2024

work page 2024

[22] [22]

FeRG-LLM : Feature Engineering by Reason Generation Large Language Models, March 2025

Jeonghyun Ko, Gyeongyun Park, Donghoon Lee, and Kyunam Lee. FeRG-LLM : Feature Engineering by Reason Generation Large Language Models, March 2025

work page 2025

[23] [23]

FAMOSE: A ReAct Approach to Automated Feature Discovery, February 2026

Keith Burghardt, Jienan Liu, Sadman Sakib, Yuning Hao, and Bo Li. FAMOSE: A ReAct Approach to Automated Feature Discovery, February 2026

work page 2026

[24] [24]

ReAct: Synergizing Reasoning and Acting in Language Models, March 2023

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models, March 2023

work page 2023

[25] [25]

Predicting In- Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012

Ikaro Silva, George Moody, Daniel J Scott, Leo A Celi, and Roger G Mark. Predicting In- Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. page 4

work page 2012

[26] [26]

Reyna, Christopher S

Matthew A. Reyna, Christopher S. Josef, Russell Jeter, Supreeth P. Shashikumar, M. Brandon Westover, Shamim Nemati, Gari D. Clifford, and Ashish Sharma. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019.Critical Care Medicine, 48(2):210–217, February 2020. ISSN 0090-3493. doi: 10.1097/CCM.0000000000004145

work page doi:10.1097/ccm.0000000000004145 2019

[27] [27]

Shirly Wang, Matthew B. A. McDermott, Geeticka Chauhan, Marzyeh Ghassemi, Michael C. Hughes, and Tristan Naumann. MIMIC-Extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III. InProceedings of the ACM Conference on Health, Inference, and Learning, pages 222–235, Toronto Ontario Canada, April 2020. ACM. ISBN 978-1-4503-7046-2....

work page doi:10.1145/3368555.3384469 2020

[28] [28]

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database.Scientific Data, 3(1):160035, May 2016. ISSN 2052-4463. doi: 10.1038/sdata.2016.35

work page doi:10.1038/sdata.2016.35 2016

[29] [29]

and Johnson, Alistair E

Tom J. Pollard, Alistair E. W. Johnson, Jesse D. Raffa, Leo A. Celi, Roger G. Mark, and Omar Badawi. The eICU Collaborative Research Database, a freely available multi-center database 14 for critical care research.Scientific Data, 5(1):180178, September 2018. ISSN 2052-4463. doi: 10.1038/sdata.2018.178

work page doi:10.1038/sdata.2018.178 2018

[30] [30]

Shengpu Tang, Parmida Davarmanesh, Yanmeng Song, Danai Koutra, Michael W Sjoding, and Jenna Wiens. Democratizing EHR analyses with FIDDLE: A flexible data-driven preprocessing pipeline for structured clinical data.Journal of the American Medical Informatics Association, 27(12):1921–1934, December 2020. ISSN 1527-974X. doi: 10.1093/jamia/ocaa139

work page doi:10.1093/jamia/ocaa139 1921

[31] [31]

LightGBM: A Highly Efficient Gradient Boosting Decision Tree

Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

work page 2017

[32] [32]

https://deepmind.google/models/gemini/flash/

Gemini 2.5 Flash. https://deepmind.google/models/gemini/flash/

work page

[33] [33]

Lipton, David C

Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzel. Learning to Diagnose with LSTM Recurrent Neural Networks, March 2017

work page 2017

[34] [34]

Satya Narayan Shukla and Benjamin M. Marlin. Multi-Time Attention Networks for Irregularly Sampled Time Series, June 2021

work page 2021

[35] [35]

Roderick J. A. Little and Donald B. Rubin.Statistical Analysis with Missing Data. John Wiley & Sons, April 2019. ISBN 978-0-470-52679-8

work page 2019

[36] [36]

BMJ361, 1479 (2018) https: //doi.org/10.1136/bmj.k1479

Denis Agniel, Isaac S. Kohane, and Griffin M. Weber. Biases in electronic health record data due to processes within the healthcare system: Retrospective observational study.BMJ, 361: k1479, April 2018. ISSN 1756-1833. doi: 10.1136/bmj.k1479

work page doi:10.1136/bmj.k1479 2018

[37] [37]

Lundberg, Gabriel G

Scott M. Lundberg, Gabriel G. Erion, and Su-In Lee. Consistent Individualized Feature Attribution for Tree Ensembles, March 2019

work page 2019

[38] [38]

Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets.Pattern Recognition, 36(6): 1291–1302, June 2003

Robert Bryll, Ricardo Gutierrez-Osuna, and Francis Quek. Attribute bagging: Improving accuracy of classifier ensembles by using random feature subsets.Pattern Recognition, 36(6): 1291–1302, June 2003. ISSN 0031-3203. doi: 10.1016/S0031-3203(02)00121-8

work page doi:10.1016/s0031-3203(02)00121-8 2003

[39] [39]

ControlBurn: Feature Selection by Sparse Forests

Brian Liu, Miaolan Xie, and Madeleine Udell. ControlBurn: Feature Selection by Sparse Forests. InProceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1045–1054, August 2021. doi: 10.1145/3447548.3467387

work page doi:10.1145/3447548.3467387 2021

[40] [40]

Explainable and interpretable artificial intelligence in medicine: A systematic bibliometric review.Discover Artificial Intelligence, 4(1):15, February 2024

Maria Frasca, Davide La Torre, Gabriella Pravettoni, and Ilaria Cutica. Explainable and interpretable artificial intelligence in medicine: A systematic bibliometric review.Discover Artificial Intelligence, 4(1):15, February 2024. ISSN 2731-0809. doi: 10.1007/s44163-024-00114-7

work page doi:10.1007/s44163-024-00114-7 2024

[41] [41]

Al-Mallah, and Sherif Sakr

Radwa Elshawi, Mouaz H. Al-Mallah, and Sherif Sakr. On the interpretability of machine learning-based model for predicting hypertension.BMC Medical Informatics and Decision Making, 19(1):146, July 2019. ISSN 1472-6947. doi: 10.1186/s12911-019-0874-0

work page doi:10.1186/s12911-019-0874-0 2019

[42] [42]

Large language models in biomedicine and healthcare.npj Artificial Intelligence, 1(1):44, December

Juexiao Zhou, Haoyang Li, Siyuan Chen, Zhangtianyi Chen, Zhongyi Han, and Xin Gao. Large language models in biomedicine and healthcare.npj Artificial Intelligence, 1(1):44, December

work page

[43] [43]

doi: 10.1038/s44387-025-00047-1

ISSN 3005-1460. doi: 10.1038/s44387-025-00047-1

work page doi:10.1038/s44387-025-00047-1

[44] [44]

Medical Hallucinations in Foundation Models and Their Impact on Healthcare, November 2025

Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Chanwoo Park, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai 15 Xu, Xin Liu, Chunjong Park, Hyeonhoon Lee, Hae Won Park, Daniel McDuff, Samir Tulebaev, a...

work page 2025

[45] [45]

The Clinicians’ Guide to Large Language Models: A General Perspective With a Focus on Hallucinations.Interactive Journal of Medical Research, 14(1):e59823, January 2025

Dimitri Roustan and François Bastardot. The Clinicians’ Guide to Large Language Models: A General Perspective With a Focus on Hallucinations.Interactive Journal of Medical Research, 14(1):e59823, January 2025. doi: 10.2196/59823

work page doi:10.2196/59823 2025

[46] [46]

What is the risk of ...?

Lisa Pilgram, Samer El Kababji, Dan Liu, and Khaled El Emam. Magnitude and Impact of Hallucinations in Tabular Synthetic Health Data on Prognostic Machine Learning Models: Validation Study.Journal of Medical Internet Research, 27(1):e77893, August 2025. doi: 10.2196/77893. 16 ### Your task is to {TASK} . You are given access to the patient's * {var_name} ...

work page doi:10.2196/77893 2025