FeatEHR-LLM: Leveraging Large Language Models for Feature Engineering in Electronic Health Records
Pith reviewed 2026-05-08 12:21 UTC · model grok-4.3
The pith
LLMs generate executable feature code from EHR schemas alone to handle irregular clinical data and boost prediction accuracy
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FeatEHR-LLM leverages large language models to generate clinically meaningful tabular features from irregularly sampled EHR time series. The LLM operates exclusively on dataset schemas and task descriptions, equipped with tool routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. The framework supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline.
What carries the argument
A tool-augmented LLM generation pipeline: it supplies specialized routines for irregular temporal queries so that the model can output executable feature-extraction code for sparse, unevenly timed clinical records.
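As a rough illustration of what such a routine might compute, the sketch below summarises one irregularly sampled variable from its actual timestamps rather than from a fixed grid. The function name `irregular_summary`, its arguments, and the chosen summary fields are hypothetical; they are not taken from the paper or its repository.

```python
from typing import Dict, List, Optional


def irregular_summary(
    times_h: List[float], values: List[float], window_end_h: float
) -> Dict[str, Optional[float]]:
    """Summarise one variable for one patient up to window_end_h (hours).

    Hypothetical helper: no resampling, no fixed interval is assumed.
    """
    # Keep only observations inside the window, ordered by time.
    obs = sorted((t, v) for t, v in zip(times_h, values) if t <= window_end_h)
    if not obs:
        # "Never measured" can itself be informative, so report it explicitly.
        return {"n_obs": 0, "last_value": None,
                "hours_since_last": None, "mean_gap_h": None}
    ts = [t for t, _ in obs]
    gaps = [b - a for a, b in zip(ts, ts[1:])]  # real gaps between measurements
    return {
        "n_obs": len(obs),
        "last_value": obs[-1][1],
        "hours_since_last": window_end_h - ts[-1],
        "mean_gap_h": sum(gaps) / len(gaps) if gaps else None,
    }


# Lactate measured at 0.5 h, 2 h and 9 h, summarised at 24 h.
print(irregular_summary([0.5, 2.0, 9.0], [1.1, 1.4, 2.3], 24.0))
```

The observation count and the time since the last measurement are included because the text treats sparsity as informative, not only as missing data to be imputed.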
Load-bearing premise
An LLM given only schemas and task descriptions plus tool routines will reliably output correct, clinically useful code that properly manages irregular sampling and sparsity without hallucinations or invalid syntax.
What would settle it
Apply the framework to a fresh collection of ICU datasets and observe either no AUROC gain over baselines or frequent generation of non-executable or semantically wrong feature code.
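One way to quantify the second outcome is to count how many generated snippets fail to parse or crash on a small synthetic input. The following is a minimal sketch under that assumption; `classify_snippet` and `failure_rates` are illustrative names, not part of FeatEHR-LLM, and calling `exec` on model output should only be done inside a sandbox.

```python
import ast
from typing import Dict, List, Tuple


def classify_snippet(source: str, func_name: str, sample_args: Tuple) -> str:
    """Return 'ok', 'syntax_error' or 'runtime_error' for one generated snippet."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "syntax_error"
    namespace: dict = {}
    try:
        # WARNING: executes untrusted code; sandbox this in any real setting.
        exec(compile(tree, "<generated>", "exec"), namespace)
        namespace[func_name](*sample_args)
    except Exception:
        # Includes a missing function name (KeyError) and crashes on the input.
        return "runtime_error"
    return "ok"


def failure_rates(
    snippets: List[str], func_name: str, sample_args: Tuple
) -> Dict[str, float]:
    """Fraction of snippets falling into each outcome class."""
    counts = {"ok": 0, "syntax_error": 0, "runtime_error": 0}
    for source in snippets:
        counts[classify_snippet(source, func_name, sample_args)] += 1
    total = len(snippets)
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


snippets = [
    "def f(xs): return sum(xs) / len(xs)",  # runs
    "def f(xs) return 0",                   # missing colon
    "def f(xs): return xs[5]",              # IndexError on a short input
]
print(failure_rates(snippets, "f", ([1.0, 2.0],)))
```

A check like this only detects non-executable code. Semantically wrong features that run without error would still need clinical review or comparison against reference implementations.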
Abstract
Feature engineering for Electronic Health Records (EHR) is complicated by irregular observation intervals, variable measurement frequencies, and structural sparsity inherent to clinical time series. Existing automated methods either lack clinical domain awareness or assume clean, regularly sampled inputs, limiting their applicability to real-world EHR data. We present FeatEHR-LLM, a framework that leverages Large Language Models (LLMs) to generate clinically meaningful tabular features from irregularly sampled EHR time series. To limit patient privacy exposure, the LLM operates exclusively on dataset schemas and task descriptions rather than raw patient records. A tool-augmented generation mechanism equips the LLM with specialized routines for querying irregular temporal data, enabling it to produce executable feature-extraction code that explicitly handles uneven observation patterns and informative sparsity. FeatEHR-LLM supports both univariate and multivariate feature generation through an iterative, validation-in-the-loop pipeline. Evaluated on eight clinical prediction tasks across four ICU datasets, our framework achieves the highest mean AUROC on 7 out of 8 tasks, with improvements of up to 6 percentage points over strong baselines. Code is available at github.com/hojjatkarami/FeatEHR-LLM.
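To make the schema-only idea concrete, here is a minimal sketch of how a prompt could be assembled from variable metadata and a task description without touching any patient rows. Everything in it is an assumption for illustration: the function `build_prompt`, the metadata keys `unit` and `kind`, the tool names, and the prompt wording do not come from the paper.

```python
from typing import Dict, List


def build_prompt(
    task: str, schema: Dict[str, Dict[str, str]], tools: List[str]
) -> str:
    """Assemble an LLM prompt from metadata only (hypothetical format)."""
    lines = [f"Task: {task}", "Variables (schema only, no patient values):"]
    for name, meta in sorted(schema.items()):
        # Only names, units and variable kinds are exposed to the model.
        lines.append(f"- {name}: unit={meta['unit']}, kind={meta['kind']}")
    lines.append("Available tools: " + ", ".join(tools))
    lines.append("Write a Python function that computes one feature.")
    return "\n".join(lines)


schema = {
    "heart_rate": {"unit": "bpm", "kind": "vital"},
    "lactate": {"unit": "mmol/L", "kind": "lab"},
}
print(build_prompt("predict in-hospital mortality", schema,
                   ["last_value", "time_since_last"]))
```

Because the function receives no measurements, nothing patient-level can reach the prompt through it; whether variable names or units alone could leak sensitive information is a separate question that this sketch does not address.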
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FeatEHR-LLM, a framework that uses LLMs operating solely on dataset schemas and task descriptions (no raw patient data) to generate executable Python code for tabular features from irregularly sampled EHR time series. It equips the LLM with specialized temporal-query tools and an iterative validation-in-the-loop pipeline to handle uneven observation patterns and sparsity. The central empirical claim is that this yields the highest mean AUROC on 7 out of 8 clinical prediction tasks across four ICU datasets, with gains of up to 6 percentage points over strong baselines.
Significance. If the performance claims hold under rigorous verification, the work would demonstrate a practical way to inject clinical domain knowledge into automated feature engineering for real-world EHR without privacy leakage, addressing limitations of prior methods that assume regular sampling. The schema-only + tool-augmented design and public code release are notable strengths that could enable reproducibility and extension to other clinical tasks.
major comments (3)
- [§3.2] §3.2 (Validation-in-the-Loop Pipeline): The description of the iterative pipeline does not report any quantitative success rate for generated code (e.g., fraction of LLM outputs that pass both syntax/runtime checks and clinical-logic validation), nor a taxonomy of caught errors. This is load-bearing for the central claim because the reported AUROC gains presuppose that the LLM reliably produces correct temporal aggregations and missingness handling rather than artifacts from flawed feature logic.
- [§4] §4 (Experimental Evaluation): The headline result (highest mean AUROC on 7/8 tasks, up to 6pp improvement) is presented without details on baseline implementations, number of random seeds or runs, statistical significance testing (p-values or confidence intervals), or exact train/validation/test splits. Without these, it is impossible to determine whether the observed edges are robust or could arise from implementation differences or lucky generations.
- [§3.1] §3.1 (Tool-Augmented Generation): The specialized routines for querying irregular temporal data are described at a high level, but no concrete examples or pseudocode are given showing how they enforce time-aware weighting or avoid assuming regular sampling. This matters because any residual assumption of regularity in the generated features would undermine the claim of handling real EHR sparsity.
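The confidence intervals requested in the §4 comment could, for example, be obtained with a paired bootstrap over test patients. The sketch below is one generic way to do this and is not the authors' evaluation code; `auroc` and `paired_bootstrap_ci` are illustrative names, and the number of resamples and the seed are arbitrary choices.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_ci(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile CI for AUROC(a) - AUROC(b), resampling patients with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("labels must contain both classes")
    rng = random.Random(seed)
    n = len(labels)
    diffs: List[float] = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lost one class; draw again
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)  # same resample for both models keeps the pairing
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[max(int((1 - alpha / 2) * n_boot) - 1, 0)]
    return lo, hi


y = [0, 0, 1, 1, 0, 1, 0, 1]
model_a = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]
model_b = [0.3, 0.5, 0.40, 0.6, 0.6, 0.7, 0.2, 0.5]
print(auroc(y, model_a), auroc(y, model_b))
print(paired_bootstrap_ci(y, model_a, model_b, n_boot=500))
```

An interval that excludes zero would support a reported gain on that task. Variation across LLM generations and random seeds is a separate source of uncertainty and would need repeated runs in addition to this resampling.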
minor comments (2)
- [Abstract and §4] The abstract and §4 refer to 'strong baselines' without naming them or citing their original papers in the main text; adding an explicit comparison table with references would improve clarity.
- [§4] Notation for feature types (univariate vs. multivariate) is introduced but not consistently used when reporting per-task results; a small table mapping generated feature categories to AUROC deltas would help readers trace which LLM-generated features drive the gains.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the paper's rigor and reproducibility without altering its core contributions.
-
Referee: [§3.2] §3.2 (Validation-in-the-Loop Pipeline): The description of the iterative pipeline does not report any quantitative success rate for generated code (e.g., fraction of LLM outputs that pass both syntax/runtime checks and clinical-logic validation), nor a taxonomy of caught errors. This is load-bearing for the central claim because the reported AUROC gains presuppose that the LLM reliably produces correct temporal aggregations and missingness handling rather than artifacts from flawed feature logic.
Authors: We agree that quantitative success metrics and an error taxonomy would better substantiate the pipeline's reliability. In the revised manuscript, we will expand §3.2 to include these details drawn from our experimental logs: the overall fraction of LLM outputs that ultimately pass all validation stages, the distribution of iterations required, and a categorized breakdown of errors (syntax errors, runtime errors from sparsity handling, and logical inconsistencies in temporal queries). This addition will directly address the concern that the AUROC gains might stem from flawed code rather than correct feature logic.
revision: yes
-
Referee: [§4] §4 (Experimental Evaluation): The headline result (highest mean AUROC on 7/8 tasks, up to 6pp improvement) is presented without details on baseline implementations, number of random seeds or runs, statistical significance testing (p-values or confidence intervals), or exact train/validation/test splits. Without these, it is impossible to determine whether the observed edges are robust or could arise from implementation differences or lucky generations.
Authors: We acknowledge that these experimental details are necessary for assessing robustness. We will revise §4 (and add supporting material in an appendix) to specify: the exact implementations and adaptations of all baselines for irregular sampling; the number of random seeds and runs performed; results of statistical significance tests (including p-values and confidence intervals); and the precise train/validation/test splits employed for each of the four ICU datasets. These additions will allow readers to verify that the reported gains are not artifacts of implementation variance or single-run luck.
revision: yes
-
Referee: [§3.1] §3.1 (Tool-Augmented Generation): The specialized routines for querying irregular temporal data are described at a high level, but no concrete examples or pseudocode are given showing how they enforce time-aware weighting or avoid assuming regular sampling. This matters because any residual assumption of regularity in the generated features would undermine the claim of handling real EHR sparsity.
Authors: We agree that explicit examples would clarify how the tools avoid regularity assumptions. In the revised §3.1, we will add pseudocode for the core temporal-query routines (e.g., the time-aware aggregation and missingness-handling functions) together with a concrete worked example on sparse, irregularly sampled vital-sign data. The example will demonstrate explicit use of actual time deltas for weighting, without any fixed-interval assumptions, thereby reinforcing the framework's suitability for real EHR sparsity patterns.
revision: yes
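To show what weighting by actual time deltas can look like in practice, here is a small worked sketch of a time-weighted mean for one vital sign. It is an illustration written for this review, not the pseudocode the authors promise; the name `time_weighted_mean` and the choice of linear interpolation between observations (trapezoidal rule) are assumptions.

```python
from typing import List, Optional


def time_weighted_mean(times_h: List[float], values: List[float]) -> Optional[float]:
    """Mean of a signal over its observed span, weighting by real time gaps.

    Assumes the value changes linearly between consecutive observations.
    """
    if len(times_h) != len(values):
        raise ValueError("times_h and values must have the same length")
    if not values:
        return None  # no observations: leave the feature missing
    pairs = sorted(zip(times_h, values))
    span = pairs[-1][0] - pairs[0][0]
    if span == 0:
        # One time point only: fall back to the plain mean.
        return sum(values) / len(values)
    area = 0.0
    for (t0, v0), (t1, v1) in zip(pairs, pairs[1:]):
        area += 0.5 * (v0 + v1) * (t1 - t0)  # trapezoid over the actual gap
    return area / span


# Heart rate charted three times in the first 2 h, then once at 10 h.
times = [0.0, 1.0, 2.0, 10.0]
rates = [120.0, 110.0, 100.0, 80.0]
print(sum(rates) / len(rates))          # unweighted mean: 102.5
print(time_weighted_mean(times, rates))  # time-weighted mean: 94.0
```

The unweighted mean is pulled toward the densely charted first two hours, while the time-weighted version reflects the long stretch between hours 2 and 10. No fixed sampling interval appears anywhere in the calculation.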
Circularity Check
No circularity: empirical framework evaluated on external benchmarks
The paper advances an LLM-augmented feature engineering pipeline for irregular EHR data and supports its claims solely through empirical AUROC comparisons on eight clinical prediction tasks across four public ICU datasets. No mathematical derivation, uniqueness theorem, or first-principles result is presented that reduces to fitted parameters, self-definitions, or prior self-citations; the performance numbers are obtained by running the generated code on held-out data and comparing against independent baselines. The framework's internal validation loop operates on syntax/runtime checks rather than re-using the target AUROC metric, so the reported gains are not forced by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Large language models can generate correct and clinically useful executable feature-extraction code when supplied with dataset schemas, task descriptions, and specialized query tools.
- ad hoc to paper: The generated features will be clinically meaningful and will improve downstream prediction performance on real EHR tasks.