arxiv: 2604.27195 · v1 · submitted 2026-04-29 · 💻 cs.AI

Recognition: unknown

Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

Brad Ye , Bulent Soykan , Gulsah Hancerliogullari Koksalmis , Hsin-Hsiung Huang , Laura J. Brattain

Authors on Pith no claims yet

Pith reviewed 2026-05-07 09:32 UTC · model grok-4.3

classification 💻 cs.AI

keywords TabPFNMild cognitive impairmentAlzheimer conversionLow data learningTADPOLE datasetMachine learningAlzheimer's biomarkersFoundation models

0 comments

The pith

TabPFN outperforms traditional machine learning models when predicting conversion from mild cognitive impairment to Alzheimer's disease in settings with limited training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to determine if TabPFN, a foundation model designed for tabular data, can deliver reliable predictions of which individuals with mild cognitive impairment will develop Alzheimer's disease within three years. It pits TabPFN against several established classifiers on features drawn from the TADPOLE dataset, including patient demographics, APOE4 genotype, MRI brain volumes, cerebrospinal fluid markers, and PET scans. Experiments vary the number of training examples from 50 up to 1000 to highlight behavior in data-scarce conditions. TabPFN records an AUC of 0.892 and sustains high performance at the lowest sample count, while XGBoost, Random Forest, LightGBM, and logistic regression show marked declines. If correct, this indicates that pre-trained tabular models can support early Alzheimer's intervention even when large patient cohorts are unavailable.

Core claim

TabPFN achieves an AUC of 0.892 on the task of predicting 3-year conversion from mild cognitive impairment to Alzheimer's disease using multimodal biomarker data from the TADPOLE dataset and outperforms LightGBM (AUC 0.860) as well as other traditional models, with its advantage becoming pronounced when the training set is restricted to 50 samples.

What carries the argument

TabPFN, the tabular pre-trained foundation network, serving as a classifier evaluated on varying sizes of training data from the TADPOLE collection of Alzheimer's biomarkers.

If this is right

TabPFN demonstrates clear benefits for Alzheimer's conversion prediction when fewer than 200 training samples are available.
Multimodal biomarker integration with TabPFN supports more accurate early forecasts than single-modality approaches would allow.
Foundation models reduce the sample size needed for effective disease progression modeling in neurodegenerative conditions.
The results encourage application of similar pre-trained networks to other data-limited medical prediction problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If TabPFN's low-data performance generalizes to independent cohorts, it could shorten the time required to validate new Alzheimer's therapies by enabling smaller pilot studies.
The approach may transfer to predicting progression in related conditions like frontotemporal dementia or vascular cognitive impairment where longitudinal data is also sparse.
Future work could test whether combining TabPFN with additional modalities such as EEG or genetic sequencing yields further gains beyond the current biomarker set.

Load-bearing premise

Differences in model performance stem from the inherent properties of TabPFN rather than from unequal hyperparameter tuning, inconsistent feature engineering, or any overlap between training and test data in the TADPOLE dataset splits.

What would settle it

Retraining every model under identical hyperparameter tuning protocols on the exact same TADPOLE feature set and splits, then measuring whether the AUC gap at N=50 training samples remains or closes.

read the original abstract

Accurate prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimers Diseases (AD) is essential for early intervention, however, developing reliable conversion predictive models is difficult to develop due to limited longitudinal data availability We evaluate TabPFN (Tabular Pre-Trained Foundation Network) against traditional machine learning methods for predicting 3 year MCI to AD conversion using the TADPOLE dataset derived from ADNI. Using multimodal biomarker features extracted from demographics, APOE4, MRI volumes, CSF markers, and PET imaging, we conducted an experimental comparison across varying training set sizes (N=50 to 1000) and models including XGBoost, Random Forest, LightGBM, and Logistic Regression. TabPFN achieved one the highest performance (AUC=0.892), outperforming LightGBM (AUC=0.860) and demonstrating advantages in low data settings. At N=50 training samples, TabPFN maintained strong AUC while the traditional machine learning models struggles at small training samples. These findings demonstrate that foundation models are promising for disease prediction in data limited scenarios, such as Alzheimers diseases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TabPFN looks more stable than the baselines at N=50 on TADPOLE MCI-to-AD but the reported edge is hard to trust without tuning and split details.

read the letter

The main thing to know is that this paper runs TabPFN against XGBoost, LightGBM, random forest, and logistic regression on the TADPOLE dataset for 3-year MCI-to-AD conversion and finds the pre-trained model holds up better when training data is cut to 50 samples. The overall AUC numbers are 0.892 for TabPFN versus 0.860 for LightGBM, with the gap widening at small N. That low-data angle matters for clinical work where longitudinal samples are scarce, and using the public multimodal TADPOLE features (demographics, APOE4, MRI, CSF, PET) keeps the setup realistic rather than synthetic.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates TabPFN against XGBoost, Random Forest, LightGBM, and Logistic Regression for predicting 3-year conversion from MCI to AD on the TADPOLE dataset derived from ADNI. Using multimodal features (demographics, APOE4, MRI volumes, CSF, PET), it reports TabPFN achieving the highest AUC of 0.892 (vs. LightGBM 0.860) and superior robustness at small training sizes (N=50), claiming advantages for foundation models in data-limited clinical settings.

Significance. If the reported AUC advantage and low-N robustness are attributable to TabPFN rather than experimental artifacts, the work would provide evidence that pre-trained tabular foundation models can improve predictive performance in Alzheimer's research where longitudinal samples are scarce. This could support broader adoption of such models for other data-limited medical prediction tasks.

major comments (3)

[Methods] The experimental comparison (Methods section) does not specify whether equivalent hyperparameter optimization was applied to all baselines; TabPFN is described as used in its default pre-trained form while LightGBM, XGBoost, and Random Forest are known to be sensitive to learning rate, depth, and regularization. This asymmetry could produce the observed 0.032 AUC gap (0.892 vs 0.860) without supporting the claim of architectural superiority.
[Methods] No description is given of the procedure for constructing the N=50 training subsets or the cross-validation scheme used to compute AUC values. Given the longitudinal structure of TADPOLE/ADNI, this omission leaves open the possibility of selection bias or leakage, which directly undermines the central low-data robustness claim.
[Methods] The handling of missing values in the multimodal feature set (MRI, CSF, PET) and any imputation or scaling pipeline is not detailed. Since performance differences are reported across models, unequal preprocessing could explain results independently of model choice.

minor comments (2)

[Abstract] Abstract contains grammatical errors: 'one the highest' should be 'one of the highest' and 'struggles' should be 'struggle'.
[Results] The paper would benefit from reporting confidence intervals or statistical tests for the AUC differences to allow assessment of whether the 0.032 gap is significant.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which have identified important omissions in the Methods section of our manuscript. We have revised the paper to provide the requested details on hyperparameter optimization, data splitting procedures, and preprocessing pipelines. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Methods] The experimental comparison (Methods section) does not specify whether equivalent hyperparameter optimization was applied to all baselines; TabPFN is described as used in its default pre-trained form while LightGBM, XGBoost, and Random Forest are known to be sensitive to learning rate, depth, and regularization. This asymmetry could produce the observed 0.032 AUC gap (0.892 vs 0.860) without supporting the claim of architectural superiority.

Authors: We agree that a fair comparison requires equivalent hyperparameter optimization for the baseline models. TabPFN is specifically designed as a pre-trained foundation model intended for use without tuning, which is central to its value in data-limited clinical settings. To address the concern, the revised manuscript now includes a detailed hyperparameter optimization procedure for XGBoost, Random Forest, LightGBM, and logistic regression. We performed a grid search over key parameters (learning rate, max depth, number of estimators, regularization strength) using 5-fold cross-validation on the training data, selecting the best configuration for each model before final evaluation. The optimized hyperparameters and search ranges are reported in the revised Methods and supplementary material. This ensures the 0.032 AUC difference cannot be attributed to unequal tuning. revision: yes
Referee: [Methods] No description is given of the procedure for constructing the N=50 training subsets or the cross-validation scheme used to compute AUC values. Given the longitudinal structure of TADPOLE/ADNI, this omission leaves open the possibility of selection bias or leakage, which directly undermines the central low-data robustness claim.

Authors: We apologize for the omission of these procedural details. In the revised manuscript we have added a dedicated subsection 'Data Splitting, Subset Construction, and Cross-Validation' in Methods. The full dataset was first split at the patient level (80/20 train/test) using unique subject IDs to prevent any leakage across longitudinal visits. From the training portion, N=50 subsets were constructed by repeated random sampling (10 repetitions) stratified by conversion label to preserve class balance. For each subset size, 5-fold stratified cross-validation was performed within the training data for both hyperparameter tuning and AUC estimation, with folds also respecting patient-level separation. Final reported AUCs are averages over the held-out test set across repetitions. These steps are now fully documented to eliminate concerns about selection bias or leakage. revision: yes
Referee: [Methods] The handling of missing values in the multimodal feature set (MRI, CSF, PET) and any imputation or scaling pipeline is not detailed. Since performance differences are reported across models, unequal preprocessing could explain results independently of model choice.

Authors: We have expanded the Methods section with a complete 'Preprocessing Pipeline' subsection. Missing values in MRI volumes, CSF markers, and PET features were imputed using IterativeImputer (MICE) with 10 iterations, fitted exclusively on the training data to avoid leakage. All features were then standardized with StandardScaler, again fitted only on training data. This identical pipeline was applied uniformly to TabPFN and all baseline models. The revised text specifies the exact imputation strategy, number of iterations, and scaling method, ensuring that reported performance differences arise from model choice rather than preprocessing. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical benchmark with no derivations or self-referential predictions

full rationale

The paper reports an empirical comparison of TabPFN against XGBoost, Random Forest, LightGBM, and Logistic Regression on the fixed public TADPOLE dataset for MCI-to-AD conversion prediction. Performance is measured via AUC across training sizes N=50 to 1000, with no equations, fitted parameters renamed as predictions, ansatzes, or uniqueness theorems. The central claims rest on observed AUC values (e.g., 0.892 vs 0.860) from standard train-test splits; these do not reduce to the inputs by construction. No self-citation chains are load-bearing for the results, and the evaluation uses an external pre-trained model without internal derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the domain assumption that the TADPOLE/ADNI multimodal biomarkers are representative and that standard ML train-test splits avoid leakage; no free parameters or new entities are introduced.

axioms (1)

domain assumption TADPOLE dataset provides unbiased multimodal features suitable for MCI-to-AD conversion modeling
Invoked when extracting demographics, APOE4, MRI volumes, CSF markers, and PET imaging for model training.

pith-pipeline@v0.9.0 · 5521 in / 1273 out tokens · 72066 ms · 2026-05-07T09:32:21.971897+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages · 2 internal anchors

[1]

Machine learning models show promise for predicting disease progression, yet their development faces a fundamental challenge: limited high-quality train- ing data

INTRODUCTION The number of Americans living with Alzheimer’s Disease (AD) is projected to reach 13.8 million by 2060[1], under- scoring the urgent need for improved early detection and in- tervention strategies. Machine learning models show promise for predicting disease progression, yet their development faces a fundamental challenge: limited high-qualit...

2060
[2]

MATERIALS AND EXPERIMENTS 2.1. Dataset and Preprocessing We utilized the TADPOLE dataset, derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), a comprehen- sive longitudinal study containing clinical, imaging, and biomarker data from 1,737 participants[4]. The dataset in- cludes multiple visit observations spanning up to 10 years, with mea...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

XGBoost achieved the highest AUC score of 0.901, followed closely by TabPFN at 0.892

RESULTS AND DISCUSSIONS Figure 1 presents the overall performance of all models on the holdout validation set. XGBoost achieved the highest AUC score of 0.901, followed closely by TabPFN at 0.892. Random Forest achieved 0.888, while LightGBM and Lo- gistic Regression performed comparably at 0.860 and 0.859 respectively. These results indicate that both tu...
[4]

CONCLUSIONS This study provides a systematic evaluation of TabPFN, a foundation model for tabular data, for predicting MCI-to-AD conversion using biomarker features from the TADPOLE dataset. Our results demonstrate that foundation models of- fer meaningful advantages in data-limited clinical scenarios while also revealing important practical consideration...
[5]

Metrics for multiclass classification: An overview,

“Metrics for multiclass classification: An overview,” 2020

2020
[6]

Xgboost, a novel explainable ai technique, in the prediction of myocardial infarction: A uk biobank cohort study,

A. Moore and M. Bell, “Xgboost, a novel explainable ai technique, in the prediction of myocardial infarction: A uk biobank cohort study,”Clinical Medicine Insights: Cardiology, vol. 16, pp. 117954682211336, Jan 2022

2022
[7]

Navigating data scarcity using foundation models: A benchmark of few- shot and zero-shot learning approaches in medical imag- ing,

S. Woerner and C. F. Baumgartner, “Navigating data scarcity using foundation models: A benchmark of few- shot and zero-shot learning approaches in medical imag- ing,”arXiv preprint arXiv:2408.08058, 2024

work page arXiv 2024
[8]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

N. Hollmann, S. M ¨uller, K. Eggensperger, and F. Hut- ter, “Tabpfn: A transformer that solves small tabu- lar classification problems in a second,”arXiv preprint arXiv:2207.01848, 2023

work page internal anchor Pith review arXiv 2023
[9]

Research progress in predicting the conver- sion from mild cognitive impairment to alzheimer’s dis- ease via multimodal mri and artificial intelligence,

Min Ai, Yu Liu, Dan Liu, Chengxi Yan, Xia Wang, and Xun Chen, “Research progress in predicting the conver- sion from mild cognitive impairment to alzheimer’s dis- ease via multimodal mri and artificial intelligence,”Fron- tiers in Neurology, vol. 16, pp. 1596632, 2025

2025
[10]

A systematic review of the barriers to the implementation of artificial intelligence in healthcare,

M. I. Ahmed, B. Spooner, J. Isherwood, M. A. Lane, E. Orrock, and A. Dennison, “A systematic review of the barriers to the implementation of artificial intelligence in healthcare,”Cureus, vol. 15, no. 10, 2023

2023
[11]

Tadpole challenge: Accu- rate alzheimer’s disease prediction through crowdsourced forecasting of future data,

R. V . Marinescu et al., “Tadpole challenge: Accu- rate alzheimer’s disease prediction through crowdsourced forecasting of future data,” inLecture Notes in Computer Science. Springer, 2019, vol. 11843, pp. 1–10

2019
[12]

2024 alzheimer’s disease facts and figures,

Alzheimer’s Association, “2024 alzheimer’s disease facts and figures,”Alzheimer’s & Dementia, vol. 20, no. 5, pp. 3708–3821, Apr 2024

2024
[13]

Bridging the gap: Missing data imputation methods and their effect on dementia classification performance,

F. Aracri, M. G. Bianco, A. Quattrone, and A. Sarica, “Bridging the gap: Missing data imputation methods and their effect on dementia classification performance,” Brain Sciences, vol. 15, no. 6, pp. 639, Jun 2025

2025