Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

Brad Ye; Bulent Soykan; Gulsah Hancerliogullari Koksalmis; Hsin-Hsiung Huang; Laura J. Brattain

arxiv: 2604.27195 · v2 · pith:M6ZINE3Ynew · submitted 2026-04-29 · 💻 cs.AI

Evaluating TabPFN for Mild Cognitive Impairment to Alzheimer's Disease Conversion in Data Limited Settings

Brad Ye , Bulent Soykan , Gulsah Hancerliogullari Koksalmis , Hsin-Hsiung Huang , Laura J. Brattain This is my paper

Pith reviewed 2026-05-21 09:17 UTC · model grok-4.3

classification 💻 cs.AI

keywords TabPFNMCI to AD conversionAlzheimer's predictionlow-data machine learningtabular foundation modelsTADPOLE datasetmultimodal biomarkersearly intervention

0 comments

The pith

TabPFN, a pre-trained foundation model for tabular data, predicts MCI to Alzheimer's conversion with AUC 0.892 and stays effective when training data drops to 50 samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests TabPFN against standard machine learning algorithms for forecasting which patients with mild cognitive impairment will develop Alzheimer's disease within three years. It draws on multimodal features including demographics, APOE4 genotype, MRI volumes, CSF markers, and PET scans from the TADPOLE dataset. TabPFN reaches an AUC of 0.892 overall and maintains performance at small training sizes where methods such as LightGBM, XGBoost, and random forests lose accuracy. A sympathetic reader would care because reliable predictions in data-scarce medical settings could support earlier interventions without requiring thousands of longitudinal cases.

Core claim

TabPFN achieves an AUC of 0.892 for three-year MCI to AD conversion on the TADPOLE dataset and outperforms LightGBM at 0.860. The model sustains strong results at N=50 training samples while traditional approaches decline, using features from demographics, APOE4, MRI, CSF, and PET across training sizes from 50 to 1000. These results indicate that pre-trained tabular foundation models can address data limitations common in Alzheimer's research.

What carries the argument

TabPFN, a tabular pre-trained foundation network that uses prior exposure to large synthetic tabular datasets to learn effectively from small real tabular inputs with minimal tuning.

If this is right

Clinicians could apply TabPFN-style models for early risk assessment in memory clinics that collect only modest numbers of patient records.
Alzheimer's studies would require fewer longitudinal cases to build usable predictors.
The same pre-trained approach may transfer to forecasting other neurodegenerative conditions with limited data.
Performance gains hold as training size scales from 50 up to 1000 samples.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

External validation on diverse populations outside ADNI could reveal whether the low-data benefit persists across ethnic and geographic groups.
Combining TabPFN outputs with emerging blood-based biomarkers might further raise accuracy without adding imaging costs.
Real-world deployment would still need explicit strategies for handling incomplete scans or lab results that the current evaluation does not detail.

Load-bearing premise

The multimodal biomarker features drawn from MRI, CSF, PET, and demographics are assumed to be consistently complete and high-quality across the TADPOLE samples without substantial missing values or preprocessing complications.

What would settle it

TabPFN would lose its claimed advantage if an independent test set showed its AUC falling below LightGBM's when both are trained on only 50 samples from a new cohort with different missing-data patterns.

read the original abstract

Accurate prediction of conversion from Mild Cognitive Impairment (MCI) to Alzheimers Diseases (AD) is essential for early intervention, however, developing reliable conversion predictive models is difficult to develop due to limited longitudinal data availability We evaluate TabPFN (Tabular Pre-Trained Foundation Network) against traditional machine learning methods for predicting 3 year MCI to AD conversion using the TADPOLE dataset derived from ADNI. Using multimodal biomarker features extracted from demographics, APOE4, MRI volumes, CSF markers, and PET imaging, we conducted an experimental comparison across varying training set sizes (N=50 to 1000) and models including XGBoost, Random Forest, LightGBM, and Logistic Regression. TabPFN achieved one the highest performance (AUC=0.892), outperforming LightGBM (AUC=0.860) and demonstrating advantages in low data settings. At N=50 training samples, TabPFN maintained strong AUC while the traditional machine learning models struggles at small training samples. These findings demonstrate that foundation models are promising for disease prediction in data limited scenarios, such as Alzheimers diseases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TabPFN holds performance better than baselines at N=50 on the TADPOLE MCI-to-AD task, but the lack of detail on missing CSF/PET values and validation makes the low-sample advantage hard to trust.

read the letter

The main point is that TabPFN keeps a higher AUC than LightGBM and the other baselines when training data drops to 50 samples for 3-year MCI-to-AD conversion on TADPOLE. The paper reports 0.892 overall versus 0.860 for LightGBM, with the gap clearest at the smallest N. That is the practical observation worth passing along to anyone working on tabular models in medical data settings with limited labels. They run a direct comparison using multimodal features from demographics, APOE4, MRI volumes, CSF, and PET, testing across training sizes from 50 to 1000 against XGBoost, Random Forest, LightGBM, and logistic regression. The sample-size sweep is the part that actually adds something concrete. It shows the pre-trained model staying stable while the tree-based methods drop off, which matches the usual story about foundation models helping in low-data regimes. The work is an application of an existing model rather than a new method or derivation, so the novelty stays modest. The soft spot is the missing-data handling. ADNI-derived data has high rates of missing CSF and PET entries, yet nothing is said about imputation, complete-case analysis, or how missingness patterns were managed. At N=50 that choice can easily change which model ranks first, so the reported robustness could trace to an unstated preprocessing step instead of the model itself. The abstract also skips cross-validation details, statistical testing, and exact splits, which leaves the numbers difficult to reproduce from the text alone. This is the kind of paper that would interest people already following tabular foundation models or Alzheimer's prediction work. It supplies one more data point on low-sample behavior. A serious editor could send it for review once the methods section adds the missing preprocessing and validation steps, because the core comparison is direct and the question is practical even if the current write-up needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates TabPFN against XGBoost, Random Forest, LightGBM, and Logistic Regression for predicting 3-year MCI-to-AD conversion on the TADPOLE dataset derived from ADNI. Using multimodal features (demographics, APOE4, MRI volumes, CSF, PET), it reports performance across training sizes N=50 to 1000, with TabPFN reaching AUC 0.892 (vs. LightGBM 0.860) and retaining strong performance at N=50 where baselines degrade.

Significance. If the reported low-data advantage holds after methodological clarification, the work provides useful empirical evidence that tabular foundation models can be effective for Alzheimer's prediction tasks where sample sizes are small, a common constraint in longitudinal biomarker studies.

major comments (2)

[Dataset and Feature Extraction] The manuscript provides no description of missing-data handling for CSF and PET biomarkers. Given that ADNI/TADPOLE data typically exhibit high missingness rates in these modalities, the absence of details on imputation, complete-case selection, or feature-wise missingness patterns renders the N=50 performance claims vulnerable to preprocessing artifacts that may interact differently with TabPFN's prior than with tree-based baselines.
[Experimental Evaluation] The experimental section supplies no information on the cross-validation procedure, exact train-test split strategy, hyperparameter search, or statistical testing used to support the AUC comparisons. Without these, the headline result (TabPFN AUC 0.892 vs. LightGBM 0.860) cannot be independently verified or attributed unambiguously to model properties.

minor comments (2)

[Abstract] Abstract contains grammatical issues: 'one the highest' should read 'one of the highest'; 'struggles' should be 'struggle'; 'Alzheimers diseases' should be 'Alzheimer's disease'.
[Dataset and Feature Extraction] The paper would benefit from an explicit statement of the total number of subjects and the distribution of missing values per modality before any performance tables are presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments identify important gaps in methodological transparency that we will address in the revision. Below we respond point-by-point to the major comments.

read point-by-point responses

Referee: [Dataset and Feature Extraction] The manuscript provides no description of missing-data handling for CSF and PET biomarkers. Given that ADNI/TADPOLE data typically exhibit high missingness rates in these modalities, the absence of details on imputation, complete-case selection, or feature-wise missingness patterns renders the N=50 performance claims vulnerable to preprocessing artifacts that may interact differently with TabPFN's prior than with tree-based baselines.

Authors: We agree that the current version of the manuscript omits explicit details on missing-data handling, which is a valid concern given the known missingness patterns in ADNI-derived datasets. We will add a dedicated preprocessing subsection in the Methods that reports feature-wise missingness rates, the imputation approach employed (consistent across all models), and whether complete-case analysis was used for any modality. This addition will allow readers to assess whether preprocessing choices could differentially affect TabPFN versus the baselines. revision: yes
Referee: [Experimental Evaluation] The experimental section supplies no information on the cross-validation procedure, exact train-test split strategy, hyperparameter search, or statistical testing used to support the AUC comparisons. Without these, the headline result (TabPFN AUC 0.892 vs. LightGBM 0.860) cannot be independently verified or attributed unambiguously to model properties.

Authors: We acknowledge that the experimental protocol is under-specified in the present manuscript, limiting independent verification. In the revised version we will expand the Experimental Evaluation section to describe the cross-validation scheme (stratified k-fold), the train-test partitioning procedure (including subject-level constraints to avoid leakage), the hyperparameter search strategy for each baseline, and the statistical tests or confidence-interval methods used for the reported AUC differences. These clarifications will strengthen the attribution of performance gains to model characteristics rather than experimental choices. revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical model comparison on external dataset

full rationale

The manuscript reports standard supervised learning experiments that train models on subsets of the TADPOLE/ADNI cohort and evaluate AUC on held-out test subjects. All reported numbers (AUC=0.892 for TabPFN, AUC=0.860 for LightGBM, performance at N=50) are computed directly from the data splits and model outputs; none are obtained by fitting a parameter to the target metric and then relabeling it as a prediction. No equations, uniqueness theorems, or ansatzes are introduced, and the central claim rests on external baselines rather than self-citation chains. The evaluation is therefore self-contained against the public dataset and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The comparison rests on the external TADPOLE/ADNI dataset and the pre-trained TabPFN weights; no new free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Standard machine-learning assumptions of independent and identically distributed train-test splits and appropriate use of AUC as a performance metric hold for this medical prediction task.
Invoked implicitly by any comparative evaluation on tabular clinical data.

pith-pipeline@v0.9.0 · 5752 in / 1191 out tokens · 46997 ms · 2026-05-21T09:17:21.165416+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

TabPFN achieved one the highest performance (AUC=0.892), outperforming LightGBM (AUC=0.860) ... At N=50 training samples, TabPFN maintained strong AUC
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Missing values were imputed using median imputation based on training set statistics

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

Machine learning models show promise for predicting disease progression, yet their development faces a fundamental challenge: limited high-quality train- ing data

INTRODUCTION The number of Americans living with Alzheimer’s Disease (AD) is projected to reach 13.8 million by 2060[1], under- scoring the urgent need for improved early detection and in- tervention strategies. Machine learning models show promise for predicting disease progression, yet their development faces a fundamental challenge: limited high-qualit...

work page 2060
[2]

MATERIALS AND EXPERIMENTS 2.1. Dataset and Preprocessing We utilized the TADPOLE dataset, derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), a comprehen- sive longitudinal study containing clinical, imaging, and biomarker data from 1,737 participants[4]. The dataset in- cludes multiple visit observations spanning up to 10 years, with mea...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

XGBoost achieved the highest AUC score of 0.901, followed closely by TabPFN at 0.892

RESULTS AND DISCUSSIONS Figure 1 presents the overall performance of all models on the holdout validation set. XGBoost achieved the highest AUC score of 0.901, followed closely by TabPFN at 0.892. Random Forest achieved 0.888, while LightGBM and Lo- gistic Regression performed comparably at 0.860 and 0.859 respectively. These results indicate that both tu...

work page
[4]

CONCLUSIONS This study provides a systematic evaluation of TabPFN, a foundation model for tabular data, for predicting MCI-to-AD conversion using biomarker features from the TADPOLE dataset. Our results demonstrate that foundation models of- fer meaningful advantages in data-limited clinical scenarios while also revealing important practical consideration...

work page
[5]

Metrics for multiclass classification: An overview,

“Metrics for multiclass classification: An overview,” 2020

work page 2020
[6]

Xgboost, a novel explainable ai technique, in the prediction of myocardial infarction: A uk biobank cohort study,

A. Moore and M. Bell, “Xgboost, a novel explainable ai technique, in the prediction of myocardial infarction: A uk biobank cohort study,”Clinical Medicine Insights: Cardiology, vol. 16, pp. 117954682211336, Jan 2022

work page 2022
[7]

Navigating data scarcity using foundation models: A benchmark of few- shot and zero-shot learning approaches in medical imag- ing,

S. Woerner and C. F. Baumgartner, “Navigating data scarcity using foundation models: A benchmark of few- shot and zero-shot learning approaches in medical imag- ing,”arXiv preprint arXiv:2408.08058, 2024

work page arXiv 2024
[8]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

N. Hollmann, S. M ¨uller, K. Eggensperger, and F. Hut- ter, “Tabpfn: A transformer that solves small tabu- lar classification problems in a second,”arXiv preprint arXiv:2207.01848, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Research progress in predicting the conver- sion from mild cognitive impairment to alzheimer’s dis- ease via multimodal mri and artificial intelligence,

Min Ai, Yu Liu, Dan Liu, Chengxi Yan, Xia Wang, and Xun Chen, “Research progress in predicting the conver- sion from mild cognitive impairment to alzheimer’s dis- ease via multimodal mri and artificial intelligence,”Fron- tiers in Neurology, vol. 16, pp. 1596632, 2025

work page 2025
[10]

A systematic review of the barriers to the implementation of artificial intelligence in healthcare,

M. I. Ahmed, B. Spooner, J. Isherwood, M. A. Lane, E. Orrock, and A. Dennison, “A systematic review of the barriers to the implementation of artificial intelligence in healthcare,”Cureus, vol. 15, no. 10, 2023

work page 2023
[11]

Tadpole challenge: Accu- rate alzheimer’s disease prediction through crowdsourced forecasting of future data,

R. V . Marinescu et al., “Tadpole challenge: Accu- rate alzheimer’s disease prediction through crowdsourced forecasting of future data,” inLecture Notes in Computer Science. Springer, 2019, vol. 11843, pp. 1–10

work page 2019
[12]

2024 alzheimer’s disease facts and figures,

Alzheimer’s Association, “2024 alzheimer’s disease facts and figures,”Alzheimer’s & Dementia, vol. 20, no. 5, pp. 3708–3821, Apr 2024

work page 2024
[13]

Bridging the gap: Missing data imputation methods and their effect on dementia classification performance,

F. Aracri, M. G. Bianco, A. Quattrone, and A. Sarica, “Bridging the gap: Missing data imputation methods and their effect on dementia classification performance,” Brain Sciences, vol. 15, no. 6, pp. 639, Jun 2025

work page 2025

[1] [1]

Machine learning models show promise for predicting disease progression, yet their development faces a fundamental challenge: limited high-quality train- ing data

INTRODUCTION The number of Americans living with Alzheimer’s Disease (AD) is projected to reach 13.8 million by 2060[1], under- scoring the urgent need for improved early detection and in- tervention strategies. Machine learning models show promise for predicting disease progression, yet their development faces a fundamental challenge: limited high-qualit...

work page 2060

[2] [2]

MATERIALS AND EXPERIMENTS 2.1. Dataset and Preprocessing We utilized the TADPOLE dataset, derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), a comprehen- sive longitudinal study containing clinical, imaging, and biomarker data from 1,737 participants[4]. The dataset in- cludes multiple visit observations spanning up to 10 years, with mea...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

XGBoost achieved the highest AUC score of 0.901, followed closely by TabPFN at 0.892

RESULTS AND DISCUSSIONS Figure 1 presents the overall performance of all models on the holdout validation set. XGBoost achieved the highest AUC score of 0.901, followed closely by TabPFN at 0.892. Random Forest achieved 0.888, while LightGBM and Lo- gistic Regression performed comparably at 0.860 and 0.859 respectively. These results indicate that both tu...

work page

[4] [4]

CONCLUSIONS This study provides a systematic evaluation of TabPFN, a foundation model for tabular data, for predicting MCI-to-AD conversion using biomarker features from the TADPOLE dataset. Our results demonstrate that foundation models of- fer meaningful advantages in data-limited clinical scenarios while also revealing important practical consideration...

work page

[5] [5]

Metrics for multiclass classification: An overview,

“Metrics for multiclass classification: An overview,” 2020

work page 2020

[6] [6]

Xgboost, a novel explainable ai technique, in the prediction of myocardial infarction: A uk biobank cohort study,

A. Moore and M. Bell, “Xgboost, a novel explainable ai technique, in the prediction of myocardial infarction: A uk biobank cohort study,”Clinical Medicine Insights: Cardiology, vol. 16, pp. 117954682211336, Jan 2022

work page 2022

[7] [7]

Navigating data scarcity using foundation models: A benchmark of few- shot and zero-shot learning approaches in medical imag- ing,

S. Woerner and C. F. Baumgartner, “Navigating data scarcity using foundation models: A benchmark of few- shot and zero-shot learning approaches in medical imag- ing,”arXiv preprint arXiv:2408.08058, 2024

work page arXiv 2024

[8] [8]

TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

N. Hollmann, S. M ¨uller, K. Eggensperger, and F. Hut- ter, “Tabpfn: A transformer that solves small tabu- lar classification problems in a second,”arXiv preprint arXiv:2207.01848, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Research progress in predicting the conver- sion from mild cognitive impairment to alzheimer’s dis- ease via multimodal mri and artificial intelligence,

Min Ai, Yu Liu, Dan Liu, Chengxi Yan, Xia Wang, and Xun Chen, “Research progress in predicting the conver- sion from mild cognitive impairment to alzheimer’s dis- ease via multimodal mri and artificial intelligence,”Fron- tiers in Neurology, vol. 16, pp. 1596632, 2025

work page 2025

[10] [10]

A systematic review of the barriers to the implementation of artificial intelligence in healthcare,

M. I. Ahmed, B. Spooner, J. Isherwood, M. A. Lane, E. Orrock, and A. Dennison, “A systematic review of the barriers to the implementation of artificial intelligence in healthcare,”Cureus, vol. 15, no. 10, 2023

work page 2023

[11] [11]

Tadpole challenge: Accu- rate alzheimer’s disease prediction through crowdsourced forecasting of future data,

R. V . Marinescu et al., “Tadpole challenge: Accu- rate alzheimer’s disease prediction through crowdsourced forecasting of future data,” inLecture Notes in Computer Science. Springer, 2019, vol. 11843, pp. 1–10

work page 2019

[12] [12]

2024 alzheimer’s disease facts and figures,

Alzheimer’s Association, “2024 alzheimer’s disease facts and figures,”Alzheimer’s & Dementia, vol. 20, no. 5, pp. 3708–3821, Apr 2024

work page 2024

[13] [13]

Bridging the gap: Missing data imputation methods and their effect on dementia classification performance,

F. Aracri, M. G. Bianco, A. Quattrone, and A. Sarica, “Bridging the gap: Missing data imputation methods and their effect on dementia classification performance,” Brain Sciences, vol. 15, no. 6, pp. 639, Jun 2025

work page 2025