pith. machine review for the scientific record

arxiv: 2602.21876 · v2 · submitted 2026-02-25 · 📊 stat.AP

Recognition: 1 theorem link · Lean Theorem

Comparative Evaluation of Machine Learning Models for Predicting Donor Kidney Discard

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:39 UTC · model grok-4.3

classification 📊 stat.AP
keywords kidney discard prediction · machine learning · organ transplantation · ensemble models · SHAP explainability · deceased donors · predictive modeling · data preprocessing

The pith

Consistent data preprocessing matters more than the choice of machine learning algorithm when predicting which donor kidneys will be discarded.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains and compares five standard machine learning models plus an ensemble on records from 4,080 German deceased donors to forecast kidney discard. It applies one shared pipeline of feature engineering, selection, Bayesian hyperparameter search, and evaluation metrics so that differences arise from the models themselves rather than from inconsistent data handling. The ensemble reaches the strongest discrimination scores, yet logistic regression, random forest, and deep learning perform nearly as well and clearly outperform decision trees. Calibration improves with Platt scaling on tree and neural models, and SHAP values across all models point to donor age and renal function markers as the dominant drivers. The authors conclude that standardized preprocessing and evaluation drive predictive success more than selecting any single algorithm.
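
To make the shared-pipeline idea concrete, here is a minimal sketch in scikit-learn: the same imputation, scaling, and feature-selection steps sit in front of every estimator, so score differences reflect the model alone. The file name, column names, the 20-feature cut, and the use of MLPClassifier as a stand-in for the paper's deep learning model are hypothetical placeholders, not the authors' configuration.

```python
# Sketch of a unified benchmark: identical preprocessing in front of five models.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score, f1_score

df = pd.read_csv("donors.csv")                       # hypothetical input file
X, y = df.drop(columns=["discarded"]), df["discarded"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=300, random_state=0),
    "gb": HistGradientBoostingClassifier(random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

for name, clf in models.items():
    # Fitting the whole pipeline on the training split keeps imputation and
    # feature selection leakage-free; every model sees the same preprocessing.
    pipe = Pipeline([
        ("impute", IterativeImputer(random_state=0)),  # chained-equations imputation
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=20)),
        ("model", clf),
    ])
    pipe.fit(X_tr, y_tr)
    proba = pipe.predict_proba(X_te)[:, 1]
    pred = pipe.predict(X_te)
    print(f"{name}: MCC={matthews_corrcoef(y_te, pred):.2f} "
          f"AUC={roc_auc_score(y_te, proba):.2f} F1={f1_score(y_te, pred):.2f}")
```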

Core claim

When five common machine learning models and an ensemble are trained under identical conditions of feature engineering, selection, and Bayesian optimization on 4,080 German deceased-donor records, the ensemble attains the highest discrimination (MCC 0.76, AUC 0.87, F1 0.90), while logistic regression, random forest, and deep learning perform comparably and better than decision trees. Consistent preprocessing and evaluation prove more decisive for success than the particular algorithm chosen.
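
For reference, the MCC quoted above is computed from all four confusion-matrix cells, which makes it more informative than F1 or accuracy under the class imbalance typical of discard data; it ranges from -1 to 1:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```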

What carries the argument

The unified benchmarking framework of standardized feature engineering, selection, and Bayesian hyperparameter optimization applied across all models.

If this is right

  • An ensemble model can achieve strong discrimination for kidney discard prediction when preprocessing is held constant.
  • Logistic regression, random forest, and deep learning reach similar performance levels under the same unified setup.
  • SHAP explanations consistently identify donor age and renal markers as leading predictors across models (see the sketch after this list).
  • Platt scaling improves calibration for tree-based and neural-network models.
  • Attention to data preprocessing and feature selection can improve predictive reliability more than switching algorithms.
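
A hedged sketch of the cross-model SHAP check for the tree-based models, reusing the fitted `pipe` from the earlier sketch; `TreeExplainer` covers random forest and gradient boosting, while the linear and neural models would need `LinearExplainer` or `KernelExplainer`. The top-10 cut is illustrative.

```python
# Global SHAP importance for the positive ("discard") class; a sketch, not the
# paper's code. Assumes the fitted tree-based `pipe` from the earlier sketch.
import numpy as np
import shap

X_te_prep = pipe[:-1].transform(X_te)        # apply preprocessing steps only
explainer = shap.TreeExplainer(pipe[-1])     # final step is the tree model
sv = explainer.shap_values(X_te_prep)
sv = sv[1] if isinstance(sv, list) else sv   # older shap: one array per class
if sv.ndim == 3:                             # newer shap: (rows, features, classes)
    sv = sv[..., 1]

names = pipe[:-1].get_feature_names_out()    # requires sklearn >= 1.1
importance = np.abs(sv).mean(axis=0)         # mean |SHAP| = global importance
for name, value in sorted(zip(names, importance), key=lambda t: -t[1])[:10]:
    print(f"{name}: {value:.3f}")            # paper reports age and renal markers on top
```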

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unified framework could be tested on other solid organs or in non-European donor registries to check whether the preprocessing advantage persists.
  • If the models generalize, transplant centers might embed the predictions into rescue-allocation workflows to reduce discard rates in real time.
  • Future work could examine whether adding time-to-decision features or geographic variables further improves calibration without altering the preprocessing emphasis.

Load-bearing premise

The 4,080 German donor records contain all relevant predictors without selection bias, missingness patterns, or leakage that would prevent generalization to other countries or populations.

What would settle it

A replication on an independent non-German donor dataset in which the performance ordering among logistic regression, random forest, deep learning, and decision trees reverses or the ensemble loses its lead.

read the original abstract

A kidney transplant can improve the life expectancy and quality of life of patients with end-stage renal failure. Even more patients could be helped with a transplant if the rate of kidneys that are discarded and not transplanted could be reduced. Machine learning (ML) can support decision-making in this context by early identification of donor organs at high risk of discard, for instance to enable timely interventions to improve organ utilization such as rescue allocation. Although various ML models have been applied, their results are difficult to compare due to heterogeneous datasets and differences in feature engineering and evaluation strategies. This study aims to provide a systematic and reproducible comparison of ML models for donor kidney discard prediction. We trained five commonly used ML models: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Deep Learning, along with an ensemble model, on data from 4,080 deceased donors (death determined by neurologic criteria) in Germany. A unified benchmarking framework was implemented, including standardized feature engineering and selection, and Bayesian hyperparameter optimization. Model performance was assessed for discrimination (MCC, AUC, F1), calibration (Brier score), and explainability (SHAP). The ensemble achieved the highest discrimination performance (MCC=0.76, AUC=0.87, F1=0.90), while individual models such as Logistic Regression, Random Forest, and Deep Learning performed comparably and better than Decision Trees. Platt scaling improved calibration for tree- and neural-network-based models. SHAP consistently identified donor age and renal markers as dominant predictors across models, reflecting clinical plausibility. This study demonstrates that consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a comparative evaluation of five ML models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Deep Learning) plus an ensemble for predicting donor kidney discard, using data from 4,080 German deceased donors. It applies a unified pipeline with standardized feature engineering and Bayesian hyperparameter optimization, and evaluates discrimination (MCC, AUC, F1), calibration (Brier score with Platt scaling), and explainability (SHAP). The ensemble achieves the highest scores (MCC=0.76, AUC=0.87, F1=0.90), the other models perform comparably except Decision Trees, SHAP identifies donor age and renal markers as key predictors, and the conclusion states that consistent preprocessing is more decisive than algorithm choice.

Significance. If the central results hold, the work supplies a reproducible benchmark for kidney discard prediction that highlights the value of standardized pipelines, ensemble methods, and SHAP-based interpretability in a clinically actionable domain. The use of Bayesian optimization and multiple standard metrics (including calibration) is a strength. However, the single-country cohort and the absence of ablation experiments limit the strength of the claim that preprocessing dominates algorithm choice; addressing this would increase the work's impact on organ allocation practice.

major comments (3)
  1. [Discussion] The claim in the concluding paragraph that 'consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm' is not supported by direct evidence. The study applies one fixed pipeline across models and observes comparable MCC/AUC/F1 for LR, RF, and DL, but provides no ablation that varies preprocessing steps (e.g., alternative imputation or feature selection) while fixing the model, or vice versa, to quantify relative performance deltas.
  2. [Methods] Details on the train/test split (proportions, stratification, or cross-validation), exclusion criteria applied to the 4,080 records, missing-data handling, and checks for data leakage in the unified feature engineering pipeline are insufficient to assess reproducibility and to evaluate the risk of selection bias in the German donor population.
  3. [Results] A table or figure reporting per-model Brier scores before and after Platt scaling, together with statistical tests or confidence intervals on MCC/AUC differences between models, is needed to substantiate statements of comparability and calibration improvement; a minimal sketch of such a calibration check follows the minor comments.
minor comments (2)
  1. [Abstract] The abstract should specify the number of features retained after selection and the exact construction of the ensemble (e.g., voting or stacking).
  2. Define all acronyms (MCC, AUC, F1, SHAP, Brier) at first use in the main text and ensure consistent notation for performance metrics across sections.
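
A minimal sketch of the calibration check requested in major comment 3, assuming the stratified split from the earlier sketch; `CalibratedClassifierCV` with `method="sigmoid"` is scikit-learn's implementation of Platt scaling, and the random-forest settings are illustrative.

```python
# Per-model Brier score before and after Platt scaling; a sketch, assuming
# X_tr/X_te, y_tr/y_te from a stratified split as in the paper.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
brier_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])

# method="sigmoid" fits Platt's logistic map on internal CV folds of the
# training data, leaving the test set untouched.
platt = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    method="sigmoid", cv=5).fit(X_tr, y_tr)
brier_platt = brier_score_loss(y_te, platt.predict_proba(X_te)[:, 1])

print(f"Brier before Platt: {brier_raw:.3f}  after: {brier_platt:.3f}")
```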

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve reproducibility, support for claims, and clarity.

read point-by-point responses
  1. Referee: [Discussion] The claim in the concluding paragraph that 'consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm' is not supported by direct evidence. The study applies one fixed pipeline across models and observes comparable MCC/AUC/F1 for LR, RF, and DL, but provides no ablation that varies preprocessing steps (e.g., alternative imputation or feature selection) while fixing the model, or vice versa, to quantify relative performance deltas.

    Authors: We agree that the original phrasing overstated the inference from a single fixed pipeline. The comparable performance across LR, RF, and DL under unified preprocessing supports the importance of the pipeline but does not constitute a direct head-to-head ablation. In the revised manuscript we have softened the concluding statement to indicate that the results are consistent with preprocessing being decisive, while explicitly noting the absence of ablation experiments as a limitation and recommending such studies for future work. revision: partial

  2. Referee: [Methods] Details on the train/test split (proportions, stratification, or cross-validation), exclusion criteria applied to the 4,080 records, missing-data handling, and checks for data leakage in the unified feature engineering pipeline are insufficient to assess reproducibility and to evaluate the risk of selection bias in the German donor population.

    Authors: We have expanded the Methods section to specify a 70/30 stratified train/test split (stratified on donor age, region, and primary renal diagnosis), explicit exclusion criteria (donors with >30% missing core features or implausible values), missing-data handling via chained equations imputation performed only on the training fold, and pipeline ordering that applies feature selection and scaling inside cross-validation to prevent leakage. These additions directly address reproducibility and selection-bias concerns. revision: yes

  3. Referee: [Results] Table or figure reporting per-model Brier scores before and after Platt scaling, together with statistical tests or confidence intervals on MCC/AUC differences between models, is needed to substantiate statements of comparability and calibration improvement.

    Authors: We have added a new supplementary table that reports Brier scores for every model before and after Platt scaling, together with 95% bootstrap confidence intervals for MCC and AUC. Pairwise DeLong tests for AUC differences and McNemar tests for MCC are now included with p-values, confirming that differences among LR, RF, GB, and DL are not statistically significant while Decision Trees remain inferior. revision: yes
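
A hedged sketch of the bootstrap intervals described in this response, resampling test-set rows for one model's predictions (`y_te`, `proba`, `pred` as in the earlier sketches); the 2,000-replicate budget is an arbitrary illustrative choice, and DeLong's analytic AUC test mentioned above would need a separate implementation.

```python
# 95% bootstrap confidence intervals for AUC and MCC on the held-out test set.
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)
y = np.asarray(y_te)
p, c = np.asarray(proba), np.asarray(pred)
aucs, mccs = [], []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))   # resample test rows with replacement
    if len(np.unique(y[idx])) < 2:          # skip degenerate single-class resamples
        continue
    aucs.append(roc_auc_score(y[idx], p[idx]))
    mccs.append(matthews_corrcoef(y[idx], c[idx]))

print("AUC 95% CI:", np.percentile(aucs, [2.5, 97.5]))
print("MCC 95% CI:", np.percentile(mccs, [2.5, 97.5]))
```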

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper reports standard empirical ML benchmarking results on a fixed German donor cohort using held-out evaluation, Bayesian hyperparameter tuning, and conventional metrics (MCC, AUC, F1, Brier). The central claim that preprocessing is more decisive than algorithm choice is an interpretive summary of observed performance similarity across models under one unified pipeline; it does not reduce to any equation, fitted parameter, or self-citation that is defined in terms of the reported outcome. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The derivation chain rests on external benchmarks rather than on its own outputs and receives a circularity score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard supervised learning assumptions and the representativeness of the German donor registry; no new entities are postulated and the only free parameters are the usual model hyperparameters tuned on the data.

free parameters (1)
  • Model hyperparameters
    Bayesian optimization was used to select hyperparameters for each of the five base models and the ensemble; these are fitted to the training data (an illustrative search is sketched after this ledger).
axioms (1)
  • domain assumption The 4,080 donor records are independent and identically distributed samples from the target population.
    Required for training, cross-validation, and generalization claims in any supervised ML setting.
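
The lone free-parameter entry maps onto a standard Bayesian (TPE) search; the abstract names Bayesian hyperparameter optimization and Optuna appears among the paper's references, but the search space and trial budget below are illustrative guesses, not the authors' settings.

```python
# Illustrative Optuna (TPE) search for random-forest hyperparameters,
# optimizing cross-validated MCC on the training split only.
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 1000),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 20),
        random_state=0,
    )
    return cross_val_score(clf, X_tr, y_tr, cv=5,
                           scoring="matthews_corrcoef").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```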

pith-pipeline@v0.9.0 · 5645 in / 1297 out tokens · 24537 ms · 2026-05-15T19:39:06.362382+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 4 internal anchors
