pith. machine review for the scientific record

arxiv: 2602.21876 · v2 · submitted 2026-02-25 · 📊 stat.AP

Recognition: 1 theorem link · Lean Theorem

Comparative Evaluation of Machine Learning Models for Predicting Donor Kidney Discard

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:39 UTC · model grok-4.3

classification 📊 stat.AP
keywords kidney discard prediction · machine learning · organ transplantation · ensemble models · SHAP explainability · deceased donors · predictive modeling · data preprocessing

The pith

Consistent data preprocessing matters more than the choice of machine learning algorithm when predicting which donor kidneys will be discarded.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains and compares five standard machine learning models plus an ensemble on records from 4,080 German deceased donors to forecast kidney discard. It applies one shared pipeline of feature engineering, selection, Bayesian hyperparameter search, and evaluation metrics so that differences arise from the models themselves rather than from inconsistent data handling. The ensemble reaches the strongest discrimination scores, yet logistic regression, random forest, and deep learning perform nearly as well and clearly outperform decision trees. Calibration improves with Platt scaling on tree and neural models, and SHAP values across all models point to donor age and renal function markers as the dominant drivers. The authors conclude that standardized preprocessing and evaluation drive predictive success more than selecting any single algorithm.
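
To make the shared-pipeline idea concrete, here is a minimal sketch in scikit-learn: the same imputation, scaling, and feature-selection steps sit in front of every estimator, so score differences reflect the model alone. The file name, column names, the 20-feature cut, and the use of MLPClassifier as a stand-in for the paper's deep learning model are hypothetical placeholders, not the authors' configuration.

```python
# Sketch of a unified benchmark: identical preprocessing in front of five models.
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import matthews_corrcoef, roc_auc_score, f1_score

df = pd.read_csv("donors.csv")                       # hypothetical input file
X, y = df.drop(columns=["discarded"]), df["discarded"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=300, random_state=0),
    "gb": HistGradientBoostingClassifier(random_state=0),
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
}

for name, clf in models.items():
    # Fitting the whole pipeline on the training split keeps imputation and
    # feature selection leakage-free; every model sees the same preprocessing.
    pipe = Pipeline([
        ("impute", IterativeImputer(random_state=0)),  # chained-equations imputation
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=20)),
        ("model", clf),
    ])
    pipe.fit(X_tr, y_tr)
    proba = pipe.predict_proba(X_te)[:, 1]
    pred = pipe.predict(X_te)
    print(f"{name}: MCC={matthews_corrcoef(y_te, pred):.2f} "
          f"AUC={roc_auc_score(y_te, proba):.2f} F1={f1_score(y_te, pred):.2f}")
```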

Core claim

When five common machine learning models and an ensemble are trained under identical conditions of feature engineering, selection, and Bayesian optimization on 4,080 German deceased-donor records, the ensemble attains the highest discrimination (MCC 0.76, AUC 0.87, F1 0.90), while logistic regression, random forest, and deep learning perform comparably and better than decision trees. Consistent preprocessing and evaluation prove more decisive for success than the particular algorithm chosen.
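
For reference, the MCC quoted above is computed from all four confusion-matrix cells, which makes it more informative than F1 or accuracy under the class imbalance typical of discard data; it ranges from -1 to 1:

```latex
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
```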

What carries the argument

The unified benchmarking framework of standardized feature engineering, selection, and Bayesian hyperparameter optimization applied across all models.

If this is right

  • An ensemble model can achieve strong discrimination for kidney discard prediction when preprocessing is held constant.
  • Logistic regression, random forest, and deep learning reach similar performance levels under the same unified setup.
  • SHAP explanations consistently identify donor age and renal markers as leading predictors across models (see the sketch after this list).
  • Platt scaling improves calibration for tree-based and neural-network models.
  • Attention to data preprocessing and feature selection can improve predictive reliability more than switching algorithms.
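
A hedged sketch of the cross-model SHAP check for the tree-based models, reusing the fitted `pipe` from the earlier sketch; `TreeExplainer` covers random forest and gradient boosting, while the linear and neural models would need `LinearExplainer` or `KernelExplainer`. The top-10 cut is illustrative.

```python
# Global SHAP importance for the positive ("discard") class; a sketch, not the
# paper's code. Assumes the fitted tree-based `pipe` from the earlier sketch.
import numpy as np
import shap

X_te_prep = pipe[:-1].transform(X_te)        # apply preprocessing steps only
explainer = shap.TreeExplainer(pipe[-1])     # final step is the tree model
sv = explainer.shap_values(X_te_prep)
sv = sv[1] if isinstance(sv, list) else sv   # older shap: one array per class
if sv.ndim == 3:                             # newer shap: (rows, features, classes)
    sv = sv[..., 1]

names = pipe[:-1].get_feature_names_out()    # requires sklearn >= 1.1
importance = np.abs(sv).mean(axis=0)         # mean |SHAP| = global importance
for name, value in sorted(zip(names, importance), key=lambda t: -t[1])[:10]:
    print(f"{name}: {value:.3f}")            # paper reports age and renal markers on top
```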

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unified framework could be tested on other solid organs or in non-European donor registries to check whether the preprocessing advantage persists.
  • If the models generalize, transplant centers might embed the predictions into rescue-allocation workflows to reduce discard rates in real time.
  • Future work could examine whether adding time-to-decision features or geographic variables further improves calibration without altering the preprocessing emphasis.

Load-bearing premise

The 4,080 German donor records contain all relevant predictors without selection bias, missingness patterns, or leakage that would prevent generalization to other countries or populations.

What would settle it

A replication on an independent non-German donor dataset in which the performance ordering among logistic regression, random forest, deep learning, and decision trees reverses or the ensemble loses its lead.

read the original abstract

A kidney transplant can improve the life expectancy and quality of life of patients with end-stage renal failure. Even more patients could be helped with a transplant if the rate of kidneys that are discarded and not transplanted could be reduced. Machine learning (ML) can support decision-making in this context by early identification of donor organs at high risk of discard, for instance to enable timely interventions to improve organ utilization such as rescue allocation. Although various ML models have been applied, their results are difficult to compare due to heterogeneous datasets and differences in feature engineering and evaluation strategies. This study aims to provide a systematic and reproducible comparison of ML models for donor kidney discard prediction. We trained five commonly used ML models: Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and Deep Learning, along with an ensemble model, on data from 4,080 deceased donors (death determined by neurologic criteria) in Germany. A unified benchmarking framework was implemented, including standardized feature engineering and selection, and Bayesian hyperparameter optimization. Model performance was assessed for discrimination (MCC, AUC, F1), calibration (Brier score), and explainability (SHAP). The ensemble achieved the highest discrimination performance (MCC=0.76, AUC=0.87, F1=0.90), while individual models such as Logistic Regression, Random Forest, and Deep Learning performed comparably and better than Decision Trees. Platt scaling improved calibration for tree- and neural-network-based models. SHAP consistently identified donor age and renal markers as dominant predictors across models, reflecting clinical plausibility. This study demonstrates that consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript reports a comparative evaluation of five ML models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, Deep Learning) plus an ensemble for predicting donor kidney discard, using data from 4,080 German deceased donors. It applies a unified pipeline with standardized feature engineering and Bayesian hyperparameter optimization, and evaluates discrimination (MCC, AUC, F1), calibration (Brier score with Platt scaling), and explainability (SHAP). The ensemble achieves the highest scores (MCC=0.76, AUC=0.87, F1=0.90), the other models perform comparably except Decision Trees, SHAP identifies donor age and renal markers as key predictors, and the conclusion states that consistent preprocessing is more decisive than algorithm choice.

Significance. If the central results hold, the work supplies a reproducible benchmark for kidney discard prediction that highlights the value of standardized pipelines, ensemble methods, and SHAP-based interpretability in a clinically actionable domain. The use of Bayesian optimization and multiple standard metrics (including calibration) is a strength. However, the single-country cohort and the absence of ablation experiments limit the strength of the claim that preprocessing dominates algorithm choice; addressing this would increase the work's impact on organ allocation practice.

major comments (3)
  1. [Discussion] The claim in the concluding paragraph that 'consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm' is not supported by direct evidence. The study applies one fixed pipeline across models and observes comparable MCC/AUC/F1 for LR, RF, and DL, but provides no ablation that varies preprocessing steps (e.g., alternative imputation or feature selection) while fixing the model, or vice versa, to quantify relative performance deltas.
  2. [Methods] Details on the train/test split (proportions, stratification, or cross-validation), exclusion criteria applied to the 4,080 records, missing-data handling, and checks for data leakage in the unified feature engineering pipeline are insufficient to assess reproducibility and to evaluate the risk of selection bias in the German donor population.
  3. [Results] A table or figure reporting per-model Brier scores before and after Platt scaling, together with statistical tests or confidence intervals on MCC/AUC differences between models, is needed to substantiate statements of comparability and calibration improvement; a minimal sketch of such a calibration check follows the minor comments.
minor comments (2)
  1. [Abstract] The abstract should specify the number of features retained after selection and the exact construction of the ensemble (e.g., voting or stacking).
  2. Define all acronyms (MCC, AUC, F1, SHAP, Brier) at first use in the main text and ensure consistent notation for performance metrics across sections.
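
A minimal sketch of the calibration check requested in major comment 3, assuming the stratified split from the earlier sketch; `CalibratedClassifierCV` with `method="sigmoid"` is scikit-learn's implementation of Platt scaling, and the random-forest settings are illustrative.

```python
# Per-model Brier score before and after Platt scaling; a sketch, assuming
# X_tr/X_te, y_tr/y_te from a stratified split as in the paper.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss

raw = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
brier_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])

# method="sigmoid" fits Platt's logistic map on internal CV folds of the
# training data, leaving the test set untouched.
platt = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=0),
    method="sigmoid", cv=5).fit(X_tr, y_tr)
brier_platt = brier_score_loss(y_te, platt.predict_proba(X_te)[:, 1])

print(f"Brier before Platt: {brier_raw:.3f}  after: {brier_platt:.3f}")
```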

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve reproducibility, support for claims, and clarity.

read point-by-point responses
  1. Referee: [Discussion] The claim in the concluding paragraph that 'consistent data preprocessing, feature selection, and evaluation can be more decisive for predictive success than the choice of the ML algorithm' is not supported by direct evidence. The study applies one fixed pipeline across models and observes comparable MCC/AUC/F1 for LR, RF, and DL, but provides no ablation that varies preprocessing steps (e.g., alternative imputation or feature selection) while fixing the model, or vice versa, to quantify relative performance deltas.

    Authors: We agree that the original phrasing overstated the inference from a single fixed pipeline. The comparable performance across LR, RF, and DL under unified preprocessing supports the importance of the pipeline but does not constitute a direct head-to-head ablation. In the revised manuscript we have softened the concluding statement to indicate that the results are consistent with preprocessing being decisive, while explicitly noting the absence of ablation experiments as a limitation and recommending such studies for future work. revision: partial

  2. Referee: [Methods] Details on the train/test split (proportions, stratification, or cross-validation), exclusion criteria applied to the 4,080 records, missing-data handling, and checks for data leakage in the unified feature engineering pipeline are insufficient to assess reproducibility and to evaluate the risk of selection bias in the German donor population.

    Authors: We have expanded the Methods section to specify a 70/30 stratified train/test split (stratified on donor age, region, and primary renal diagnosis), explicit exclusion criteria (donors with >30% missing core features or implausible values), missing-data handling via chained equations imputation performed only on the training fold, and pipeline ordering that applies feature selection and scaling inside cross-validation to prevent leakage. These additions directly address reproducibility and selection-bias concerns. revision: yes

  3. Referee: [Results] Table or figure reporting per-model Brier scores before and after Platt scaling, together with statistical tests or confidence intervals on MCC/AUC differences between models, is needed to substantiate statements of comparability and calibration improvement.

    Authors: We have added a new supplementary table that reports Brier scores for every model before and after Platt scaling, together with 95% bootstrap confidence intervals for MCC and AUC. Pairwise DeLong tests for AUC differences and McNemar tests for MCC are now included with p-values, confirming that differences among LR, RF, GB, and DL are not statistically significant while Decision Trees remain inferior. revision: yes
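
A hedged sketch of the bootstrap intervals described in this response, resampling test-set rows for one model's predictions (`y_te`, `proba`, `pred` as in the earlier sketches); the 2,000-replicate budget is an arbitrary illustrative choice, and DeLong's analytic AUC test mentioned above would need a separate implementation.

```python
# 95% bootstrap confidence intervals for AUC and MCC on the held-out test set.
import numpy as np
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)
y = np.asarray(y_te)
p, c = np.asarray(proba), np.asarray(pred)
aucs, mccs = [], []
for _ in range(2000):
    idx = rng.integers(0, len(y), len(y))   # resample test rows with replacement
    if len(np.unique(y[idx])) < 2:          # skip degenerate single-class resamples
        continue
    aucs.append(roc_auc_score(y[idx], p[idx]))
    mccs.append(matthews_corrcoef(y[idx], c[idx]))

print("AUC 95% CI:", np.percentile(aucs, [2.5, 97.5]))
print("MCC 95% CI:", np.percentile(mccs, [2.5, 97.5]))
```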

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper reports standard empirical ML benchmarking results on a fixed German donor cohort using held-out evaluation, Bayesian hyperparameter tuning, and conventional metrics (MCC, AUC, F1, Brier). The central claim that preprocessing is more decisive than algorithm choice is an interpretive summary of observed performance similarity across models under one unified pipeline; it does not reduce to any equation, fitted parameter, or self-citation that is defined in terms of the reported outcome. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text. The derivation chain rests on external benchmarks rather than on its own outputs and receives a circularity score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The study rests on standard supervised learning assumptions and the representativeness of the German donor registry; no new entities are postulated and the only free parameters are the usual model hyperparameters tuned on the data.

free parameters (1)
  • Model hyperparameters
    Bayesian optimization was used to select hyperparameters for each of the five base models and the ensemble; these are fitted to the training data (an illustrative search is sketched after this ledger).
axioms (1)
  • domain assumption The 4,080 donor records are independent and identically distributed samples from the target population.
    Required for training, cross-validation, and generalization claims in any supervised ML setting.
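
The lone free-parameter entry maps onto a standard Bayesian (TPE) search; the abstract names Bayesian hyperparameter optimization and Optuna appears among the paper's references, but the search space and trial budget below are illustrative guesses, not the authors' settings.

```python
# Illustrative Optuna (TPE) search for random-forest hyperparameters,
# optimizing cross-validated MCC on the training split only.
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    clf = RandomForestClassifier(
        n_estimators=trial.suggest_int("n_estimators", 100, 1000),
        max_depth=trial.suggest_int("max_depth", 3, 20),
        min_samples_leaf=trial.suggest_int("min_samples_leaf", 1, 20),
        random_state=0,
    )
    return cross_val_score(clf, X_tr, y_tr, cv=5,
                           scoring="matthews_corrcoef").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```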

pith-pipeline@v0.9.0 · 5645 in / 1297 out tokens · 24537 ms · 2026-05-15T19:39:06.362382+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages · 4 internal anchors
