Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles

Joyjit Roy; Laxmi Shaw; Samaresh Kumar Singh

arXiv: 2606.07582 · v1 · pith:CME5DAIFnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · cs.ET


Pith reviewed 2026-06-29 18:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.ET

keywords: customer churn prediction · FT-Transformer · XGBoost · stacking ensembles · tabular data · class imbalance · probability calibration · structured data


The pith

A hybrid FT-Transformer and XGBoost stacking model raises F1 to 62.10 percent on bank churn data by combining self-attention interactions with tree decision boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a stacking ensemble of feature-tokenized transformers and gradient-boosted trees outperforms a multi-layer perceptron baseline on structured tabular churn prediction. It shows this through 5x5 cross-validation results that include confidence intervals and ablation checks confirming each component adds value. A reader would care because churn prediction directly affects retention costs in banking and similar sectors, where acquiring a new customer typically costs more than keeping an existing one. The work handles class imbalance via weighted losses rather than oversampling and uses out-of-fold stacking with a logistic meta-learner for calibration. Ablations and reported metrics on a public dataset supply the concrete evidence for the hybrid advantage.

Core claim

The paper claims that the FT-Transformer captures higher-order feature interactions through self-attention while XGBoost supplies complementary gradient-boosted boundaries; their out-of-fold stacking with a logistic regression meta-learner then recalibrates probabilities and learns optimal weights, yielding 62.10 percent F1, 0.861 AUC-ROC and 0.647 PR-AUC on the bank dataset and beating the MLP baseline by 3.37 F1 points and 0.027 AUC under repeated cross-validation.

What carries the argument

Calibration-aware out-of-fold stacking of FT-Transformer and XGBoost, where the transformer supplies attention-based interaction modeling and the trees supply decision-boundary modeling before the logistic meta-learner combines and recalibrates their outputs.
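
The out-of-fold stacking pattern described above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the paper's pipeline: the two base learners (`fit_logistic` on raw features and `fit_stump`) are simple stand-ins for the FT-Transformer and XGBoost, and the synthetic data, function names and hyperparameters are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Batch gradient-descent logistic regression; returns a predict function."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return lambda x: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_stump(X, y):
    """Stand-in tree-like learner: mean label on each side of a split on feature 0."""
    thr = sum(x[0] for x in X) / len(X)
    left = [yi for x, yi in zip(X, y) if x[0] <= thr]
    right = [yi for x, yi in zip(X, y) if x[0] > thr]
    p_left = sum(left) / len(left) if left else 0.5
    p_right = sum(right) / len(right) if right else 0.5
    return lambda x: p_left if x[0] <= thr else p_right

def kfold_indices(n, k, seed=0):
    # Shuffle once, then deal indices into k disjoint folds.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oof_stack(X, y, base_fitters, k=5, seed=0):
    """Build out-of-fold base predictions, then fit a logistic meta-learner on them."""
    n = len(X)
    oof = [[0.0] * len(base_fitters) for _ in range(n)]
    for held_out in kfold_indices(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        for m, fitter in enumerate(base_fitters):
            predict = fitter(Xtr, ytr)      # never sees the held-out rows
            for i in held_out:
                oof[i][m] = predict(X[i])
    meta = fit_logistic(oof, y)             # meta-learner trained only on OOF scores
    return oof, meta

# Synthetic, imbalanced toy data (about 20% positives); purely illustrative.
rng = random.Random(42)
X, y = [], []
for _ in range(200):
    label = 1 if rng.random() < 0.2 else 0
    X.append([rng.gauss(1.5 * label, 1.0), rng.gauss(-1.0 * label, 1.0)])
    y.append(label)

oof, meta = oof_stack(X, y, [fit_logistic, fit_stump])
```

The point of the design is that every row's meta-feature comes from a base model that did not train on that row, so the meta-learner cannot simply reward the base model that memorized the training set. At inference time the base models would normally be refit on all training data before their scores are passed to `meta`.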

If this is right

Ablation results establish that both the transformer component and the stacking meta-learner contribute materially to the final scores.
Class-weighted loss functions address imbalance without generating synthetic minority samples, preserving the original distribution.
Out-of-fold stacking with logistic regression produces better-calibrated probabilities than either base model alone.
The full pipeline supplies a reproducible reference for churn tasks on heterogeneous tabular inputs containing numeric and categorical columns.
Performance holds under 95 percent confidence intervals derived from the repeated cross-validation folds.
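
To make the class-weighting point concrete, here is a minimal sketch of a class-weighted binary cross-entropy. The inverse-frequency weighting in `balanced_class_weights` is one common heuristic and is an assumption here; the page does not state which weights the paper uses, and the labels and probabilities are toy values.

```python
import math

def balanced_class_weights(labels):
    """Inverse-frequency weights n / (2 * count_c); a common heuristic, assumed here."""
    n = len(labels)
    pos = sum(labels)
    neg = n - pos
    return {0: n / (2.0 * neg), 1: n / (2.0 * pos)}

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Weighted mean binary cross-entropy; each sample counts with its class weight."""
    total, norm = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        norm += w
    return total / norm

# Toy example: 8 retained customers, 2 churners, and a model that always says 0.2.
labels = [0] * 8 + [1] * 2
probs = [0.2] * 10
weights = balanced_class_weights(labels)
loss_weighted = weighted_bce(labels, probs, weights)
loss_plain = weighted_bce(labels, probs, {0: 1.0, 1: 1.0})
```

With these weights the two churners carry as much total weight as the eight retained customers, so a model that underestimates churn is penalized more than under the unweighted loss, and no synthetic rows are created. In XGBoost a similar effect is usually obtained through the `scale_pos_weight` parameter.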

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stacking pattern could be applied to other imbalanced tabular tasks such as fraud detection or credit default to test whether the interaction capture generalizes.
Replacing the logistic meta-learner with a small neural net might further improve calibration on datasets with stronger nonlinear dependencies.
Releasing the exact hyperparameter grids and seed values would allow independent verification of whether the gains survive different optimization budgets.
The emphasis on avoiding oversampling suggests the method may preserve minority-class structure better than SMOTE-style approaches on very small positive classes.

Load-bearing premise

The measured gains arise from the hybrid architecture and stacking rather than from hyperparameter search or from properties unique to the single public bank dataset.

What would settle it

Repeating the identical 5x5 cross-validation protocol on a second public tabular churn dataset and obtaining no statistically significant lift over the MLP baseline would falsify the claim of consistent hybrid superiority.
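
One way such a test could be run is sketched below, assuming per-fold score differences (hybrid minus MLP) from a 5x5 protocol are available. Folds in repeated cross-validation share training data, so a plain paired t-test tends to overstate significance; the sketch uses the corrected resampled t statistic of Nadeau and Bengio, which is a substitution chosen for this example and not something the page says the paper used. The `diffs` values are synthetic placeholders, not results from the paper.

```python
import math
from statistics import mean, variance

T_CRIT_DF24 = 2.064  # two-sided 5% critical value of Student's t with 24 degrees of freedom

def corrected_resampled_t(diffs, k):
    """Corrected resampled t statistic for r x k-fold CV score differences."""
    n = len(diffs)                     # total train/test runs, r * k
    test_train_ratio = 1.0 / (k - 1)   # n_test / n_train in k-fold CV
    var = variance(diffs)              # sample variance (n - 1 denominator)
    return mean(diffs) / math.sqrt((1.0 / n + test_train_ratio) * var)

def naive_paired_t(diffs):
    """Ordinary paired t statistic, which treats the runs as independent."""
    n = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / n)

# Synthetic per-fold F1 differences for 5 repeats x 5 folds (illustrative only).
diffs = [0.03 + 0.01 * ((i % 5) - 2) for i in range(25)]

t_corrected = corrected_resampled_t(diffs, k=5)
t_naive = naive_paired_t(diffs)
significant = abs(t_corrected) > T_CRIT_DF24
```

On a second dataset, a corrected statistic below the critical value would be the "no statistically significant lift" outcome described above.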

Figures

Figures reproduced from arXiv: 2606.07582 by Joyjit Roy, Laxmi Shaw, Samaresh Kumar Singh.

**Figure 2.** Confusion matrix for the stacked ensemble ag…

**Figure 3.** FT-Transformer learning curves show conver…

**Figure 4.** XGBoost SHAP feature importance ranked by…

**Figure 5.** Feature-to-feature attention weights from the FT-Transformer’s final layer. Darker cells indicate stronger attention.

**Figure 6.** Permutation importance test illustrating F1-…

**Figure 8.** Precision–recall curves comparing the stacked…

**Figure 9.** Reliability diagram showing predicted versus…

Original abstract

Customer churn prediction is essential across data-driven industries such as insurance, digital banking, eCommerce, and subscription platforms, where retaining existing customers is typically more cost-effective than acquiring new ones. Predicting churn on structured datasets remains challenging due to class imbalance, nonlinear feature interactions, and heterogeneous feature types. Tree-based ensemble methods consistently demonstrate strong performance in these contexts, often outperforming conventional neural networks. This study introduces a validated hybrid architecture that integrates feature-tokenized transformers (FT-Transformer) with gradient-boosted trees through calibration-aware stacking. The proposed framework addresses persistent gaps in statistical validation, probability calibration, and reproducibility found in prior research. The FT-Transformer captures higher-order feature interactions using self-attention, while XGBoost captures gradient-boosted decision boundaries with complementary inductive biases. Class imbalance is handled using class-weighted loss functions, thereby avoiding synthetic oversampling and preserving minority-class distributions. The models are ensembled using out-of-fold (OOF) stacking with a logistic regression meta-learner, which recalibrates overconfident base model outputs and learns optimal combination weights. On a public bank churn dataset, the hybrid model achieves 62.10% F1, 0.861 AUC-ROC, and 0.647 PR-AUC, outperforming the Multi-Layer Perceptron (MLP) baseline by 3.37 F1 points and 0.027 AUC under 5x5 cross-validation with 95% confidence intervals reported. Ablation studies demonstrate that both the transformer component and stacking strategy contribute materially to performance. The proposed methodology offers a reproducible and extensible reference architecture for contemporary churn prediction on structured tabular data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies FT-Transformer plus XGBoost stacking to a bank churn dataset and reports gains over MLP, but those gains may trace to unequal tuning effort rather than the claimed components.


The headline result is a hybrid model that hits 62.10 F1, 0.861 AUC-ROC, and 0.647 PR-AUC on the public bank churn set under 5x5 CV with confidence intervals, beating an MLP baseline by 3.37 F1 points and 0.027 AUC. They use class-weighted loss, FT-Transformer for higher-order interactions, XGBoost for tree boundaries, and logistic regression on out-of-fold predictions for calibration and weighting.

What the work actually does is put together two existing pieces—feature-tokenized transformers and gradient boosting—into a stacking pipeline and run the usual ablations plus proper cross-validation. The ablations indicate both the transformer and the meta-learner add something, and the focus on calibration and avoiding synthetic oversampling is sensible for this domain. That is the useful part: a concrete, reproducible reference for practitioners who already know the individual components.

The soft spot is exactly the one in the stress-test note. The abstract claims the improvements come from the architecture and stacking strategy, yet gives no description of the hyperparameter search budget or protocol applied to the MLP baseline. On a single public dataset, a 3-point F1 lift can easily appear when one model receives more tuning trials than the other. Without matched compute or search space details, the attribution to inductive-bias complementarity stays unproven.
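
What a matched tuning budget looks like in practice can be sketched as a small harness that gives every model the same number of search trials and records that number next to the result. Random search is used here purely for brevity; the objectives, search spaces and function names are hypothetical stand-ins and do not describe the paper's actual protocol.

```python
import random

def random_search(objective, space, n_trials, seed):
    """Evaluate n_trials random configurations; return (best_score, best_config)."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

def tune_all(models, n_trials, seed=0):
    """Give every model the same trial budget and log it alongside the outcome."""
    report = {}
    for name, (objective, space) in models.items():
        score, config = random_search(objective, space, n_trials, seed)
        report[name] = {"trials": n_trials, "best_score": score, "best_config": config}
    return report

def toy_mlp_objective(config):
    # Hypothetical stand-in for "cross-validated score of an MLP with this config".
    return -abs(config["lr"] - 0.01) - 0.0001 * abs(config["width"] - 64)

def toy_gbt_objective(config):
    # Hypothetical stand-in for "cross-validated score of a boosted-tree model".
    return -abs(config["eta"] - 0.1) - 0.01 * abs(config["depth"] - 4)

models = {
    "mlp": (toy_mlp_objective, {"lr": [0.001, 0.01, 0.1], "width": [32, 64, 128]}),
    "gbt": (toy_gbt_objective, {"depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]}),
}
report = tune_all(models, n_trials=20)
```

A table built from `report`, one row per model with its trial count and search space, is the kind of evidence that would let a reader rule out tuning disparity as the source of a gap.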

This paper is for applied teams doing churn modeling who want a worked example with calibration and CV. It is not for readers looking for new theory or broad tabular insights. The empirical setup is careful enough on the surface to warrant referee time, even though the baseline comparison needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a hybrid architecture for customer churn prediction on structured tabular data that combines an FT-Transformer (for higher-order feature interactions via self-attention) with XGBoost (for gradient-boosted decision boundaries) through out-of-fold stacking with a logistic regression meta-learner for calibration. Class imbalance is addressed via class-weighted losses rather than oversampling. On a public bank churn dataset the hybrid reports 62.10% F1, 0.861 AUC-ROC and 0.647 PR-AUC under 5x5 cross-validation with 95% confidence intervals, outperforming an MLP baseline by 3.37 F1 points and 0.027 AUC; ablation studies are stated to show material contributions from both the transformer component and the stacking strategy.

Significance. If the performance deltas can be shown to arise from the claimed inductive-bias complementarity rather than unequal hyperparameter effort, the work would supply a reproducible reference architecture that explicitly addresses calibration, statistical validation and reproducibility gaps noted in prior churn-prediction literature. The use of a public dataset, repeated cross-validation and reported confidence intervals is a positive step toward falsifiable empirical claims.
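
For readers unfamiliar with the 5x5 protocol, the sketch below generates repeated stratified k-fold splits in plain Python, assuming "5x5" means five repeats of 5-fold cross-validation. The function name, seeding scheme and toy labels are assumptions for illustration; the paper's exact splitting code is not shown on this page.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, k=5, repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), keeping class ratios in every fold."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for r in range(repeats):
        rng = random.Random(seed + r)        # a different shuffle for each repeat
        folds = [[] for _ in range(k)]
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            for pos, i in enumerate(shuffled):
                folds[pos % k].append(i)     # deal each class round-robin into folds
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield r, f, train, test

# Toy labels with 20% positives, mimicking an imbalanced churn target.
labels = [1] * 20 + [0] * 80
splits = list(repeated_stratified_kfold(labels))
```

Each of the 25 splits yields one score per model, and those 25 values are what the confidence intervals are computed from.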

major comments (2)

[Abstract] Abstract (performance claims paragraph): the headline attribution of the 3.37 F1 / 0.027 AUC improvement to the FT-Transformer + XGBoost OOF stacking architecture requires that the MLP baseline received an equivalent hyperparameter optimization protocol (search space, trial count, compute budget). No such protocol is described, so the reported deltas could arise from optimization disparity on a standard public dataset rather than the claimed architecture.
[Ablation studies] Ablation studies paragraph: the claim that 'ablation studies demonstrate that both the transformer component and stacking strategy contribute materially' is load-bearing for the central contribution statement, yet the manuscript supplies no quantitative controls confirming that hyperparameter search effort was matched across the ablated variants and the MLP baseline.

minor comments (1)

[Abstract] The abstract would benefit from an explicit citation or link to the precise public bank churn dataset used (e.g., Kaggle identifier) to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these precise comments on hyperparameter comparability. We agree that explicit documentation of optimization protocols is required to substantiate the performance deltas and ablation claims. The revised manuscript will incorporate the requested details without altering the core experimental design.


Referee: [Abstract] Abstract (performance claims paragraph): the headline attribution of the 3.37 F1 / 0.027 AUC improvement to the FT-Transformer + XGBoost OOF stacking architecture requires that the MLP baseline received an equivalent hyperparameter optimization protocol (search space, trial count, compute budget). No such protocol is described, so the reported deltas could arise from optimization disparity on a standard public dataset rather than the claimed architecture.

Authors: We agree that the absence of an explicit hyperparameter protocol description for the MLP baseline weakens the attribution of gains. In revision we will add a new subsection (and corresponding appendix table) that specifies the identical Bayesian optimization procedure (search space, trial budget, early-stopping rule, and compute allocation) applied to the MLP, FT-Transformer, XGBoost, and all ensemble variants. This documentation will confirm that the reported 3.37 F1 / 0.027 AUC margins were obtained under matched effort.
revision: yes

Referee: [Ablation studies] Ablation studies paragraph: the claim that 'ablation studies demonstrate that both the transformer component and stacking strategy contribute materially' is load-bearing for the central contribution statement, yet the manuscript supplies no quantitative controls confirming that hyperparameter search effort was matched across the ablated variants and the MLP baseline.

Authors: We concur that the ablation paragraph must be supported by evidence of matched hyperparameter effort. The revised manuscript will expand the ablation section with an explicit table listing, for each ablated configuration (FT-Transformer alone, XGBoost alone, no-stacking ensemble, MLP), the search space size, number of trials executed, and final validation scores obtained under the same optimization budget used for the full hybrid model. This will provide the quantitative controls needed to attribute performance differences to architectural components rather than tuning disparity.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on public dataset are self-contained


The paper reports standard machine-learning training and 5x5 cross-validation performance of FT-Transformer, XGBoost, MLP, and their OOF-stacked ensemble on a fixed public bank-churn dataset. No equations, uniqueness theorems, or self-citations are invoked to derive the reported metrics; the numbers are obtained by fitting models to held-out folds and computing F1/AUC/PR-AUC directly from those predictions. Ablation results compare variants on the same data splits without any parameter being fitted to a target quantity and then re-presented as a prediction. The central claim therefore rests on external data and conventional evaluation rather than any self-referential reduction.
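
Computing metrics "directly from those predictions" amounts to something like the following sketch for F1 and AUC-ROC. The labels and scores are toy values, and the 0.5 decision threshold is an assumption; the page does not say how the paper chose its threshold.

```python
def f1_score(labels, preds):
    """F1 for the positive class from hard 0/1 predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy held-out fold: four retained customers, two churners.
labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
preds = [1 if s >= 0.5 else 0 for s in scores]   # assumed 0.5 threshold

f1 = f1_score(labels, preds)
auc = roc_auc(labels, scores)
```

F1 depends on the chosen threshold while AUC depends only on the ranking of scores, which is one reason the two headline numbers can move independently.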

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine learning practices and assumptions about the dataset and model combination, with many free parameters in the form of hyperparameters.

free parameters (2)

model hyperparameters
Numerous hyperparameters in FT-Transformer and XGBoost are tuned to achieve the reported performance.
class weights
Used to handle imbalance, likely tuned.

axioms (2)

domain assumption: The chosen public bank churn dataset is representative of real-world churn scenarios
Evaluation relies on this dataset without discussion of its limitations in the abstract.
domain assumption: Out-of-fold stacking with logistic regression meta-learner provides calibrated and optimal combination
Assumed to work as described.
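
The calibration assumption can be checked rather than taken on faith. A minimal sketch of expected calibration error (ECE) with equal-width bins follows; the bin count and toy inputs are assumptions, and nothing on this page says which calibration metric the paper reports.

```python
def expected_calibration_error(labels, probs, n_bins=10):
    """Weighted average gap between observed positive rate and mean predicted probability."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(labels)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        observed = sum(y for y, _ in b) / len(b)
        predicted = sum(p for _, p in b) / len(b)
        ece += len(b) / n * abs(observed - predicted)
    return ece

# Overconfident toy model: predicts 0.9 everywhere, but only half are positives.
ece_overconfident = expected_calibration_error([0, 0, 1, 1], [0.9, 0.9, 0.9, 0.9])
# Calibrated toy model: predicts 0.5 and half are positives.
ece_calibrated = expected_calibration_error([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

Comparing this quantity for each base model and for the stacked output on held-out folds would show whether the meta-learner actually improves calibration.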



[1] [1]

Zero defections: Quality comes to services,

F. F. Reichheld and W. E. Sasser, “Zero defections: Quality comes to services,” Harvard Business Review, vol. 68, no. 5, pp. 105–111, 1990. [Online]. Available: https://hbr.org/1990/09/ zero-defections-quality-comes-to-services

1990

[2] [2]

De- fection detection: Measuring and understanding the predictive accuracy of customer churn models,

S. A. Neslin, S. Gupta, W. A. Kamakura, J. Lu, and C. H. Mason, “De- fection detection: Measuring and understanding the predictive accuracy of customer churn models,” Journal of Marketing Research, vol. 43, no. 2, pp. 204–211, 2006

2006

[3] [3]

Handling class imbalance in customer churn prediction,

J. Burez and D. Van den Poel, “Handling class imbalance in customer churn prediction,” Expert Systems with Applications, vol. 36, no. 3, pp. 4626–4636, 2009

2009

[4] [4]

Xgboost: A scalable tree boosting system,

T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794

2016

[5] [5]

Revisiting deep learning models for tabular data,

Y . Gorishniy, I. Rubachev, V . Khrulkov, and A. Babenko, “Revisiting deep learning models for tabular data,” in Advances in Neural Information Processing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 18 932– 18 943

2021

[6] [6]

Why do tree-based models still outperform deep learning on tabular data?

L. Grinsztajn, E. Oyallon, and G. Varoquaux, “Why do tree-based models still outperform deep learning on tabular data?” in Advances in Neural Information Processing Systems, vol. 35, 2022

2022

[7] [7]

Tabtransformer: Tabular data modeling using contextual embeddings,

X. Huang, A. Khetan, M. Cvitkovic, and Z. Karnin, “Tabtransformer: Tabular data modeling using contextual embeddings,” in Advances in Neural Information Processing Systems, vol. 33. Curran Associates, Inc., 2020, pp. 6543–6553

2020

[8] [8]

Saint: Improved neural networks for tabular data via row attention and contrastive pre-training,

G. Somepalli, M. Goldblum, A. Schwarzschild, C. B. Bruss, and T. Gold- stein, “Saint: Improved neural networks for tabular data via row attention and contrastive pre-training,” in Advances in Neural Information Process- ing Systems, vol. 34. Curran Associates, Inc., 2021, pp. 6111–6122

2021

[9] T. Xu, Y. Ma, and K. Kim, “Telecom churn prediction system based on ensemble learning using feature grouping,” Applied Sciences, vol. 11, no. 11, p. 4742, 2021.

[10] N. Ahmad, M. J. Awan, H. Nobanee, A. M. Zain, A. Naseem, and A. Mahmoud, “Customer personality analysis for churn prediction using hybrid ensemble models and class balancing techniques,” IEEE Access, vol. 12, pp. 1865–1879, 2024.

[11] F. E. Usman-Hamza, A. O. Balogun, R. T. Amosa, L. F. Capretz, H. A. Mojeed, S. A. Salihu, A. G. Akintola, and M. A. Mabayoje, “Sampling-based novel heterogeneous multi-layer stacking ensemble method for telecom customer churn prediction,” Scientific African, vol. 24, p. e02223, 2024.

[12] W. Verbeke, D. Martens, C. Mues, and B. Baesens, “Building comprehensible customer churn prediction models with advanced rule induction techniques,” Expert Systems with Applications, vol. 38, no. 3, pp. 2354–2364, 2011.

[13] W. Verbeke, K. Dejaeger, D. Martens, J. Hur, and B. Baesens, “New insights into churn prediction in the telecommunication sector: A profit driven data mining approach,” European Journal of Operational Research, vol. 218, no. 1, pp. 211–229, 2012.

[14] B. Baesens, Analytics in a Big Data World. Hoboken, NJ: Wiley, 2014.

[15] C. Warnakulaarachchi and S. Kumarapathirage, “Predictive banking: A deep ensemble customer churn prediction model for enhanced customer retention,” in World Conference on Information Systems and Technologies. Cham: Springer Nature Switzerland, 2025, pp. 469–483.

[16] T. G. Dietterich, “Ensemble methods in machine learning,” in Multiple Classifier Systems, ser. Lecture Notes in Computer Science, vol. 1857. Berlin, Heidelberg: Springer, 2000, pp. 1–15.

[17] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[18] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.

[19] J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.

[20] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” in Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 3146–3154. [Online]. Available: https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

[21] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “Catboost: unbiased boosting with categorical features,” Advances in Neural Information Processing Systems, vol. 31, 2018.

[22] E. Algül, S. Oyucu, O. Polat, F. Harrou, and Y. Sun, “A comparative study of advanced transformer learning frameworks for water potability analysis using physicochemical parameters,” Applied Sciences, vol. 15, no. 13, p. 7262, 2025.

[23] J. Jin, B. Geng, Y. Li, J. Liang, Y. Xiao, and C. L. P. Chen, “Flexible label-induced manifold broad learning system for multiclass recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 11, pp. 16 076–16 090, 2024.

[24] J. Jin, Z. Qin, D. Yu, Y. Li, J. Liang, and C. L. P. Chen, “Regularized discriminative broad learning system for image classification,” Knowledge-Based Systems, vol. 251, p. 109306, 2022.

[25] S. O. Arik and T. Pfister, “Tabnet: Attentive interpretable tabular learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 8. AAAI Press, 2021, pp. 6679–6687.

[26] S. Sarafian, “Improving deep tabular learning,” arXiv preprint arXiv:2509.16354, 2025.

[27] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.

[28] Z.-H. Zhou, Ensemble Methods: Foundations and Algorithms. Boca Raton, FL: CRC Press, 2012.

[29] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. New York, NY: Springer, 2009. [Online]. Available: https://hastie.su.domains/ElemStatLearn

[30] L. Rokach, “Ensemble-based classifiers,” Artificial Intelligence Review, vol. 33, no. 1–2, pp. 1–39, 2010.

[31] X. Yang, Y. Zhao, and X. Chen, “A novel transformer-based stacking ensemble method with multi-model integration for cancer classification,” PeerJ Computer Science, vol. 11, p. e3314, 2025.

[32] N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter, “Tabpfn: A transformer that solves small tabular classification problems in a second,” in Proceedings of the International Conference on Learning Representations. OpenReview.net, 2023. [Online]. Available: https://openreview.net/forum?id=cp5PvcI6w8

[33] N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, J. Grabocka, and F. Hutter, “Accurate predictions on small data with a tabular foundation model,” Nature, vol. 637, pp. 319–326, 2025.

[34] J. Chen, R. Ye, X. Zhu, and H. Chen, “Excelformer: A neural network surpassing gbdts on tabular data,” arXiv preprint arXiv:2301.02819, 2023.

[35] N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola, “Autogluon-tabular: Robust and accurate automl for structured data,” arXiv preprint arXiv:2003.06505, 2020.

[36] E. LeDell and S. Poirier, “H2o automl: Scalable automatic machine learning,” in Proceedings of the ICML Workshop on Automated Machine Learning, 2020. [Online]. Available: https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf

[37] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.

[38] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” 2016.

[39] Kaggle Community, “Bank customer churn modeling dataset,” https://www.kaggle.com/datasets/barelydedicated/bank-customer-churn-modeling, 2018, accessed: 2024-12-20.

[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations, 2015. [Online]. Available: https://openreview.net/forum?id=8gmWwjFyLj

[41] N. Hollmann, S. Müller, K. Eggensperger, and F. Hutter, “Tabpfn: A transformer that solves small tabular classification problems in a second,” arXiv preprint arXiv:2207.01848, 2022.

[42] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017, pp. 4765–4774.

[43] A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in Proceedings of the 22nd International Conference on Machine Learning. New York, NY: ACM, 2005, pp. 625–632.

[44] R. Shwartz-Ziv and A. Armon, “Tabular data: Deep learning is not all you need,” Information Fusion, vol. 81, pp. 84–90, 2022.

[45] C. Guo, G. Pleiss, Y. Sun, and K. Weinberger, “On calibration of modern neural networks,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 70. PMLR, 2017, pp. 1321–1330. [Online]. Available: https://proceedings.mlr.press/v70/guo17a.html

[46] M. P. Naeini, G. F. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using bayesian binning,” in Proceedings of the 29th AAAI Conference on Artificial Intelligence. AAAI Press, 2015, pp. 2901–2907. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/9602

[47] W. Buckinx and D. Van den Poel, “Customer base analysis: Partial defection of behaviorally loyal clients in a non-contractual FMCG retail setting,” European Journal of Operational Research, vol. 164, no. 1, pp. 252–268, 2005.

JOYJIT ROY is a senior technology and program management leader with over 21 years of experience in enterprise digital transform...