Interpretable vs Learned Encoders for High-Cardinality Fraud Detection
Pith reviewed 2026-07-02 16:21 UTC · model grok-4.3
The pith
Entity embeddings match CatBoost on fraud AUC-ROC while beating tier group encoding on high-cardinality data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Entity embeddings achieve an AUC-ROC of 0.9612 on the IEEE-CIS fraud dataset, statistically tying CatBoost at 0.9602 and exceeding tier group encoding at 0.9548. The advantage arises from joint representation across multiple high-cardinality columns. Target encoding trails tier group encoding by only 0.0023, and tier group encoding keeps its auditor-friendly tier boundaries. CatBoost leads on AUC-PR at 0.822 versus 0.793 for embeddings, and TabNet collapses relative to tree-based methods when data is scarce.
What carries the argument
Entity embeddings as a learned encoder for high-cardinality categorical columns, enabling joint multi-column representation inside a fixed downstream LightGBM learner and compared directly to tier group encoding and target encoding.
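To make "joint multi-column representation inside a fixed downstream learner" concrete, here is a minimal sketch of the hand-off step only: once an embedding table has been learned for each categorical column, each row becomes the concatenation of its per-column vectors, and that dense row is what the tree learner sees. The column names (`card_id`, `email_domain`), the vector values, and the `embed_rows` helper are all hypothetical; training the tables is omitted, and the paper's embedding dimensionality is not stated in this summary.

```python
from typing import Dict, List, Sequence

EmbeddingTable = Dict[str, List[float]]


def embed_rows(
    rows: Sequence[Dict[str, str]],
    tables: Dict[str, EmbeddingTable],
    unknown: Dict[str, List[float]],
) -> List[List[float]]:
    """Concatenate per-column embedding vectors into one dense feature row."""
    features = []
    for row in rows:
        vec: List[float] = []
        for col, table in tables.items():
            # Categories never seen in training fall back to a per-column default.
            vec.extend(table.get(row[col], unknown[col]))
        features.append(vec)
    return features


# Placeholder tables (illustrative values, not learned from any data).
tables = {
    "card_id": {"c1": [0.1, 0.2], "c2": [0.3, 0.4]},
    "email_domain": {"x.com": [1.0]},
}
unknown = {"card_id": [0.0, 0.0], "email_domain": [0.0]}
rows = [
    {"card_id": "c2", "email_domain": "x.com"},
    {"card_id": "c9", "email_domain": "y.org"},
]
dense = embed_rows(rows, tables, unknown)
```

Because every column contributes several coordinates to the same row, the downstream trees can split on combinations of them, which is one plausible reading of where the joint-representation advantage comes from.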
If this is right
- Entity embeddings can serve as a competitive alternative to CatBoost for maximizing AUC-ROC on high-cardinality fraud features.
- Tier group encoding delivers an AUC-ROC within 0.0064 of entity embeddings while preserving auditor-friendly tier boundaries; target encoding scores a further 0.0023 lower.
- The choice of encoder should depend on the primary metric, since CatBoost leads on AUC-PR where embeddings do not.
- Joint multi-column representation explains the performance edge of embeddings over single-column methods like target encoding.
- Tree-based pipelines with these encoders outperform off-the-shelf TabNet under the observed data scarcity and imbalance.
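The metric dependence noted in the points above can be illustrated with a small pure-Python sketch. It computes AUC-ROC as the probability that a random positive outranks a random negative, and average precision as one common estimator of AUC-PR (the paper's exact AUC-PR estimator is not stated here, so that choice is an assumption). The function names and the toy scores are illustrative and are not taken from the paper.

```python
from typing import Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k over the ranks k at which a positive appears."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Toy data: 2 positives, 8 negatives (made-up scores for illustration only).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
model_a = [0.95, 0.45, 0.85, 0.75, 0.65, 0.55, 0.35, 0.25, 0.15, 0.05]
model_b = [0.80, 0.70, 0.85, 0.60, 0.55, 0.50, 0.35, 0.25, 0.15, 0.05]

roc_a, ap_a = auc_roc(y_true, model_a), average_precision(y_true, model_a)
roc_b, ap_b = auc_roc(y_true, model_b), average_precision(y_true, model_b)
```

On this toy data model B has the higher AUC-ROC (0.875 vs. 0.75) while model A has the higher average precision (about 0.667 vs. 0.583), because average precision rewards placing a positive at the very top of the ranking. The same kind of split is what the AUC-ROC and AUC-PR leaders in the paper exhibit.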
Where Pith is reading between the lines
- The same encoder ranking may appear in other tabular tasks that combine high-cardinality categoricals with strong class imbalance.
- Allowing per-encoder hyperparameter search could shift the observed performance gaps if certain encodings interact favorably with specific learner settings.
- The narrow gap between target encoding and tier group encoding suggests room for simple hybrid methods that trade minimal accuracy for added interpretability.
- Further tests on datasets with varying positive rates could clarify when neural approaches like TabNet become viable relative to the tree pipelines.
Load-bearing premise
That freezing the downstream LightGBM learner across encoders isolates their individual performance without missing useful hyperparameter interactions that could favor one encoder over another.
What would settle it
Re-running the stratified 5-fold cross-validation after allowing separate hyperparameter tuning for each encoder and checking whether the AUC-ROC ranking among entity embeddings, CatBoost, and tier group encoding remains unchanged.
Original abstract
A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation (CV) with three repetitions. Five of the encoders had identical frozen LightGBM learners in the downstream phase, allowing for controlled comparisons of their performance to each other. CatBoost and TabNet were included as comparisons across paradigms using different learners. The entity embeddings produced the highest AUC-ROC (0.9612), with a statistically significant tie with that of CatBoost (0.9602) and statistically superior to tier group encoding (0.9548), whereas target encoding was only 0.0023 worse than tier group encoding and the auditor-friendly tier boundaries were maintained. Off-the-shelf TabNet did not outperform tree-based pipelines and collapsed under data scarcity. On AUC-PR, CatBoost leads (0.822 vs. 0.793); no encoder dominated both metrics. Per-column analysis confirmed the embedding advantage arises from joint multi-column representation.
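The evaluation protocol in the abstract (stratified 5-fold CV, three repetitions) yields 15 train/test splits, each preserving the class ratio. The sketch below shows one way to generate such splits using only the standard library; the function name, the seed handling, and the 1,000-row toy label vector are assumptions for illustration, and the paper's actual splitting code is not shown in this summary.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Sequence, Tuple


def repeated_stratified_kfold(
    labels: Sequence[int], n_splits: int = 5, n_repeats: int = 3, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs whose folds keep the overall class ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    for rep in range(n_repeats):
        rng = random.Random(seed + rep)  # fresh shuffle for each repetition
        folds: List[List[int]] = [[] for _ in range(n_splits)]
        for idxs in by_class.values():
            shuffled = idxs[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so per-class fold sizes differ by at most one.
            for pos, i in enumerate(shuffled):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield train, test


# Synthetic labels with the 3.5% positive rate quoted in the abstract.
labels = [1] * 35 + [0] * 965
splits = list(repeated_stratified_kfold(labels))
```

In a real pipeline, scikit-learn's `RepeatedStratifiedKFold` provides the same behaviour; the hand-rolled version is only meant to show what "stratified" and "repeated" contribute.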
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates seven categorical encoding methods on the IEEE-CIS fraud dataset (590k records, 3.5% positives, 8 high-cardinality columns): five encoders (including entity embeddings, target encoding, and tier group encoding) that feed an identical frozen LightGBM learner, plus CatBoost and TabNet as cross-paradigm comparisons. Using stratified 5-fold CV with three repetitions, it reports that entity embeddings achieve the highest AUC-ROC (0.9612) under frozen LightGBM, statistically tying CatBoost (0.9602) and outperforming tier group encoding (0.9548); target encoding is only 0.0023 worse than tier group encoding. CatBoost leads on AUC-PR (0.822 vs. 0.793); TabNet underperforms. Per-column analysis attributes the embedding advantage to joint multi-column representation.
Significance. If the encoder ranking is robust, the work provides a controlled empirical benchmark showing that learned embeddings can improve AUC-ROC on high-cardinality features in imbalanced fraud detection, while interpretable alternatives such as tier group encoding keep auditor-friendly tier boundaries. The stratified repeated CV with statistical significance testing and the cross-paradigm comparisons (CatBoost, TabNet) are strengths; the finding that no method dominates both AUC-ROC and AUC-PR is a useful practical observation.
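For readers who want to see what significance testing over repeated CV can look like, the sketch below reports a mean and standard deviation per encoder and computes a corrected resampled t-statistic in the style of Nadeau and Bengio, which inflates the variance to account for overlapping training sets. Which test the paper actually used is not stated in this summary, so treating it as this one is an assumption. The per-fold scores are placeholders generated by a formula, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across CV folds."""
    return statistics.fmean(scores), statistics.stdev(scores)


def corrected_resampled_t(
    scores_a: Sequence[float], scores_b: Sequence[float], test_train_ratio: float
) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired per-fold scores.

    test_train_ratio is n_test / n_train; for k-fold CV it equals 1 / (k - 1).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    j = len(diffs)
    var_d = statistics.variance(diffs)
    if var_d == 0.0:
        raise ValueError("zero variance in paired differences")
    t = statistics.fmean(diffs) / math.sqrt((1.0 / j + test_train_ratio) * var_d)
    return t, j - 1


# Placeholder per-fold AUC-ROC values (5 folds x 3 repetitions = 15 scores each).
emb = [0.961 + 0.0005 * ((i % 5) - 2) for i in range(15)]
tier = [0.955 + 0.0005 * (((i * 2) % 5) - 2) for i in range(15)]

emb_mean, emb_std = summarize(emb)
t_stat, dof = corrected_resampled_t(emb, tier, test_train_ratio=1 / 4)
```

The resulting statistic is compared against a t distribution with 14 degrees of freedom (two-sided 5% critical value about 2.145). Reporting the mean and standard deviation alongside it would also address the missing error bars raised in the minor comments.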
major comments (1)
- [downstream phase] The central claim that entity embeddings are superior (0.9612 AUC-ROC) rests on five encoders sharing an identical frozen LightGBM configuration in the downstream phase. Because embeddings produce dense continuous vectors while tier-group and target encodings produce different scales and sparsity, optimal LightGBM hyperparameters (learning rate, max_depth, regularization) are unlikely to be invariant; the 0.0064 gap over tier-group encoding could therefore be an artifact of the fixed learner rather than an intrinsic encoder property. A sensitivity analysis or per-encoder tuning is needed to isolate the encoder effect.
minor comments (2)
- Abstract reports point estimates without error bars or standard deviations from the three CV repetitions, making it impossible to assess the practical significance of the 0.0023 difference between target and tier-group encoding.
- No details are supplied on embedding implementation (dimensionality, training procedure) or explicit checks that target or embedding leakage was prevented during encoding.
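The leakage concern in the last comment has a standard remedy for target encoding: compute each row's encoding from folds that exclude that row. The sketch below is one such out-of-fold scheme with additive (m-estimate) smoothing toward the training prior. The smoothing form, the `smoothing` parameter, and the function name are assumptions; the paper's actual leakage controls are exactly what the referee says is undocumented.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence


def oof_target_encode(
    cats: Sequence[Hashable],
    y: Sequence[int],
    folds: Sequence[Sequence[int]],
    smoothing: float = 20.0,
) -> List[float]:
    """Encode each row using target statistics from the other folds only."""
    encoded = [0.0] * len(cats)
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(cats)) if i not in held]
        prior = sum(y[i] for i in train_idx) / len(train_idx)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train_idx:
            sums[cats[i]] += y[i]
            counts[cats[i]] += 1
        for i in held_out:
            n = counts.get(cats[i], 0)
            s = sums.get(cats[i], 0.0)
            # Rare or unseen categories shrink toward the training-fold prior.
            encoded[i] = (s + smoothing * prior) / (n + smoothing)
    return encoded


# Tiny worked example with two folds.
cats = ["a", "a", "b", "b"]
y = [1, 0, 1, 1]
enc = oof_target_encode(cats, y, folds=[[0, 2], [1, 3]], smoothing=1.0)
```

Here row 0 is a positive, yet its encoding is 0.25, because only the other fold's "a" row (a negative) and the prior contribute; the row's own label never reaches its feature. The same discipline applies to embeddings, which would need to be fitted on training folds only.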
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the downstream phase of our experiments. We address the point below and outline planned revisions to strengthen the manuscript.
- Referee: [downstream phase] The central claim that entity embeddings are superior (0.9612 AUC-ROC) rests on five encoders sharing an identical frozen LightGBM configuration in the downstream phase. Because embeddings produce dense continuous vectors while tier-group and target encodings produce different scales and sparsity, optimal LightGBM hyperparameters (learning rate, max_depth, regularization) are unlikely to be invariant; the 0.0064 gap over tier-group encoding could therefore be an artifact of the fixed learner rather than an intrinsic encoder property. A sensitivity analysis or per-encoder tuning is needed to isolate the encoder effect.
  Authors: We agree that the fixed LightGBM configuration is a deliberate design choice that limits claims of intrinsic encoder superiority independent of the downstream learner. The frozen setup was selected specifically to isolate encoding effects under identical conditions, which is a standard approach for controlled benchmarking. Nevertheless, the referee correctly notes that different input representations (dense vs. sparse) could interact with hyperparameters. In the revision we will add a sensitivity analysis that varies learning rate and max_depth for the leading encoders (entity embeddings and tier-group) while holding other settings fixed; this will quantify robustness of the observed 0.0064 AUC-ROC gap. Full per-encoder hyperparameter optimization for all seven methods is computationally prohibitive within the current experimental budget, but the added sensitivity results will clarify the extent to which the ranking depends on the chosen configuration.
  Revision: partial
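The sensitivity analysis proposed in the response can be organized as a small grid over learning rate and max_depth, recording the AUC-ROC gap between the two leading encoders at each setting and checking whether its sign ever flips. The sketch below shows only that bookkeeping. The `toy_evaluate` function is a made-up stand-in for a full cross-validated LightGBM run, and its numbers are placeholders rather than results; the encoder labels and grid values are likewise assumptions.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Config = Tuple[float, int]  # (learning_rate, max_depth)


def sensitivity_gaps(
    evaluate: Callable[[str, float, int], float],
    encoders: Tuple[str, str],
    learning_rates: Sequence[float],
    max_depths: Sequence[int],
) -> Dict[Config, float]:
    """Score gap (first encoder minus second) at every grid point."""
    a, b = encoders
    return {
        (lr, depth): evaluate(a, lr, depth) - evaluate(b, lr, depth)
        for lr, depth in product(learning_rates, max_depths)
    }


def ranking_is_stable(gaps: Dict[Config, float]) -> bool:
    """True when the same encoder wins at every grid point."""
    return len({gap > 0 for gap in gaps.values()}) == 1


def toy_evaluate(encoder: str, lr: float, depth: int) -> float:
    """Placeholder scorer: each encoder peaks at a different depth (invented)."""
    peak, best_depth = {"entity_embedding": (0.960, 8), "tier_group": (0.955, 4)}[encoder]
    return peak - 0.002 * abs(depth - best_depth) - 0.05 * abs(lr - 0.05)


gaps = sensitivity_gaps(
    toy_evaluate,
    encoders=("entity_embedding", "tier_group"),
    learning_rates=[0.01, 0.05, 0.1],
    max_depths=[4, 6, 8],
)
stable = ranking_is_stable(gaps)
```

With this invented scorer the gap turns negative at max_depth 4, so `stable` is false; that is the pattern that would indicate the 0.0064 gap depends on the frozen configuration. Swapping `toy_evaluate` for a real cross-validated run is what the revision would need to supply.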
Circularity Check
No circularity in empirical benchmark study
The paper reports direct empirical measurements of encoder performance via stratified 5-fold CV on the IEEE-CIS fraud dataset, with AUC-ROC and AUC-PR computed on held-out folds under fixed or paradigm-specific learners. No derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains are present. All claims reduce to observable results on external data rather than to the paper's own inputs by construction. This is a standard self-contained benchmark evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: stratified k-fold CV produces unbiased estimates of generalization performance on imbalanced data.