arxiv: 2607.00477 · v1 · pith:WELP5X2Gnew · submitted 2026-07-01 · 💻 cs.LG · cs.CE

Interpretable vs Learned Encoders for High-Cardinality Fraud Detection

Pith reviewed 2026-07-02 16:21 UTC · model grok-4.3

classification 💻 cs.LG cs.CE
keywords high-cardinality encoding, fraud detection, entity embeddings, tier group encoding, target encoding, LightGBM, AUC-ROC, CatBoost

The pith

Entity embeddings match CatBoost on fraud AUC-ROC while beating tier group encoding on high-cardinality data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates seven categorical encoding methods on the IEEE-CIS fraud dataset of 590,540 records. Five encoders share an identical frozen LightGBM downstream model to isolate their effects, while CatBoost and TabNet serve as cross-paradigm baselines. Entity embeddings reach the highest AUC-ROC of 0.9612, statistically tying CatBoost at 0.9602 and exceeding tier group encoding at 0.9548. Target encoding trails tier group encoding by only 0.0023 yet preserves auditor-friendly boundaries. Per-column analysis attributes the embedding gain to joint multi-column representation, and CatBoost leads on AUC-PR while TabNet underperforms tree pipelines under data scarcity.

Core claim

Entity embeddings achieve an AUC-ROC of 0.9612 on the IEEE-CIS fraud dataset, statistically tying CatBoost at 0.9602 and exceeding tier group encoding at 0.9548. The advantage arises from joint representation across multiple high-cardinality columns. Target encoding falls 0.0023 behind tier group encoding while retaining clear tier boundaries. CatBoost leads on AUC-PR at 0.822 versus 0.793 for embeddings, and TabNet collapses relative to tree-based methods when data is scarce.
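
The gaps behind this claim can be checked with a few lines of arithmetic. This is a minimal sketch using only the scores quoted above; the variable names are illustrative, and the target-encoding score is not reported directly here, only implied by the stated 0.0023 gap.

```python
from decimal import Decimal

# Scores as quoted in the core claim above.
auc_roc = {
    "entity_embeddings": Decimal("0.9612"),
    "catboost": Decimal("0.9602"),
    "tier_group": Decimal("0.9548"),
}
auc_pr = {"catboost": Decimal("0.822"), "entity_embeddings": Decimal("0.793")}

gap_vs_catboost = auc_roc["entity_embeddings"] - auc_roc["catboost"]
gap_vs_tier = auc_roc["entity_embeddings"] - auc_roc["tier_group"]
pr_gap = auc_pr["catboost"] - auc_pr["entity_embeddings"]

# Implied, not reported: target encoding sits 0.0023 below tier group encoding.
implied_target_encoding = auc_roc["tier_group"] - Decimal("0.0023")

print(gap_vs_catboost, gap_vs_tier, pr_gap, implied_target_encoding)
```

The AUC-ROC gap to CatBoost (0.0010) is about six times smaller than the gap to tier group encoding (0.0064), which is consistent with one being reported as a statistical tie and the other as a significant difference.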

What carries the argument

Entity embeddings as a learned encoder for high-cardinality categorical columns, enabling joint multi-column representation inside a fixed downstream LightGBM learner and compared directly to tier group encoding and target encoding.

If this is right

  • Entity embeddings can serve as a competitive alternative to CatBoost for maximizing AUC-ROC on high-cardinality fraud features.
  • Tier group encoding comes within 0.0064 AUC-ROC of entity embeddings while keeping auditor-friendly tier boundaries, and target encoding trails it by only 0.0023.
  • The choice of encoder should depend on the primary metric, since CatBoost leads on AUC-PR where embeddings do not.
  • Joint multi-column representation explains the performance edge of embeddings over single-column methods like target encoding.
  • Tree-based pipelines with these encoders outperform off-the-shelf TabNet under the observed data scarcity and imbalance.
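
To make the "auditor-friendly boundaries" idea concrete, here is a rough sketch of what a tier-style encoder could look like. The paper's exact tiering rule is not described on this page, so everything below is an assumption: categories are ranked by their training fraud rate and split into equal-sized ordinal tiers, and the function names `fit_tiers` and `encode_tiers` are hypothetical.

```python
from collections import defaultdict

def fit_tiers(categories, labels, n_tiers=3):
    """Assumed rule: rank categories by training fraud rate, then cut the
    ranking into n_tiers equal-sized groups (tier 1 = lowest rate)."""
    positives = defaultdict(int)
    counts = defaultdict(int)
    for cat, y in zip(categories, labels):
        counts[cat] += 1
        positives[cat] += y
    rates = {cat: positives[cat] / counts[cat] for cat in counts}
    # Sort by rate, breaking ties by name so the mapping is deterministic.
    ordered = sorted(rates, key=lambda cat: (rates[cat], cat))
    return {cat: rank * n_tiers // len(ordered) + 1
            for rank, cat in enumerate(ordered)}

def encode_tiers(tiers, categories, unseen_tier=0):
    """Replace each category by its tier; unseen categories get unseen_tier."""
    return [tiers.get(cat, unseen_tier) for cat in categories]

# Toy data, purely illustrative.
train_cats = ["a", "a", "b", "b", "c", "c"]
train_y = [0, 0, 1, 0, 1, 1]
tiers = fit_tiers(train_cats, train_y)
encoded = encode_tiers(tiers, ["c", "a", "zzz"])
```

Under this assumed rule an auditor can read the mapping directly (each category belongs to exactly one tier with a known fraud-rate range), which is the kind of property a learned embedding does not offer.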

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same encoder ranking may appear in other tabular tasks that combine high-cardinality categoricals with strong class imbalance.
  • Allowing per-encoder hyperparameter search could shift the observed performance gaps if certain encodings interact favorably with specific learner settings.
  • The narrow gap between target encoding and tier group encoding suggests room for simple hybrid methods that trade minimal accuracy for added interpretability.
  • Further tests on datasets with varying positive rates could clarify when neural approaches like TabNet become viable relative to the tree pipelines.

Load-bearing premise

That freezing the downstream LightGBM learner across encoders isolates their individual performance without missing useful hyperparameter interactions that could favor one encoder over another.

What would settle it

Re-running the stratified 5-fold cross-validation after allowing separate hyperparameter tuning for each encoder and checking whether the AUC-ROC ranking among entity embeddings, CatBoost, and tier group encoding remains unchanged.
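
The evaluation protocol referred to here, stratified 5-fold CV with three repetitions, yields 15 train/test runs. A minimal standard-library sketch of such a splitter is shown below; the function name and the round-robin dealing strategy are assumptions for illustration, not the authors' code.

```python
import random

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs, n_splits * n_repeats in total."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold keeps the class ratio.
            for pos, i in enumerate(shuffled):
                folds[pos % n_splits].append(i)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train, test

# Toy labels with 10% positives (the real dataset has 3.5%).
labels = [1] * 10 + [0] * 90
runs = list(repeated_stratified_kfold(labels))
```

For the re-run described above, each encoder would be tuned and refit inside every one of these 15 training splits, so that the comparison stays on identical held-out folds.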

Figures

Figures reproduced from arXiv: 2607.00477 by Chenyu Wu, Jingjing Liu, Moxuan Zheng, Xiao Han, Zhen Zhang.

Figure 1: Per-run (N=15) Nemenyi critical-difference diagram (illustrative; CV folds are correlated). Encoders not joined by a bold bar differ at α=0.05 (CD = 2.325). Primary inference is the NB-corrected test (Table II).
[t-SNE plot text: axes "t-SNE dimension 1" and "t-SNE dimension 2"; labeled points gmail.com, yahoo.com, missing, hotmail.com, anonymous.com, aol.com, comcast.net, icloud.com; legend "E4 tier", Tier 1 (rate 0.012, 37…]
Figure 2: t-SNE of E6’s P_emaildomain embeddings, colored by E4 tier. The weak spatial alignment (ARI = 0.051) shows embeddings capture structure beyond single-column fraud-rate ordering.
… below E6. E7’s fit time varies (1263–2255 s) due to the sample-size sensitivity of the sparse-attention architecture; our comparisons use default published parameters with no architecture or hyperparameter search. Given the total A…
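
The critical difference quoted in the Figure 1 caption can be reproduced approximately from the standard Nemenyi formula, CD = q_α · sqrt(k(k+1) / (6N)). The sketch below assumes k = 7 compared methods and N = 15 runs, and takes q_0.05 = 2.949 for k = 7 from the table in Demšar (2006); these inputs are inferred from the caption rather than stated by the authors.

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Nemenyi critical difference for k methods ranked over n runs."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

# Assumed inputs: 7 methods, 15 CV runs, tabulated q at alpha = 0.05.
cd = nemenyi_cd(q_alpha=2.949, k=7, n=15)
print(round(cd, 3))
```

This gives a value of about 2.326, in line with the caption's 2.325 up to rounding of the tabulated critical value. As the caption notes, the 15 runs come from overlapping CV folds and are correlated, so the diagram is illustrative only.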
Original abstract

A total of seven categorical encoding methods were tested on the IEEE-CIS fraud benchmark dataset (590,540 records, 3.5% positives, 8 high-cardinality columns). The encoders were evaluated using a stratified 5-fold cross-validation (CV) with three repetitions. Five of the encoders had identical frozen LightGBM learners in the downstream phase, allowing for controlled comparisons of their performance to each other. CatBoost and TabNet were included as comparisons across paradigms using different learners. The entity embeddings produced the highest AUC-ROC (0.9612), with a statistically significant tie with that of CatBoost (0.9602) and statistically superior to tier group encoding (0.9548), whereas target encoding was only 0.0023 worse than tier group encoding and the auditor-friendly tier boundaries were maintained. Off-the-shelf TabNet did not outperform tree-based pipelines and collapsed under data scarcity. On AUC-PR, CatBoost leads (0.822 vs. 0.793); no encoder dominated both metrics. Per-column analysis confirmed the embedding advantage arises from joint multi-column representation.
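
The statement that no encoder dominated both metrics is easier to see with the two metrics written out. The sketch below computes AUC-ROC as the probability that a random positive outranks a random negative, and uses average precision as one common estimator of AUC-PR; the paper's exact AUC-PR estimator is not stated here, so that choice is an assumption, and the toy scores are purely illustrative.

```python
def auc_roc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of precision values at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits

# One confidently scored negative at the top of the ranking.
y_true = [0, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
roc = auc_roc(y_true, scores)
ap = average_precision(y_true, scores)
```

In this toy case a single false positive at the top costs far more in average precision than in AUC-ROC, which is why rankings on the two metrics can disagree under heavy class imbalance.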

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper evaluates seven categorical encoding methods on the IEEE-CIS fraud dataset (590,540 records, 3.5% positives, 8 high-cardinality columns): five encoders, including entity embeddings, target encoding, and tier group encoding, that feed an identical frozen LightGBM learner, plus CatBoost and TabNet as cross-paradigm baselines. Using stratified 5-fold CV with three repetitions, it reports that entity embeddings achieve the highest AUC-ROC (0.9612), statistically tying CatBoost (0.9602) and outperforming tier group encoding (0.9548); target encoding is only 0.0023 worse than tier group encoding. CatBoost leads on AUC-PR (0.822 vs. 0.793), and TabNet underperforms the tree-based pipelines. Per-column analysis attributes the embedding advantage to joint multi-column representation.

Significance. If the encoder ranking is robust, the work provides a controlled empirical benchmark showing that learned embeddings can improve AUC-ROC on high-cardinality features in imbalanced fraud detection, while interpretable alternatives (tier group and target encoding) keep auditor-friendly tier boundaries at a small cost in AUC-ROC. The stratified repeated CV with statistical significance testing and the cross-paradigm comparisons (CatBoost, TabNet) are strengths; the finding that no method dominates both AUC-ROC and AUC-PR is a useful practical observation.
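
The significance testing mentioned here is described in the figure caption as "NB-corrected", which presumably refers to the Nadeau-Bengio corrected resampled t-test. A minimal sketch of that statistic follows, assuming 15 per-run score differences and a test/train ratio of 1/4 for 5-fold CV; the per-run differences below are synthetic placeholders, not values from the paper.

```python
import math
from statistics import mean, variance

def nb_corrected_t(diffs, test_train_ratio):
    """Nadeau-Bengio corrected t statistic for per-run score differences.
    The usual 1/n variance factor is inflated by n_test / n_train to account
    for overlapping training sets across CV runs."""
    n = len(diffs)
    s2 = variance(diffs)  # sample variance, n - 1 denominator
    return mean(diffs) / math.sqrt((1.0 / n + test_train_ratio) * s2)

def naive_t(diffs):
    """Standard paired t statistic, which treats the runs as independent."""
    n = len(diffs)
    return mean(diffs) / math.sqrt(variance(diffs) / n)

# Synthetic per-run AUC-ROC differences (5 folds x 3 repeats), illustration only.
diffs = [0.0064, 0.0058, 0.0071, 0.0060, 0.0066] * 3
t_corrected = nb_corrected_t(diffs, test_train_ratio=0.25)
t_naive = naive_t(diffs)
```

The corrected statistic is compared against a t distribution with n - 1 = 14 degrees of freedom. With n = 15 and a 1/4 ratio, it is always sqrt(1/4.75), or about 0.46 times, the naive value, so differences that look decisive under the naive test can become ties.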

major comments (1)
  1. [downstream phase] The central claim that entity embeddings are superior (0.9612 AUC-ROC) rests on five encoders sharing an identical frozen LightGBM configuration in the downstream phase. Because embeddings produce dense continuous vectors while tier-group and target encodings produce different scales and sparsity, optimal LightGBM hyperparameters (learning rate, max_depth, regularization) are unlikely to be invariant; the 0.0064 gap over tier-group encoding could therefore be an artifact of the fixed learner rather than an intrinsic encoder property. A sensitivity analysis or per-encoder tuning is needed to isolate the encoder effect.
minor comments (2)
  1. Abstract reports point estimates without error bars or standard deviations from the three CV repetitions, making it impossible to assess the practical significance of the 0.0023 difference between target and tier-group encoding.
  2. No details are supplied on embedding implementation (dimensionality, training procedure) or explicit checks that target or embedding leakage was prevented during encoding.
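
The leakage check requested in the second minor comment usually amounts to confirming that each row's encoding never uses that row's own label. Below is a minimal sketch of out-of-fold target encoding with additive (m-estimate) smoothing; it is a generic illustration, not the paper's implementation, and the function names and the `smoothing` parameter are assumptions.

```python
from collections import defaultdict

def fit_target_stats(cats, ys, smoothing=10.0):
    """Smoothed per-category target mean, shrunk toward the global prior."""
    prior = sum(ys) / len(ys)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(cats, ys):
        sums[cat] += y
        counts[cat] += 1
    enc = {cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
           for cat in counts}
    return enc, prior

def oof_target_encode(cats, ys, fold_ids, smoothing=10.0):
    """Encode each row with statistics fitted on the other folds only."""
    out = [None] * len(cats)
    for k in sorted(set(fold_ids)):
        train = [i for i, f in enumerate(fold_ids) if f != k]
        enc, prior = fit_target_stats(
            [cats[i] for i in train], [ys[i] for i in train], smoothing)
        for i, f in enumerate(fold_ids):
            if f == k:
                # Categories unseen in the training folds fall back to the prior.
                out[i] = enc.get(cats[i], prior)
    return out

# Toy example with smoothing disabled so the effect is easy to read.
cats = ["a", "a", "b", "b"]
ys = [1, 0, 1, 1]
fold_ids = [0, 1, 0, 1]
encoded = oof_target_encode(cats, ys, fold_ids, smoothing=0.0)
```

Row 0 has label 1 but is encoded as 0.0, because only the other fold's "a" row (label 0) contributes. Inside a cross-validation loop, the same rule applies one level up: the encoder is refit within each training split and never sees the held-out fold.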

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the downstream phase of our experiments. We address the point below and outline planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [downstream phase] The central claim that entity embeddings are superior (0.9612 AUC-ROC) rests on five encoders sharing an identical frozen LightGBM configuration in the downstream phase. Because embeddings produce dense continuous vectors while tier-group and target encodings produce different scales and sparsity, optimal LightGBM hyperparameters (learning rate, max_depth, regularization) are unlikely to be invariant; the 0.0064 gap over tier-group encoding could therefore be an artifact of the fixed learner rather than an intrinsic encoder property. A sensitivity analysis or per-encoder tuning is needed to isolate the encoder effect.

    Authors: We agree that the fixed LightGBM configuration is a deliberate design choice that limits claims of intrinsic encoder superiority independent of the downstream learner. The frozen setup was selected specifically to isolate encoding effects under identical conditions, which is a standard approach for controlled benchmarking. Nevertheless, the referee correctly notes that different input representations (dense vs. sparse) could interact with hyperparameters. In the revision we will add a sensitivity analysis that varies learning rate and max_depth for the leading encoders (entity embeddings and tier-group) while holding other settings fixed; this will quantify robustness of the observed 0.0064 AUC-ROC gap. Full per-encoder hyperparameter optimization for all seven methods is computationally prohibitive within the current experimental budget, but the added sensitivity results will clarify the extent to which the ranking depends on the chosen configuration.

    Revision: partial
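
The proposed sensitivity analysis could be organized along the following lines. This is only a structural sketch: `evaluate` stands in for a full cross-validated training run and is replaced here by a placeholder that returns made-up scores, and the grid values are arbitrary examples rather than settings from the paper.

```python
from itertools import product

def sensitivity_grid(evaluate, encoders, learning_rates, max_depths):
    """Score every (encoder, learning_rate, max_depth) combination."""
    return {
        (enc, lr, depth): evaluate(enc, lr, depth)
        for enc, lr, depth in product(encoders, learning_rates, max_depths)
    }

def ranking_is_stable(results, encoders, learning_rates, max_depths):
    """True if the same encoder wins at every hyperparameter setting."""
    winners = {
        max(encoders, key=lambda enc: results[(enc, lr, depth)])
        for lr, depth in product(learning_rates, max_depths)
    }
    return len(winners) == 1

def placeholder_evaluate(encoder, learning_rate, max_depth):
    """Placeholder only: returns synthetic scores, not real AUC values."""
    base = {"entity_embeddings": 0.9, "tier_group": 0.8}[encoder]
    return base - 0.01 * abs(max_depth - 6)

encoders = ["entity_embeddings", "tier_group"]
learning_rates = [0.01, 0.05, 0.1]
max_depths = [4, 6, 8]
results = sensitivity_grid(placeholder_evaluate, encoders, learning_rates, max_depths)
stable = ranking_is_stable(results, encoders, learning_rates, max_depths)
```

A stable winner across the grid would support reading the 0.0064 gap as an encoder effect; a winner that flips at some settings would support the referee's concern about the frozen learner.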

Circularity Check

0 steps flagged

No circularity in empirical benchmark study

The paper reports direct empirical measurements of encoder performance via stratified 5-fold CV on the IEEE-CIS fraud dataset, with AUC-ROC and AUC-PR computed on held-out folds under fixed or paradigm-specific learners. No derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains are present. All claims reduce to observable results on external data rather than to the paper's own inputs by construction. This is a standard self-contained benchmark evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests entirely on the reported AUC numbers from the benchmark; no free parameters, new axioms, or invented entities are introduced beyond standard supervised learning assumptions.

axioms (1)
  • Domain assumption: Stratified k-fold CV produces unbiased estimates of generalization performance on imbalanced data.
    Invoked by the choice of stratified 5-fold CV with repetitions.
