TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes
Pith reviewed 2026-06-30 21:10 UTC · model grok-4.3
The pith
No single imbalanced learning method dominates all tabular settings; performance depends on dataset characteristics and computational constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TILBench evaluates more than 40 representative algorithms across 57 diverse tabular datasets, resulting in over 200,000 controlled experiments across a wide range of data characteristics. Our findings show that no single method consistently dominates across all settings; instead, the effectiveness of imbalanced learning methods depends strongly on dataset characteristics and computational constraints. Based on these findings, we provide practical recommendations for selecting appropriate methods in real-world applications.
What carries the argument
TILBench, the benchmark that runs controlled comparisons of algorithm families under varied tabular data regimes and resource limits.
If this is right
- Practitioners must examine dataset traits such as imbalance ratio and dimensionality before picking a method instead of defaulting to one option.
- Compute budgets should be treated as a first-class input when choosing between oversampling, undersampling, cost-sensitive, or ensemble approaches.
- Algorithm comparisons that ignore data characteristics or runtime will produce misleading rankings for deployment.
- The benchmark supplies a starting map for matching common method families to typical data regimes encountered in applications.
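The first implication above, examining dataset traits before picking a method, can be illustrated with a minimal sketch. The function name `dataset_traits` and the definition of imbalance ratio as majority count over minority count are assumptions made for illustration; the page does not state how the paper computes these traits.

```python
from collections import Counter


def dataset_traits(rows, labels):
    """Return simple traits of a labelled tabular dataset.

    rows   : list of feature lists, one per sample
    labels : list of class labels, aligned with rows
    """
    if not rows or len(rows) != len(labels):
        raise ValueError("rows and labels must be non-empty and aligned")
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return {
        "n_samples": len(rows),
        "n_features": len(rows[0]),
        "n_classes": len(counts),
        # Assumed definition: largest class count over smallest class count.
        "imbalance_ratio": max(counts.values()) / min(counts.values()),
    }


# Toy data for illustration only: four majority samples, one minority sample.
traits = dataset_traits(
    rows=[[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.4, 0.8], [5.0, 3.0]],
    labels=[0, 0, 0, 0, 1],
)
```

Statistics like these are cheap to compute, so they can be inspected before any resampling or training is attempted.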
Where Pith is reading between the lines
- An automated selector that inspects a few dataset statistics could route new problems to the empirically strongest method family for those traits.
- The observed variability suggests value in hybrid algorithms that switch internal strategies according to detected data properties.
- Repeating the benchmark on streaming tabular data or with concept drift would test whether the same dependence on characteristics persists.
- Method developers could prioritize variants that remain effective under tight compute limits, since the results flag scalability as a frequent bottleneck.
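The automated selector suggested in the first bullet above might look roughly like the following sketch. Everything here is hypothetical: the helper names `regime_of` and `pick_family`, the cut-offs of 10 and 50, the family labels, and the numbers in the table are placeholders, not values taken from the benchmark.

```python
def regime_of(imbalance_ratio, n_features):
    """Map two dataset statistics to a coarse regime label.

    The cut-offs (10 and 50) are illustrative assumptions.
    """
    severity = "mild" if imbalance_ratio < 10 else "severe"
    width = "low_dim" if n_features < 50 else "high_dim"
    return (severity, width)


def pick_family(regime, mean_rank_table, default=None):
    """Return the family with the lowest mean rank recorded for a regime.

    mean_rank_table : {regime: {family: mean rank}}, lower is better
    """
    ranks = mean_rank_table.get(regime)
    if not ranks:
        return default  # no benchmark evidence for this regime
    return min(ranks, key=ranks.get)


# Placeholder numbers for illustration only; not results from the benchmark.
table = {
    ("mild", "low_dim"): {"family_a": 1.8, "family_b": 2.4},
    ("severe", "low_dim"): {"family_a": 2.6, "family_b": 1.5},
}
choice = pick_family(regime_of(imbalance_ratio=25.0, n_features=12), table)
fallback = pick_family(("mild", "high_dim"), table, default="baseline")
```

Returning an explicit default for unseen regimes keeps the selector from silently extrapolating beyond what was measured.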
Load-bearing premise
The 57 chosen datasets and more than 40 algorithms sufficiently cover the space of real-world tabular imbalanced learning problems so that the observed performance patterns generalize beyond the benchmark.
What would settle it
A follow-up study that identifies one algorithm or family achieving top results on the majority of the same 57 datasets across multiple imbalance ratios, sizes, and compute budgets would undermine the claim that no method dominates.
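The check described above, whether one method is best on the majority of datasets, could be sketched as follows. The function name `top_method_share`, the even splitting of ties, and the toy scores are assumptions for illustration; the methods are deliberately labelled A, B, and C rather than named after real algorithms.

```python
from collections import Counter


def top_method_share(scores):
    """Find the method that is best on the most datasets.

    scores : {dataset: {method: score}}, higher is better
    Returns (method, share of datasets on which it is best).
    """
    wins = Counter()
    for per_method in scores.values():
        best = max(per_method.values())
        winners = [m for m, s in per_method.items() if s == best]
        for m in winners:
            wins[m] += 1 / len(winners)  # assumed rule: ties split evenly
    method, n_wins = wins.most_common(1)[0]
    return method, n_wins / len(scores)


# Toy scores for illustration only; not results from the benchmark.
scores = {
    "d1": {"A": 0.81, "B": 0.78, "C": 0.70},
    "d2": {"A": 0.66, "B": 0.71, "C": 0.69},
    "d3": {"A": 0.90, "B": 0.88, "C": 0.85},
    "d4": {"A": 0.55, "B": 0.57, "C": 0.60},
}
method, share = top_method_share(scores)
wins_majority = share > 0.5
```

In this toy case the most frequent winner is best on exactly half of the datasets, so it does not reach a strict majority.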
Original abstract
Imbalanced learning remains a fundamental challenge in tabular data applications. Despite decades of research and numerous proposed algorithms, a systematic empirical understanding of how different imbalanced learning methods behave across diverse data characteristics is still lacking. In particular, it remains unclear how different method families compare in predictive performance, robustness under varying data characteristics, and computational scalability. In this work, we present the Tabular Imbalanced Learning Benchmark (TILBench), a large-scale empirical benchmark for tabular imbalanced learning. TILBench evaluates more than 40 representative algorithms across 57 diverse tabular datasets, resulting in over 200,000 controlled experiments across a wide range of data characteristics. Our findings show that no single method consistently dominates across all settings; instead, the effectiveness of imbalanced learning methods depends strongly on dataset characteristics and computational constraints. Based on these findings, we provide practical recommendations for selecting appropriate methods in real-world applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TILBench, a large-scale empirical benchmark that evaluates more than 40 imbalanced learning algorithms across 57 tabular datasets in over 200,000 controlled experiments. It reports that no single method consistently dominates across settings and that method effectiveness depends strongly on dataset characteristics and computational constraints, from which it derives practical recommendations for method selection.
Significance. If the experimental controls and coverage hold, the work supplies a useful empirical map of method behavior across data regimes in tabular imbalanced learning, a domain where practitioners often lack systematic guidance. The scale of the study and the explicit inclusion of computational scaling measurements are strengths that could inform both algorithm choice and future benchmark design.
major comments (2)
- [§4, §5] §4 (Experimental Protocol) and §5 (Results): the central claim that 'no single method consistently dominates' requires a precise definition of dominance (e.g., win-rate thresholds, handling of statistical ties, and the exact multiple-comparison correction). Without these details it is unclear whether the reported pattern is robust to reasonable variations in aggregation.
- [§3.2] §3.2 (Dataset Selection) and meta-feature analysis: while selection criteria are stated, the paper should quantify how well the 57 datasets span the space of real-world imbalance ratios, feature types, and class-overlap regimes; a sensitivity check removing the most frequent meta-feature clusters would strengthen the generalization claim.
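The sensitivity check requested in the second major comment could be sketched as below. This assumes that cluster labels have already been produced by some clustering of the dataset meta-features, which is not shown. The helper names, cluster assignments, and ranks are hypothetical toy values.

```python
from collections import Counter


def mean_ranks(ranks_by_dataset):
    """Average each method's rank over datasets.

    ranks_by_dataset : {dataset: {method: rank}}, rank 1 is best
    """
    totals, counts = Counter(), Counter()
    for ranks in ranks_by_dataset.values():
        for method, rank in ranks.items():
            totals[method] += rank
            counts[method] += 1
    return {m: totals[m] / counts[m] for m in totals}


def drop_largest_cluster(ranks_by_dataset, cluster_of):
    """Remove every dataset that belongs to the most populated cluster."""
    sizes = Counter(cluster_of[d] for d in ranks_by_dataset)
    largest, _ = sizes.most_common(1)[0]
    return {d: r for d, r in ranks_by_dataset.items() if cluster_of[d] != largest}


# Toy ranks and cluster labels for illustration only.
ranks = {
    "d1": {"A": 1, "B": 2},
    "d2": {"A": 1, "B": 2},
    "d3": {"A": 1, "B": 2},
    "d4": {"A": 2, "B": 1},
    "d5": {"A": 2, "B": 1},
}
clusters = {"d1": 0, "d2": 0, "d3": 0, "d4": 1, "d5": 1}

full = mean_ranks(ranks)
reduced = mean_ranks(drop_largest_cluster(ranks, clusters))
```

In this toy case the best method flips from A to B once the largest cluster is removed, which is the kind of instability such a check is meant to expose.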
minor comments (2)
- [Table 2, Figure 3] Table 2 and Figure 3: axis labels and legend entries should explicitly state the performance metric (e.g., AUROC vs. F1) and whether the reported values are means or medians over the 5 seeds.
- [§5.3] §5.3 (Computational Analysis): the reported wall-clock times should include the hyperparameter-search budget so readers can distinguish training cost from tuning cost.
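The separation of tuning cost from training cost requested in the last minor comment could be recorded along these lines. This is a sketch under assumptions: `timed` and `tune_and_fit` are hypothetical helpers, and the toy `fit` and `score` callables stand in for a real model and validation metric.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def tune_and_fit(candidates, fit, score):
    """Search over candidate settings, refit the best one, and time both phases."""
    tuning_seconds = 0.0
    best_params, best_score = None, float("-inf")
    for params in candidates:
        model, fit_s = timed(fit, params)
        value, score_s = timed(score, model)
        tuning_seconds += fit_s + score_s  # every trial counts as tuning cost
        if value > best_score:
            best_params, best_score = params, value
    final_model, training_seconds = timed(fit, best_params)  # final refit only
    report = {
        "tuning_seconds": tuning_seconds,
        "training_seconds": training_seconds,
        "total_seconds": tuning_seconds + training_seconds,
    }
    return final_model, report


# Toy stand-ins: the "model" is just a copy of its parameters.
fit = lambda params: dict(params)
score = lambda model: -abs(model["depth"] - 3)
model, report = tune_and_fit([{"depth": d} for d in range(1, 6)], fit, score)
```

Reporting the two figures separately lets a reader compare methods both at a fixed tuning budget and at deployment time.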
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments help clarify the robustness of our central claims and the generalizability of the benchmark. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§4, §5] §4 (Experimental Protocol) and §5 (Results): the central claim that 'no single method consistently dominates' requires a precise definition of dominance (e.g., win-rate thresholds, handling of statistical ties, and the exact multiple-comparison correction). Without these details it is unclear whether the reported pattern is robust to reasonable variations in aggregation.
  Authors: We agree that an explicit operational definition strengthens the claim. In the revision we will add to §4 a precise definition: a method is considered to 'dominate' if it obtains the best mean rank (or a win rate > 0.5) across datasets within a regime; ties are resolved by Wilcoxon signed-rank tests (α = 0.05), and multiple comparisons are corrected via the Holm-Bonferroni procedure. We will also report in §5 how sensitive the 'no single method dominates' conclusion is to reasonable variations in these thresholds and corrections.
  Revision: yes
- Referee: [§3.2] §3.2 (Dataset Selection) and meta-feature analysis: while selection criteria are stated, the paper should quantify how well the 57 datasets span the space of real-world imbalance ratios, feature types, and class-overlap regimes; a sensitivity check removing the most frequent meta-feature clusters would strengthen the generalization claim.
  Authors: We will expand §3.2 with quantitative coverage statistics: distributions and summary metrics for imbalance ratios, the proportion of categorical vs. numerical features, and class-overlap measures (e.g., the F1 complexity measure, i.e., Fisher's discriminant ratio rather than the F1-score, and nearest-neighbor overlap). We will also add a sensitivity analysis that clusters datasets by meta-features, removes the largest cluster, and re-evaluates the main findings to verify that the dependence on data characteristics remains consistent.
  Revision: yes
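The dominance definition proposed in the first response (mean rank, win rate, Holm-Bonferroni correction) could be sketched as follows. The function names, the half-win rule for tied scores, and all numbers are assumptions for illustration. The p-values are placeholders that would in practice come from paired tests such as `scipy.stats.wilcoxon`, which is not called here.

```python
def average_ranks(scores):
    """Rank methods on one dataset (1 = best, higher score is better).

    Tied methods share the average of the rank positions they span.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for method in ordered[i:j + 1]:
            ranks[method] = shared
        i = j + 1
    return ranks


def win_rate(scores_by_dataset, a, b):
    """Share of datasets where a beats b; a tie counts as half a win (assumed)."""
    wins = sum(
        1.0 if s[a] > s[b] else 0.5 if s[a] == s[b] else 0.0
        for s in scores_by_dataset.values()
    )
    return wins / len(scores_by_dataset)


def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down; returns {comparison: True if rejected}."""
    ordered = sorted(p_values, key=p_values.get)
    m = len(ordered)
    rejected = dict.fromkeys(ordered, False)
    for k, name in enumerate(ordered):
        if p_values[name] > alpha / (m - k):
            break  # stop at the first failure; later hypotheses are retained
        rejected[name] = True
    return rejected


# Toy inputs for illustration only.
tied = average_ranks({"A": 0.8, "B": 0.8, "C": 0.5})
rate = win_rate(
    {"d1": {"A": 0.8, "B": 0.7}, "d2": {"A": 0.6, "B": 0.6}, "d3": {"A": 0.5, "B": 0.9}},
    "A", "B",
)
decisions = holm_reject({"A_vs_B": 0.004, "A_vs_C": 0.030, "B_vs_C": 0.040})
```

Note that in the toy example the comparison with p = 0.040 is not rejected even though it is below 0.05, because the step-down procedure stops at the first comparison that fails its adjusted threshold.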
Circularity Check
No significant circularity
This is a purely empirical benchmark paper that evaluates >40 algorithms on 57 external public datasets via >200k controlled experiments. The central claim (no method dominates; performance depends on data characteristics) is an observed pattern from those runs, not a derivation or fitted quantity. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided text or abstract. The work is self-contained against external benchmarks and meets the criteria for a non-circular empirical study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 57 tabular datasets and 40+ algorithms are representative of real-world imbalanced learning scenarios across data regimes.