An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification
Pith reviewed 2026-05-24 03:02 UTC · model grok-4.3
The pith
Balancing methods increase predictive multiplicity among models with similar accuracy on imbalanced data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Balancing methods inflate predictive multiplicity, measured by higher values of ambiguity, discrepancy, and obscurity, among candidate models that retain comparable predictive performance on imbalanced classification tasks.
What carries the argument
Predictive multiplicity quantified by ambiguity, discrepancy, and obscurity metrics, applied to models trained after balancing versus on the original imbalanced data.
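A minimal sketch of how two of these metrics might be computed, assuming the definitions commonly used in the predictive multiplicity literature: ambiguity as the share of observations whose predicted label is flipped by at least one competing model, and discrepancy as the largest share flipped by any single competing model. The function names, the accuracy tolerance `epsilon`, and the toy predictions are all hypothetical. Obscurity is left out because this page does not give its formula.

```python
def rashomon_set(accuracies, epsilon):
    """Indices of models whose accuracy is within epsilon of the best one."""
    best = max(accuracies)
    return [i for i, acc in enumerate(accuracies) if acc >= best - epsilon]


def ambiguity(baseline, competitors):
    """Share of observations where at least one competitor disagrees with the baseline."""
    n = len(baseline)
    flipped = sum(
        any(model[i] != baseline[i] for model in competitors) for i in range(n)
    )
    return flipped / n


def discrepancy(baseline, competitors):
    """Largest share of observations on which a single competitor disagrees."""
    n = len(baseline)
    return max(
        (sum(model[i] != baseline[i] for i in range(n)) / n for model in competitors),
        default=0.0,
    )


# Hypothetical toy inputs: four models, eight observations (not data from the paper).
accuracies = [0.90, 0.89, 0.88, 0.70]
predictions = [
    [1, 0, 0, 1, 0, 1, 0, 0],  # best model, used as the baseline
    [1, 0, 0, 1, 0, 1, 1, 0],  # disagrees on one observation
    [1, 1, 1, 1, 0, 0, 0, 0],  # disagrees on three observations
    [0, 1, 1, 0, 1, 0, 1, 1],  # too inaccurate, falls outside the Rashomon set
]

members = rashomon_set(accuracies, epsilon=0.03)
baseline_idx = max(members, key=lambda i: accuracies[i])
baseline = predictions[baseline_idx]
competitors = [predictions[i] for i in members if i != baseline_idx]

amb = ambiguity(baseline, competitors)
disc = discrepancy(baseline, competitors)
print(members, amb, disc)  # [0, 1, 2] 0.5 0.375
```

Running the same computation once on models trained on the original data and once on models trained after balancing would give the before and after values that the claim compares.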
If this is right
- Blind selection from a set of equally accurate models becomes riskier after balancing is applied.
- Validation and explanation steps must account for the larger set of conflicting predictions.
- An extended performance-gain plot can be used to monitor the trade-off between accuracy improvement and increased multiplicity.
- Different balancing methods produce different degrees of multiplicity, so method choice affects downstream stability.
Where Pith is reading between the lines
- High-stakes applications may need to prefer balancing techniques that keep multiplicity low rather than those that maximize minority-class recall alone.
- The same multiplicity analysis could be applied to other preprocessing steps such as feature scaling or missing-value imputation.
- In production, multiplicity-aware selection or ensemble methods could reduce the practical cost of the observed inflation.
Load-bearing premise
The selected real datasets and model families are representative of typical imbalanced classification problems, and the three metrics together capture the forms of multiplicity that matter for downstream decisions.
What would settle it
Repeating the experiments on a fresh collection of imbalanced datasets and model families yields no systematic rise, or even a drop, in the three multiplicity metrics after balancing.
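One way such a replication could be summarized is a paired comparison of a multiplicity metric before and after balancing across datasets. The sketch below uses an exact one-sided sign test as a simple stand-in; the page does not say which test the paper applies, and the metric values are made-up placeholders, not results from the paper.

```python
from math import comb


def sign_test_p_value(before, after):
    """One-sided exact sign test: does `after` exceed `before` more often than chance?"""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # ties are dropped
    n = len(diffs)
    k = sum(d > 0 for d in diffs)  # datasets where the metric rose
    # P(X >= k) for X ~ Binomial(n, 0.5)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


# Placeholder ambiguity values for six hypothetical datasets.
ambiguity_before = [0.10, 0.05, 0.20, 0.15, 0.08, 0.12]
ambiguity_after = [0.18, 0.09, 0.19, 0.30, 0.15, 0.20]

p = sign_test_p_value(ambiguity_before, ambiguity_after)
print(p)  # 0.109375 (5 rises out of 6 non-tied pairs)
```

A large p-value on a fresh collection, or a majority of drops, would be the kind of outcome described above as evidence against a systematic rise.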
Original abstract
Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to low performance in accurately predicting the minority class. To address this issue, balancing or resampling methods are critical data-centric AI approaches in the modeling process to improve prediction performance. However, there have been debates and questions about the functionality of these methods in recent years. In particular, many candidate models may exhibit very similar predictive performance, called the Rashomon effect, in model selection, and they may even produce different predictions for the same observations. Selecting one of these models without considering the predictive multiplicity -- which is the case of yielding conflicting models' predictions for any sample -- can result in blind selection. In this paper, the impact of balancing methods on predictive multiplicity is examined using the Rashomon effect. It is crucial because the blind model selection in data-centric AI is risky from a set of approximately equally accurate models. This may lead to severe problems in model selection, validation, and explanation. To tackle this matter, we conducted real dataset experiments to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect by using a newly proposed metric obscurity in addition to the existing ones: ambiguity and discrepancy. Our findings showed that balancing methods inflate the predictive multiplicity and yield varying results. To monitor the trade-off between the prediction performance and predictive multiplicity for conducting the modeling process responsibly, we proposed using the extended version of the performance-gain plot when balancing the training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an experimental study on real-world imbalanced classification datasets to assess how balancing/resampling methods affect predictive multiplicity under the Rashomon effect. It measures this via existing metrics (ambiguity, discrepancy) plus a newly proposed obscurity metric, reports that balancing inflates multiplicity and produces varying results across methods, and recommends an extended performance-gain plot to monitor the performance-multiplicity trade-off during model selection.
Significance. If the attribution of multiplicity inflation to balancing holds after proper controls, the work would usefully caution practitioners in data-centric AI against blind application of balancing without multiplicity checks, potentially improving responsible model selection and validation. The multi-metric approach and real-data focus are practical strengths, though the absence of controlled ablations limits immediate generalizability.
major comments (1)
- [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central claim that balancing methods inflate predictive multiplicity requires isolating the effect of resampling from dataset properties. The study uses a fixed collection of real datasets without systematic variation of imbalance ratio, controlled synthetic data with varying class overlap/separability, or ablation on these factors; any observed rise in ambiguity, discrepancy, or obscurity could therefore be an artifact of the chosen data distributions rather than a general consequence of balancing.
minor comments (2)
- [§3] The definition and motivation for the new obscurity metric (relative to ambiguity and discrepancy) would benefit from a dedicated subsection with a formal equation and a small illustrative example on a toy dataset.
- [§5] Figure captions and axis labels in the performance-gain plots should explicitly state the balancing methods and base learners used in each panel to improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive feedback on isolating the effect of balancing methods. We address the major comment below and propose targeted revisions to clarify the scope of our claims.
Point-by-point responses
Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central claim that balancing methods inflate predictive multiplicity requires isolating the effect of resampling from dataset properties. The study uses a fixed collection of real datasets without systematic variation of imbalance ratio, controlled synthetic data with varying class overlap/separability, or ablation on these factors; any observed rise in ambiguity, discrepancy, or obscurity could therefore be an artifact of the chosen data distributions rather than a general consequence of balancing.
Authors: We agree that controlled synthetic experiments or systematic ablations would provide stronger causal isolation between balancing and multiplicity. Our study deliberately focuses on real-world imbalanced datasets to reflect practical data-centric AI scenarios, where datasets exhibit natural variation in imbalance ratios, overlap, and separability. The observed inflation in ambiguity, discrepancy, and obscurity is reported as an empirical finding across these datasets rather than a universal causal claim. To address the concern, we will revise §4 and §5 to explicitly state that results are observational on the chosen real datasets, add a limitations paragraph discussing potential confounding by data properties, and include a recommendation for future controlled studies with synthetic data varying imbalance and overlap. This constitutes a partial revision focused on scope clarification and transparency rather than new experiments.
Revision: partial
Circularity Check
No circularity: purely observational experimental study with no derivations or self-referential reductions
full rationale
The paper conducts real-dataset experiments to measure effects of balancing methods on predictive multiplicity (via ambiguity, discrepancy, and a newly proposed obscurity metric). No load-bearing steps involve derivations, first-principles predictions, fitted parameters renamed as predictions, or self-citation chains that justify the central claims. All results are empirical observations; the proposed performance-gain plot extension is a visualization tool, not a definitional reduction. This matches the default expectation for non-circular experimental work.
Forward citations
Cited by 1 Pith paper
- Explainable bank failure prediction models: Counterfactual explanations to reduce the failure risk
  Compares counterfactual generation methods with balancing strategies on bank failure data, finding NICF with cost-sensitive learning produces the highest quality explanations on validity, proximity, and sparsity.