
arxiv: 2405.01557 · v4 · pith:447ZPBX2new · submitted 2024-03-22 · 💻 cs.LG

An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification

Pith reviewed 2026-05-24 03:02 UTC · model grok-4.3

classification 💻 cs.LG
keywords Rashomon effect, predictive multiplicity, imbalanced classification, balancing methods, resampling, ambiguity, discrepancy, obscurity metric

The pith

Balancing methods increase predictive multiplicity among models with similar accuracy on imbalanced data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether common resampling techniques used to fix class imbalance also raise the Rashomon effect, in which many models achieve nearly identical performance yet disagree on individual predictions. Experiments on real datasets compare models trained on balanced versus original data, tracking three multiplicity measures: ambiguity, discrepancy, and a new obscurity metric. Results indicate that balancing consistently produces more conflicting predictions across candidate models. The authors therefore recommend tracking the performance-multiplicity trade-off with an extended performance-gain plot before choosing a final model.
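
As a minimal sketch of what "nearly identical performance" can mean in practice, the snippet below collects a Rashomon set by keeping every candidate model whose validation score lies within a tolerance of the best score. The function name `rashomon_set`, the tolerance `epsilon`, and the scores are illustrative assumptions, not code or values taken from the paper.

```python
def rashomon_set(scores, epsilon=0.05):
    """Return the names of models scoring within `epsilon` of the best model.

    `scores` maps a model name to a performance value where higher is better.
    """
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if score >= best - epsilon)


# Hypothetical validation scores, for illustration only.
scores = {"tree": 0.81, "forest": 0.84, "boosting": 0.83, "logit": 0.70}
candidates = rashomon_set(scores, epsilon=0.05)
print(candidates)
```

Every model returned here would count as "approximately equally accurate", which is exactly the situation in which their individual predictions may still conflict.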

Core claim

Balancing methods inflate predictive multiplicity, measured by higher values of ambiguity, discrepancy, and obscurity, among candidate models that retain comparable predictive performance on imbalanced classification tasks.

What carries the argument

Predictive multiplicity quantified by ambiguity, discrepancy, and obscurity metrics, applied to models trained after balancing versus on the original imbalanced data.
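
The three measures can be sketched on hard class predictions as follows. Ambiguity and discrepancy follow their usual definitions in the predictive multiplicity literature, both taken relative to a reference model. This page does not give the formula for obscurity, so the version below (the average, over samples, of the share of models that disagree with the reference) is an assumption and may differ from the paper's definition. All function names and toy predictions are hypothetical.

```python
def ambiguity(reference, competitors):
    """Share of samples on which at least one competitor disagrees with the reference."""
    n = len(reference)
    conflicted = sum(
        any(preds[i] != reference[i] for preds in competitors) for i in range(n)
    )
    return conflicted / n


def discrepancy(reference, competitors):
    """Largest share of samples on which a single competitor disagrees with the reference."""
    n = len(reference)
    return max(
        sum(p != r for p, r in zip(preds, reference)) / n for preds in competitors
    )


def obscurity(reference, competitors):
    """ASSUMED definition: mean per-sample share of competitors that disagree."""
    n = len(reference)
    m = len(competitors)
    per_sample = (
        sum(preds[i] != reference[i] for preds in competitors) / m for i in range(n)
    )
    return sum(per_sample) / n


# Toy predictions for five samples (illustrative, not data from the paper).
reference = [1, 0, 1, 0, 1]
competitors = [
    [1, 0, 1, 0, 1],  # agrees everywhere
    [1, 1, 1, 0, 1],  # disagrees on sample 2
    [1, 1, 0, 0, 1],  # disagrees on samples 2 and 3
    [1, 0, 1, 0, 0],  # disagrees on sample 5
]
amb = ambiguity(reference, competitors)
disc = discrepancy(reference, competitors)
obs = obscurity(reference, competitors)
print(amb, disc, obs)
```

Ambiguity counts a sample as soon as any model conflicts, discrepancy looks at the single most divergent model, and the assumed obscurity averages how widespread the disagreement is on each sample.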

If this is right

  • Blind selection from a set of equally accurate models becomes more risky after balancing is applied.
  • Validation and explanation steps must account for the larger set of conflicting predictions.
  • An extended performance-gain plot can be used to monitor the trade-off between accuracy improvement and increased multiplicity.
  • Different balancing methods produce different degrees of multiplicity, so method choice affects downstream stability.
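
One way to read the trade-off monitoring described above is as a point with two coordinates: the change in performance and the change in multiplicity after balancing. The sketch below computes such a point and labels the quadrant it falls in. It assumes, based on the Figure 6 caption, that performance gain lies on the horizontal axis and multiplicity change on the vertical axis. The function names, zone labels, and input values are hypothetical and are not the authors' implementation.

```python
def gain_point(perf_original, perf_balanced, mult_original, mult_balanced):
    """Return (performance gain, multiplicity change) after balancing."""
    performance_gain = perf_balanced - perf_original
    multiplicity_change = mult_balanced - mult_original
    return performance_gain, multiplicity_change


def describe(point):
    """Label the quadrant of a (gain, change) point; labels are illustrative."""
    gain, change = point
    if gain > 0 and change <= 0:
        return "gain without added multiplicity"
    if gain > 0 and change > 0:
        return "gain paid for with more multiplicity"
    if change > 0:
        return "no gain, more multiplicity"
    return "no gain, no added multiplicity"


# Hypothetical numbers, for illustration only.
point = gain_point(
    perf_original=0.70, perf_balanced=0.78, mult_original=0.10, mult_balanced=0.25
)
print(point, describe(point))
```

Computing one such point per balancing method and resampling ratio would give the raw coordinates for a plot of this kind.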

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • High-stakes applications may need to prefer balancing techniques that keep multiplicity low rather than those that maximize minority-class recall alone.
  • The same multiplicity analysis could be applied to other preprocessing steps such as feature scaling or missing-value imputation.
  • In production, multiplicity-aware selection or ensemble methods could reduce the practical cost of the observed inflation.

Load-bearing premise

The selected real datasets and model families are representative of typical imbalanced classification problems, and the three metrics together capture the forms of multiplicity that matter for downstream decisions.

What would settle it

Repeating the experiments on a fresh collection of imbalanced datasets and model families yields no systematic rise, or even a drop, in the three multiplicity metrics after balancing.

Figures

Figures reproduced from arXiv: 2405.01557 by Mustafa Cavus, Przemysław Biecek.

Figure 1. Illustration of the computation of ambiguity, discrepancy, and obscurity. Assume that there are five models in the Rashomon set. To simplify the illustration, we analyzed only five samples; however, it is important to note that these two metrics were calculated on all samples. The first column represents the reference model predictions ŷ_i = f_R(X_i) for observations i = 1, 2, 3, 4, 5. The following columns show th…

Figure 2. The 2d density plot of the Rashomon metrics

Figure 3. The distribution plots of the Rashomon metrics

Figure 4. The distribution plots of the Rashomon metric

Figure 5. The distribution plots of the Rashomon metric

Figure 6. The performance gain plots of obscurity, discrepancy, variable importance order discrepancy for different balancing methods and varying partial resampling ratios. Moving the zones towards the positive way on the horizontal axis indicates an increase in performance gain, and moving towards the negative way on the vertical axis indicates a decrease in the multiplicity. The oversampling-based resampling metho…

Original abstract

Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to low performance in accurately predicting the minority class. To address this issue, balancing or resampling methods are critical data-centric AI approaches in the modeling process to improve prediction performance. However, there have been debates and questions about the functionality of these methods in recent years. In particular, many candidate models may exhibit very similar predictive performance, called the Rashomon effect, in model selection, and they may even produce different predictions for the same observations. Selecting one of these models without considering the predictive multiplicity -- which is the case of yielding conflicting models' predictions for any sample -- can result in blind selection. In this paper, the impact of balancing methods on predictive multiplicity is examined using the Rashomon effect. It is crucial because the blind model selection in data-centric AI is risky from a set of approximately equally accurate models. This may lead to severe problems in model selection, validation, and explanation. To tackle this matter, we conducted real dataset experiments to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect by using a newly proposed metric obscurity in addition to the existing ones: ambiguity and discrepancy. Our findings showed that balancing methods inflate the predictive multiplicity and yield varying results. To monitor the trade-off between the prediction performance and predictive multiplicity for conducting the modeling process responsibly, we proposed using the extended version of the performance-gain plot when balancing the training data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper conducts an experimental study on real-world imbalanced classification datasets to assess how balancing/resampling methods affect predictive multiplicity under the Rashomon effect. It measures this via existing metrics (ambiguity, discrepancy) plus a newly proposed obscurity metric, reports that balancing inflates multiplicity and produces varying results across methods, and recommends an extended performance-gain plot to monitor the performance-multiplicity trade-off during model selection.

Significance. If the attribution of multiplicity inflation to balancing holds after proper controls, the work would usefully caution practitioners in data-centric AI against blind application of balancing without multiplicity checks, potentially improving responsible model selection and validation. The multi-metric approach and real-data focus are practical strengths, though the absence of controlled ablations limits immediate generalizability.

major comments (1)
  1. [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central claim that balancing methods inflate predictive multiplicity requires isolating the effect of resampling from dataset properties. The study uses a fixed collection of real datasets without systematic variation of imbalance ratio, controlled synthetic data with varying class overlap/separability, or ablation on these factors; any observed rise in ambiguity, discrepancy, or obscurity could therefore be an artifact of the chosen data distributions rather than a general consequence of balancing.
minor comments (2)
  1. [§3] The definition and motivation for the new obscurity metric (relative to ambiguity and discrepancy) would benefit from a dedicated subsection with a formal equation and a small illustrative example on a toy dataset.
  2. [§5] Figure captions and axis labels in the performance-gain plots should explicitly state the balancing methods and base learners used in each panel to improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback on isolating the effect of balancing methods. We address the major comment below and propose targeted revisions to clarify the scope of our claims.

Point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental Setup) and §5 (Results): the central claim that balancing methods inflate predictive multiplicity requires isolating the effect of resampling from dataset properties. The study uses a fixed collection of real datasets without systematic variation of imbalance ratio, controlled synthetic data with varying class overlap/separability, or ablation on these factors; any observed rise in ambiguity, discrepancy, or obscurity could therefore be an artifact of the chosen data distributions rather than a general consequence of balancing.

    Authors: We agree that controlled synthetic experiments or systematic ablations would provide stronger causal isolation between balancing and multiplicity. Our study deliberately focuses on real-world imbalanced datasets to reflect practical data-centric AI scenarios, where datasets exhibit natural variation in imbalance ratios, overlap, and separability. The observed inflation in ambiguity, discrepancy, and obscurity is reported as an empirical finding across these datasets rather than a universal causal claim. To address the concern, we will revise §4 and §5 to explicitly state that results are observational on the chosen real datasets, add a limitations paragraph discussing potential confounding by data properties, and include a recommendation for future controlled studies with synthetic data varying imbalance and overlap. This constitutes a partial revision focused on scope clarification and transparency rather than new experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: purely observational experimental study with no derivations or self-referential reductions

full rationale

The paper conducts real-dataset experiments to measure effects of balancing methods on predictive multiplicity (via ambiguity, discrepancy, and a newly proposed obscurity metric). No load-bearing steps involve derivations, first-principles predictions, fitted parameters renamed as predictions, or self-citation chains that justify the central claims. All results are empirical observations; the proposed performance-gain plot extension is a visualization tool, not a definitional reduction. This matches the default expectation for non-circular experimental work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities beyond the standard experimental assumption that the chosen metrics and datasets are appropriate; the new obscurity metric definition is not provided so any implicit parameters remain unknown.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Explainable bank failure prediction models: Counterfactual explanations to reduce the failure risk

    cs.LG 2024-07 unverdicted novelty 4.0

    Compares counterfactual generation methods with balancing strategies on bank failure data, finding NICF with cost-sensitive learning produces the highest quality explanations on validity, proximity, and sparsity.
