arxiv: 2509.22267 · v4 · pith:7YGU22KUnew · submitted 2025-09-26 · 💻 cs.LG · eess.SP

Towards a more realistic evaluation of machine learning models for bearing fault diagnosis

Pith reviewed 2026-05-21 21:16 UTC · model grok-4.3

keywords: bearing fault diagnosis · data leakage · machine learning evaluation · vibration signals · dataset partitioning · multi-label classification · generalization

The pith

Common ways of splitting bearing vibration data let models train and test on the same physical bearings, creating leakage that inflates accuracy scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that segment-wise and condition-wise splits of bearing datasets allow models to exploit correlations from identical physical components appearing in both training and test data. This leakage produces performance numbers that do not reflect how the models would behave when deployed on new bearings. The proposed fix is a bearing-wise split that assigns entire bearings to either training or testing only, paired with a multi-label formulation of the fault task and ROC-based metrics that ignore fault prevalence. Experiments across four public datasets further indicate that the sheer number of distinct training bearings is the dominant factor controlling generalization. The work supplies concrete partitioning rules and validation practices intended to produce more trustworthy assessments of diagnostic models.
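
To make the leakage mechanism concrete, the following minimal sketch counts how many bearings end up on both sides of a naive segment-wise split. The toy records, the `bearing_N` naming, and both helper functions are hypothetical illustrations, not the paper's code or data.

```python
import random


def segment_wise_split(records, test_fraction, seed):
    """Shuffle individual segments, ignoring which bearing they came from."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def shared_bearings(train, test):
    """Return the bearing IDs that appear on both sides of the split."""
    return {b for b, _ in train} & {b for b, _ in test}


# Hypothetical toy data: 6 bearings, 20 segments each, stored as
# (bearing_id, segment_index) pairs.
records = [(f"bearing_{b}", s) for b in range(6) for s in range(20)]

train, test = segment_wise_split(records, test_fraction=0.3, seed=0)
leaked = shared_bearings(train, test)
print(f"{len(leaked)} of 6 bearings appear in both train and test")
```

In this toy setup at least one bearing is always shared: the training side holds 84 segments, which is not a multiple of the 20 segments per bearing, so some bearing must be divided between the two sets.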

Core claim

The authors argue that leakage from non-bearing-wise partitions is the primary cause of overstated results in existing machine learning studies of bearing faults. They show that enforcing a partition by physical bearing, reformulating the problem as multi-label classification, and adopting prevalence-independent ROC metrics produces substantially lower yet more realistic performance figures, with the count of unique training bearings emerging as the decisive variable for generalization.

What carries the argument

Bearing-wise data partitioning that places all measurements from any single physical bearing exclusively into the training set or the test set.
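
A minimal sketch of such a partition is given below, assuming each record carries the ID of the physical bearing it was measured on. The function name, the record layout, and the toy data are assumptions made for illustration. Unlike the paper's scheme, this sketch does not stratify bearings by health state.

```python
import random
from collections import defaultdict


def bearing_wise_split(records, test_fraction, seed):
    """Assign whole bearings to train or test; records are (bearing_id, segment)."""
    by_bearing = defaultdict(list)
    for bearing_id, segment in records:
        by_bearing[bearing_id].append((bearing_id, segment))

    # Sort before shuffling so the seed alone determines the split.
    bearing_ids = sorted(by_bearing)
    random.Random(seed).shuffle(bearing_ids)
    n_test = max(1, round(len(bearing_ids) * test_fraction))

    test = [r for b in bearing_ids[:n_test] for r in by_bearing[b]]
    train = [r for b in bearing_ids[n_test:] for r in by_bearing[b]]
    return train, test


# Hypothetical toy data: 5 bearings, 4 segments each.
records = [(f"bearing_{b}", s) for b in range(5) for s in range(4)]
train, test = bearing_wise_split(records, test_fraction=0.4, seed=0)

train_ids = {b for b, _ in train}
test_ids = {b for b, _ in test}
print(sorted(train_ids), sorted(test_ids))
```

Because the shuffle operates on bearing IDs rather than on segments, no bearing can contribute data to both sets, whatever the seed.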

If this is right

  • Reported accuracies for bearing fault models will drop once leakage is removed, but the remaining performance will be a better predictor of behavior on unseen equipment.
  • Collecting data from a larger number of distinct bearings becomes the most effective way to improve generalization rather than refining model architecture alone.
  • Multi-label classification allows simultaneous detection of co-occurring fault types that single-label setups miss.
  • ROC-based metrics give a clearer comparison across datasets that differ in fault prevalence.
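
The multi-label encoding and the macro-averaged AUROC can be sketched as follows. The healthy, inner, and outer encodings follow the Figure 2 caption; the `[1, 1]` vector for a combined fault, the model scores, and the helper names are assumptions for illustration. AUROC is computed here by direct pairwise comparison, which is only practical for small examples.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_auroc(label_vectors, score_vectors):
    """Unweighted mean of the per-fault AUROC values, plus the per-fault list."""
    n_labels = len(label_vectors[0])
    per_label = [
        auroc([y[k] for y in label_vectors], [s[k] for s in score_vectors])
        for k in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label


# Label order: [inner race fault, outer race fault].
# "inner+outer" is an assumed encoding for a co-occurring fault.
ENCODING = {"healthy": [0, 0], "inner": [1, 0], "outer": [0, 1], "inner+outer": [1, 1]}

states = ["healthy", "inner", "outer", "inner+outer", "healthy"]
y_true = [ENCODING[s] for s in states]
# Hypothetical per-label scores from some model.
y_score = [[0.1, 0.2], [0.9, 0.3], [0.2, 0.8], [0.7, 0.1], [0.3, 0.4]]

macro, per_label = macro_auroc(y_true, y_score)
print(per_label, macro)  # [1.0, 0.5] 0.75
```

Each per-fault AUROC depends only on how positives rank against negatives for that fault, not on how many of each there are, which is why the metric is described as independent of prevalence.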

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same leakage patterns probably appear in other sensor-based diagnosis tasks that reuse measurements from the same physical units.
  • Public benchmark datasets may need to be expanded with many more independent bearings before they can support claims of robust industrial performance.
  • Engineers may have to gather site-specific bearing data rather than rely solely on existing public collections for reliable deployment.

Load-bearing premise

The bearings within each dataset are sufficiently independent that removing any one of them from the training pool still leaves a representative range of fault behaviors for real-world use.

What would settle it

If models retrained under a strict bearing-wise split on the same four datasets recover the high accuracies previously reported in the literature, the claim that leakage was the main driver of inflated performance would be refuted.

Figures

Figures reproduced from arXiv: 2509.22267 by Danilo Silva, João Paulo Vieira, Rodrigo Kobashikawa Rosa, Victor Afonso Bauler.

Figure 1: Comparison of Decision Tree (DT) and Logistic Regression (LR) accuracy across varying numbers of training bearings, evaluated under two conditions: a leakage-free test (Valid) set and a test set with data leakage (Leakage).
Hendriks et al. [5] proposed a bearing-wise splitting strategy on the CWRU dataset. Their methodology closely resembles ours, with one critical difference: healthy signals were included…
Figure 2: Exemplary bearing-level data partitioning for the generic dataset. The training set (green) and test set (blue) are disjoint at the bearing level, with a 3:2 allocation of bearings per health state. Under our multi-label framework, these states are represented by binary vectors, where a healthy bearing is encoded as [0,0], an inner race fault as [1,0], and an outer race fault as [0,1]. The partitioning of …
Figure 3: Specification of the generic bearing fault dataset, comprising 15 unique bearings, two fault modes (inner, outer), and two distinct acquisition configurations per bearing.
…datasets rarely mirror real-world conditions, where healthy states dominate until a fault develops. Secondly, the metric’s formulation assumes mutually exclusive classes, which inherently prevents the diagnosis of co-occurring faults. Cr…
Figure 4: Schematic of the Double Cross-Validation (CVM-CV) protocol applied to the UORED-VAFCLS dataset. A distinct set of 5 bearing-level splits is used for hyperparameter tuning, while a separate set of 100 splits is used for final performance evaluation.
4.2.2. Paderborn University (PU) Dataset
The Paderborn University (PU) bearing dataset [16] represents a complex and widely-used benchmark, distinguished by it…
Figure 5: Schematic of the Double Cross-Validation (CVM-CV) protocol applied to the PU dataset.
4.2.3. CWRU bearing fault dataset
The Case Western Reserve University (CWRU) bearing fault dataset [19] is a collection of experiments that involved a single pair of healthy bearings and several artificially created faulty bearings. The faults were created through electro-discharge machining, introducing point faults with…
Figure 6: Bearing configurations used in the CWRU dataset. Each cell represents a specific acquisition setup containing two bearings, one on the drive-end and one on the fan-end. Fault types are denoted as follows: I for inner race fault, O for outer race fault, B for ball fault, and H for healthy.
However, this solution is not entirely foolproof, as the dataset contains a single healthy bearing configuration, which …
Figure 7: Illustration of the 3-fold split in the CWRU dataset following our proposed hyperparameter optimization methodology. Each cell contains two bearings: one on the drive-end and one on the fan-end.
[Grid of bearing configurations not reproduced; axes: fault size (7, 14, 21) and accelerometers (D, F); panel title: "Train with healthy signa…"]
Figure 8: Example of the two healthy-bearing split scenarios in our proposed methodology.
…was adopted for model selection using the CWRU dataset. To perform evaluation, we created 50 additional 2:1 splits following the same principles illustrated in …
Figure 9: Impact of train-test split ratio on model performance on the UORED-VAFCLS dataset. The plot shows the mean Macro AUROC (and standard deviation as error bars) calculated over 100 evaluation splits for four distinct bearing-level train-to-test ratios.
Figure 10: Impact of train-test split ratio on model performance on the PU dataset using envelope spectrum as the input representation.
Figure 11: Impact of train-test split ratio on model performance on the CWRU dataset using Random Forest with hand-crafted features.
Although previous studies have investigated data leakage in bearing diagnosis, to the best of our knowledge, no prior work has done so solely by altering the test set. In our proposed experiments, we address this by keeping the training set (and thus the model) fixed, while modifying onl…
Figure 12: CWRU leakage experiment groups.
The results are presented in …
Original abstract

Reliable detection of bearing faults is essential for maintaining the safety and operational efficiency of rotating machinery. While recent advances in machine learning (ML), particularly deep learning, have shown strong performance in controlled settings, many studies fail to generalize to real-world applications due to methodological flaws, most notably data leakage. This paper investigates the issue of data leakage in vibration-based bearing fault diagnosis and its impact on model evaluation. We demonstrate that common dataset partitioning strategies, such as segment-wise and condition-wise splits, introduce spurious correlations that inflate performance metrics. To address this, we propose a rigorous, leakage-free evaluation methodology centered on bearing-wise data partitioning, ensuring no overlap between the physical components used for training and testing. Additionally, we reformulate the classification task as a multi-label problem, enabling the detection of co-occurring fault types and the use of prevalence-independent metrics based on the ROC curve. Beyond preventing leakage, we also examine the effect of dataset diversity on generalization, showing that the number of unique training bearings is a decisive factor for achieving robust performance. We evaluate our methodology on four widely adopted datasets: CWRU, Paderborn University (PU), University of Ottawa (UORED-VAFCLS) and HUST bearing. This study highlights the importance of leakage-aware evaluation protocols and provides practical guidelines for dataset partitioning, model selection, and validation, fostering the development of more trustworthy ML systems for industrial fault diagnosis applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript claims that common dataset partitioning strategies such as segment-wise and condition-wise splits in vibration-based bearing fault diagnosis introduce spurious correlations and data leakage, inflating ML model performance metrics. It proposes a bearing-wise partitioning scheme that ensures no physical bearing overlap between training and test sets, reformulates the task as multi-label classification to handle co-occurring faults with prevalence-independent ROC metrics, and reports that performance drops under this split while improving as the number of unique training bearings increases. The approach is evaluated on the CWRU, Paderborn University (PU), UORED-VAFCLS, and HUST datasets.

Significance. If the empirical findings hold, the work is significant for promoting more trustworthy evaluation protocols in industrial ML applications. It directly targets a known source of over-optimism in the bearing fault diagnosis literature by demonstrating concrete performance gaps on four public datasets and by linking generalization to training-set diversity. The multi-label reformulation and emphasis on leakage-free splits provide actionable guidelines that could improve reproducibility and real-world applicability.

major comments (1)
  1. The central claim that bearing-wise splits remove leakage rests on the assumption that physical bearings constitute the dominant source of independent variation; the manuscript should explicitly test or discuss whether shared manufacturing batches or sensor mounting effects across bearings could still induce correlations even after a bearing-wise split (see the weakest-assumption note in the stress test).
minor comments (3)
  1. The abstract and experimental sections would benefit from reporting the exact train/test bearing counts and split ratios used for each of the four datasets to allow direct replication.
  2. Clarify how the multi-label formulation handles cases where multiple fault types co-occur on the same bearing; an explicit label-encoding example or reference to the ROC implementation would improve clarity.
  3. Figure captions and table headers should explicitly state whether the reported metrics are macro-averaged or micro-averaged ROC-AUC to avoid ambiguity.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive feedback, which has helped us strengthen the discussion of assumptions underlying our proposed evaluation protocol. We address the major comment below and have made corresponding revisions to the manuscript.

Point-by-point responses
  1. Referee: The central claim that bearing-wise splits remove leakage rests on the assumption that physical bearings constitute the dominant source of independent variation; the manuscript should explicitly test or discuss whether shared manufacturing batches or sensor mounting effects across bearings could still induce correlations even after a bearing-wise split (see the weakest-assumption note in the stress test).

    Authors: We agree that physical bearings are not necessarily the only possible source of correlation and that factors such as shared manufacturing batches or sensor mounting could in principle induce residual dependencies. Our bearing-wise partitioning is designed to eliminate the most direct and commonly overlooked form of leakage (reusing multiple segments or operating conditions from the identical physical bearing), which the literature has shown to produce unrealistically high performance. In the revised manuscript we have added an explicit paragraph in the Discussion section acknowledging this limitation of the assumption, noting that the public datasets (CWRU, PU, UORED-VAFCLS, HUST) lack the batch-level or mounting metadata that an empirical stress test of these secondary effects would require. We therefore treat the bearing-wise split as a necessary but not always sufficient condition for leakage-free evaluation and recommend that future dataset releases include such metadata. This addition does not change our empirical results but clarifies the scope of the claims.

    Revision: partial

standing simulated objections not resolved
  • Explicit empirical testing of manufacturing-batch or sensor-mounting correlations is not possible with the current public datasets because they do not release the required metadata.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript is an empirical comparison of dataset partitioning schemes on four public bearing-fault datasets. Its central claims rest on measured performance differences between segment-wise, condition-wise, and bearing-wise splits, together with a multi-label ROC reformulation; none of these quantities are defined in terms of parameters fitted from the same evaluation data, nor do they reduce to self-citations or imported uniqueness theorems. The argument is therefore self-contained against external benchmarks and contains no load-bearing step that collapses to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that bearings constitute independent units for leakage purposes and on standard supervised-learning assumptions about i.i.d. data after proper splitting.

axioms (1)
  • domain assumption: Vibration signals from distinct physical bearings are independent enough that placing all data from one bearing entirely in train or test removes spurious correlations.
    This premise underpins the recommendation to use bearing-wise partitioning instead of segment-wise or condition-wise splits.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  [1] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, A. K. Nandi, Applications of machine learning to machine fault diagnosis: A review and roadmap, Mechanical Systems and Signal Processing 138 (2020) 106587

  [2] S. Kapoor, E. M. Cantrell, K. Peng, T. H. Pham, C. A. Bail, O. E. Gundersen, J. M. Hofman, J. Hullman, M. A. Lones, M. M. Malik, P. Nanayakkara, R. A. Poldrack, I. D. Raji, M. Roberts, M. J. Salganik, M. Serra-Garcia, B. M. Stewart, G. Vandewiele, A. Narayanan, Reforms: Consensus-based recommendations for machine-learning-based science, Science Advances 1...

  [3] S. Kapoor, A. Narayanan, Leakage and the reproducibility crisis in machine-learning-based science, Patterns 4 (9) (2023) 100804. doi:10.1016/j.patter.2023.100804. URL https://www.sciencedirect.com/science/article/pii/S2666389923001599

  [4] T. W. Rauber, A. L. da Silva Loca, F. d. A. Boldt, A. L. Rodrigues, F. M. Varejão, An experimental methodology to evaluate machine learning methods for fault diagnosis based on vibration signals, Expert Systems with Applications 167 (2021) 114022. doi:10.1016/j.eswa.2020.114022

  [5] J. Hendriks, P. Dumond, D. Knox, Towards better benchmarking using the CWRU bearing fault dataset, Mechanical Systems and Signal Processing 169 (2022) 108732

  [6] H. Abburi, T. Chaudhary, S. H. W. Ilyas, L. Manne, D. Mittal, D. Williams, D. Snaidauf, E. Bowen, B. Veeramani, A closer look at bearing fault classification approaches, in: Annual Conference of the PHM Society, Vol. 15, 2023

  [7] D. R. Roberts, V. Bahn, S. Ciuti, M. S. Boyce, J. Elith, G. Guillera-Arroita, S. Hauenstein, J. J. Lahoz-Monfort, B. Schröder, W. Thuiller, D. I. Warton, B. A. Wintle, F. Hartig, C. F. Dormann, Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure, Ecography 40 (8) (2017) 913–929. doi:10.1111/ecog.02881

  [8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830

  [9] S. Kaufman, S. Rosset, C. Perlich, Leakage in data mining: Formulation, detection, and avoidance, Vol. 6, 2011, pp. 556–563. doi:10.1145/2020408.2020496

  [10] M. A. Lones, How to avoid machine learning pitfalls: a guide for academic researchers, CoRR abs/2108.02497 (2021). arXiv:2108.02497. URL https://arxiv.org/abs/2108.02497

  [11] M. A. Lones, Avoiding common machine learning pitfalls, Patterns (2024) 101046. doi:10.1016/j.patter.2024.101046

  [12] I. M. D. S. Varejão, L. G. D. O. Costa, L. H. P. D. Silva, A. Rodrigues, M. P. Ribeiro, F. M. Varejão, T. Oliveira-Santos, The similarity bias problem: What it is and how it impacts vibration based intelligent fault diagnosis, Mechanical Systems and Signal Processing 235 (2025) 112822, publisher: Elsevier BV. doi:10.1016/j.ymssp.2025.112822. URL https://li...

  [13] L. Wheat, M. V. Mohrenschildt, S. Habibi, D. Al-Ani, Impact of Data Leakage in Vibration Signals Used for Bearing Fault Diagnosis, IEEE Access 12 (2024) 169879–169895, publisher: Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/access.2024.3497716. URL https://ieeexplore.ieee.org/document/10752530/

  [14] D. Wang, Y. Li, L. Jia, Y. Song, Y. Liu, Novel three-stage feature fusion method of multimodal data for bearing fault diagnosis, IEEE Transactions on Instrumentation and Measurement 70 (2021) 1–10. doi:10.1109/TIM.2021.3071232

  [15] D. Wang, Y. Li, L. Jia, Y. Song, T. Wen, Attention-based bilinear feature fusion method for bearing fault diagnosis, IEEE/ASME Transactions on Mechatronics 28 (3) (2023) 1695–1705. doi:10.1109/TMECH.2022.3223358

  [16] C. Lessmeier, J. K. Kimotho, D. Zimmer, W. Sextro, Condition monitoring of bearing damage in electromechanical drive systems by using motor current signals of electric motors: A benchmark data set for data-driven classification, in: PHM society European conference, Vol. 3, 2016

  [17] I. Tsamardinos, A. Rakhshani, V. Lagani, Performance-estimation properties of cross-validation-based protocols with simultaneous hyper-parameter optimization, International Journal on Artificial Intelligence Tools 24 (05) (2015) 1540023

  [18] M. Sehri, P. Dumond, M. Bouchard, University of Ottawa constant load and speed rolling-element bearing vibration and acoustic fault signature datasets, Data in Brief 49 (2023) 109327, publisher: Elsevier BV. doi:10.1016/j.dib.2023.109327. URL https://linkinghub.elsevier.com/retrieve/pii/S2352340923004456

  [19] W. A. Smith, R. B. Randall, Rolling element bearing diagnostics using the Case Western Reserve University data: A benchmark study, Mechanical Systems and Signal Processing 64-65 (2015) 100–131. doi:10.1016/j.ymssp.2015.04.021. URL https://www.sciencedirect.com/science/article/pii/S0888327015002034

  [20] M. González, V. G. Díaz, B. L. Pérez, B. C. P. G-Bustelo, J. P. Anzola, Bearing fault diagnosis with envelope analysis and machine learning approaches using cwru dataset, IEEE Access 11 (2023) 57796–57805. doi:10.1109/ACCESS.2023.3283466

  [21] W. Zhang, G. Peng, C. Li, Y. Chen, Z. Zhang, A New Deep Learning Model for Fault Diagnosis with Good Anti-Noise and Domain Adaptation Ability on Raw Vibration Signals, Sensors 17 (2) (2017) 425, publisher: MDPI AG. doi:10.3390/s17020425. URL https://www.mdpi.com/1424-8220/17/2/425

  [22] J. Jiao, M. Zhao, J. Lin, C. Ding, Deep Coupled Dense Convolutional Network With Complementary Data for Intelligent Fault Diagnosis, IEEE Transactions on Industrial Electronics 66 (12) (2019) 9858–9867, publisher: Institute of Electrical and Electronics Engineers (IEEE). doi:10.1109/tie.2019...

  [23] J. Van Den Hoogen, S. Bloemheuvel, M. Atzmueller, An Improved Wide-Kernel CNN for Classifying Multivariate Signals in Fault Diagnosis, in: 2020 International Conference on Data Mining Workshops (ICDMW), IEEE, Sorrento, Italy, 2020, pp. 275–283. doi:10.1109/icdmw51313.2020.00046. URL https://ieeexplore.ieee.org/document/9346555/

  [24] Q. Wei, Y. Liu, X. Ruan, A report on audio tagging with deeper cnn, 1d-convnet and 2d-convnet, DCASE, 2018. URL https://dcase.community/documents/challenge2018/technical_reports/DCASE2018_WEI_53.pdf

  [25] P. Tchatchoua, G. Graton, M. Ouladsine, J.-F. Christaud, Application of 1D ResNet for Multivariate Fault Detection on Semiconductor Manufacturing Equipment, Sensors 23 (22) (2023) 9099, publisher: MDPI AG. doi:10.3390/s23229099. URL https://www.mdpi.com/1424-8220/23/22/9099

  [26] Z. Yan, H. Liu, SMoCo: A Powerful and Efficient Method Based on Self-Supervised Learning for Fault Diagnosis of Aero-Engine Bearing under Limited Data, Mathematics 10 (15) (2022) 2796, publisher: MDPI AG. doi:10.3390/math10152796. URL https://www.mdpi.com/2227-7390/10/15/2796

J. P. Vieira et al.: Preprint submitted to Elsevier