Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile
Pith reviewed 2026-05-07 16:28 UTC · model grok-4.3
The pith
The Error Sensitivity Profile quantifies how errors in specific features degrade classification model performance to guide targeted data cleaning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Error Sensitivity Profile (ESP) is proposed as a metric that quantifies the sensitivity of model performance to errors in a single feature or in multiple features. An integrated suite of tools called dirty is created to support its computation. Extensive experiments on two widely used datasets using 14 classification models show that performance degradation is not always predictable from simple correlations with the target variable.
What carries the argument
The Error Sensitivity Profile (ESP), a metric that measures the change in model performance when controlled errors are introduced into one or more features.
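The idea of measuring the performance change under controlled errors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy data, the nearest-centroid classifier, the noise-based error model, and the function names (`make_data`, `inject_errors`, `sensitivity_profile`) are hypothetical and are not the paper's definition of ESP or the API of the dirty tool suite.

```python
import random

def make_data(n, rng):
    """Toy data (assumption): feature 0 carries the label signal, feature 1 is noise."""
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append([rng.gauss(2.0 * y, 0.5), rng.gauss(0.0, 1.0)])
        labels.append(y)
    return rows, labels

def fit_centroids(rows, labels):
    """Stand-in model: nearest-centroid classifier, one mean vector per class."""
    centroids = {}
    for c in set(labels):
        members = [r for r, y in zip(rows, labels) if y == c]
        centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def accuracy(centroids, rows, labels):
    """Fraction of rows whose closest centroid matches the true label."""
    def predict(r):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def inject_errors(rows, feature, rate, rng):
    """Return a copy in which a `rate` fraction of `feature` values is replaced by noise."""
    dirty = [list(r) for r in rows]
    k = int(round(rate * len(dirty)))
    for i in rng.sample(range(len(dirty)), k):
        dirty[i][feature] = rng.gauss(0.0, 10.0)  # hypothetical error model
    return dirty

def sensitivity_profile(train, y_train, test, y_test, feature, rates, seed=0):
    """List of (error rate, accuracy drop relative to the clean-data baseline)."""
    rng = random.Random(seed)
    base = accuracy(fit_centroids(train, y_train), test, y_test)
    profile = []
    for rate in rates:
        dirty = inject_errors(train, feature, rate, rng)
        acc = accuracy(fit_centroids(dirty, y_train), test, y_test)
        profile.append((rate, base - acc))
    return profile

if __name__ == "__main__":
    rng = random.Random(42)
    train, y_train = make_data(200, rng)
    test, y_test = make_data(200, rng)
    for feature in (0, 1):
        print(feature, sensitivity_profile(train, y_train, test, y_test,
                                           feature, rates=[0.0, 0.2, 0.5]))
```

Errors are injected into the training data only, following the abstract's emphasis on training-data quality. Whether the paper also corrupts test data is not stated in this review.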
Load-bearing premise
That the sensitivity rankings produced by ESP on the two tested datasets will generalize to other datasets and real-world data-cleaning decisions.
What would settle it
Running the same error-injection experiments on a third dataset and finding that cleaning low-ESP features improves accuracy more than cleaning high-ESP features would falsify the claim that ESP reliably identifies the most damaging errors.
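The falsification test described above could be checked along the following lines. This is only a sketch: the function name `falsifies_esp`, the feature names, and all numbers are hypothetical placeholders, and comparing the mean gain of the k lowest-ESP features against the k highest-ESP features is one possible reading of the test, not a procedure from the paper.

```python
def falsifies_esp(esp, cleaning_gain, k=1):
    """True if cleaning the k lowest-ESP features yields a larger mean
    accuracy gain than cleaning the k highest-ESP features."""
    ranked = sorted(esp, key=esp.get)  # feature names, ascending ESP score
    low, high = ranked[:k], ranked[-k:]

    def mean_gain(names):
        return sum(cleaning_gain[f] for f in names) / len(names)

    return mean_gain(low) > mean_gain(high)

# Hypothetical ESP scores and accuracy gains measured after cleaning each feature.
esp = {"age": 0.12, "income": 0.03, "region": 0.01}
gain = {"age": 0.08, "income": 0.02, "region": 0.00}
print(falsifies_esp(esp, gain))  # False: the high-ESP feature gains the most
```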
Original abstract
The quality of training data is critical to the performance of machine learning models. In this paper, the Error Sensitivity Profile (ESP) is proposed. It quantifies the sensitivity of model performance to errors in a single feature or in multiple features. By leveraging ESP, data-cleaning efforts can be prioritized based on error types and features most likely to affect model performance. To support the computation of this metric, an integrated suite of tools, called dirty, is created. We conduct an extensive experimental study on two widely used datasets using 14 classification models, revealing that performance degradation is not always predictable from simple correlations with the target variable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Error Sensitivity Profile (ESP) metric to quantify how classification model performance degrades in response to errors in one or more input features. It introduces an accompanying software suite called 'dirty' to compute the metric and reports an experimental study on two datasets using 14 classification models. The central empirical finding is that performance degradation is not always predictable from simple feature-target correlations, implying that ESP can be used to prioritize data-cleaning efforts toward the most impactful error types and features.
Significance. If the ESP metric and its associated findings hold under broader validation, the work could meaningfully improve data-quality workflows in machine learning by providing a systematic way to rank error sources by their effect on downstream performance. The release of the 'dirty' tool is a concrete strength that supports reproducibility and immediate practical use. The observation that simple correlations fail to predict degradation challenges a common heuristic in data preprocessing and could shift how practitioners allocate cleaning resources.
major comments (2)
- [Experimental study] Experimental study (two datasets, 14 models): the claim that ESP enables general prioritization of data-cleaning efforts rests on the observation that degradation is not predictable from target correlations. With validation confined to only two datasets, this non-predictability may be an artifact of the specific data distributions or model families tested; the manuscript should either expand the experimental suite or explicitly bound the scope of the prioritization guidance.
- [ESP definition] ESP definition and error-injection procedure: the utility of ESP for real-world cleaning prioritization requires that the synthetic error model used to construct the profile corresponds to plausible data-quality issues. The manuscript should supply a clearer justification or sensitivity analysis showing that the injected error distributions align with observed real-world error patterns; otherwise the leap from controlled injections to actionable cleaning priorities remains under-supported.
minor comments (2)
- [Abstract] The abstract states the proposal and key finding but omits any quantitative detail (e.g., number of error rates tested, magnitude of observed effects, or statistical significance), reducing its value as a standalone summary.
- [Metric definition] Notation for the multi-feature ESP extension should be introduced with an explicit equation or pseudocode to avoid ambiguity when readers compare single-feature versus joint-feature sensitivity.
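As an illustration of the kind of explicit notation requested in the last comment, one possible way to write single-feature and joint-feature sensitivity is sketched below. All symbols are assumptions introduced here; none of them is taken from the manuscript, and the authors' actual definition may differ.

```latex
% Hypothetical notation, not taken from the manuscript.
% D: clean training set; S: set of corrupted features; e: error type; r: error rate.
% D_{S,e,r}: D after injecting errors of type e at rate r into the features in S.
% P(M, D): test performance (e.g., accuracy) of model M trained on D.
\[
  \Delta_M(S, e, r) \;=\; P(M, D) \;-\; P\bigl(M, D_{S,e,r}\bigr),
  \qquad
  \mathrm{ESP}_M(S, e) \;=\; \bigl(\Delta_M(S, e, r_1), \dots, \Delta_M(S, e, r_K)\bigr).
\]
% Single-feature case: S = \{j\}. Joint-feature case: |S| > 1.
```

Under this notation the single-feature and joint-feature profiles share one formula and differ only in the size of S, which would make the two cases directly comparable.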
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and applicability of our work. We address each major point below and indicate the planned revisions.
- Referee: [Experimental study] Experimental study (two datasets, 14 models): the claim that ESP enables general prioritization of data-cleaning efforts rests on the observation that degradation is not predictable from target correlations. With validation confined to only two datasets, this non-predictability may be an artifact of the specific data distributions or model families tested; the manuscript should either expand the experimental suite or explicitly bound the scope of the prioritization guidance.
  Authors: We agree that the experimental scope is limited to two datasets and that this constrains the generality of prioritization guidance. Rather than expanding the suite at this stage, we will explicitly bound our claims in the revised manuscript by adding a dedicated limitations paragraph stating that the observed non-predictability from target correlations holds for the examined datasets and model families, and that broader validation across additional domains would be required before treating ESP as a universal prioritization tool. This directly follows the referee's suggested alternative.
  Revision: yes
- Referee: [ESP definition] ESP definition and error-injection procedure: the utility of ESP for real-world cleaning prioritization requires that the synthetic error model used to construct the profile corresponds to plausible data-quality issues. The manuscript should supply a clearer justification or sensitivity analysis showing that the injected error distributions align with observed real-world error patterns; otherwise the leap from controlled injections to actionable cleaning priorities remains under-supported.
  Authors: We acknowledge that the manuscript currently describes the error-injection procedure without sufficient linkage to real-world patterns. In revision we will add a new subsection under the ESP definition that (i) cites established data-quality literature on common error types (e.g., attribute noise, missingness, and label errors) and (ii) reports a sensitivity analysis in which we vary error rates and distributions while recomputing ESP profiles. The results of this analysis will be summarized to show that the relative ordering of feature sensitivities remains stable under moderate perturbations, thereby strengthening the bridge to practical cleaning priorities.
  Revision: yes
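The stability of the relative ordering mentioned in the last response could be quantified with a rank correlation between profiles computed under two error settings. The sketch below uses Kendall's tau-a as one reasonable choice. The rebuttal does not say which statistic the authors plan to use, and the feature names and scores are hypothetical.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts over the same features.
    Tied pairs count as neither concordant nor discordant."""
    concordant = discordant = 0
    for f, g in combinations(sorted(scores_a), 2):
        s = (scores_a[f] - scores_a[g]) * (scores_b[f] - scores_b[g])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical sensitivity scores at two different error rates.
esp_rate_10 = {"f1": 0.20, "f2": 0.05, "f3": 0.01}
esp_rate_20 = {"f1": 0.31, "f2": 0.04, "f3": 0.09}
print(kendall_tau(esp_rate_10, esp_rate_20))  # 2 concordant, 1 discordant pair: 1/3
```

A value near 1 would indicate that the feature ordering is preserved across settings; a value near 0 or below would indicate that the ordering depends strongly on the injected error model.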
Circularity Check
No circularity in ESP definition or experimental claims
The paper introduces the Error Sensitivity Profile (ESP) as a new metric for quantifying model sensitivity to feature errors and supports its utility via controlled experiments on two datasets with 14 classification models. No equations, derivations, or self-citations appear in the provided text that would reduce any claimed result to its own inputs by construction. The central contribution is an empirical observation about non-predictability from target correlations, which rests on the experimental setup rather than any fitted parameter or self-referential definition. This is a standard non-circular empirical study.