Measuring the Sensitivity of Classification Models with the Error Sensitivity Profile

Andrea Maurino

arXiv: 2604.25765 · v1 · submitted 2026-04-28 · cs.LG · cs.AI

Pith reviewed 2026-05-07 16:28 UTC · model grok-4.3

keywords: Error Sensitivity Profile · data cleaning · classification models · feature errors · model performance · data quality · error sensitivity

The pith

The Error Sensitivity Profile quantifies how errors in specific features degrade classification model performance to guide targeted data cleaning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Error Sensitivity Profile (ESP), a metric that measures how sensitive a classification model's performance is to errors in one feature or in several features at once. The profile lets practitioners rank features and error types by their actual impact on performance instead of treating all data issues alike. Experiments on two widely used datasets with fourteen classification models show that the resulting performance drops are not always predictable from simple correlation with the target variable. The work also supplies a software suite to compute the profile in practice.

Core claim

The Error Sensitivity Profile (ESP) is proposed as a metric that quantifies the sensitivity of model performance to errors in a single feature or in multiple features. An integrated suite of tools called Dirtify is created to support its computation. Extensive experiments on two widely used datasets using 14 classification models show that performance degradation is not always predictable from simple correlations with the target variable.

What carries the argument

The Error Sensitivity Profile (ESP), a metric that measures the change in model performance when controlled errors are introduced into one or more features.
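The page does not reproduce the paper's formula for ESP, so the following is only a minimal sketch of the general idea: inject a controlled error into one feature of the training data at several corruption levels, retrain, and record the drop in a performance measure relative to the clean baseline. Everything here is an assumption made for illustration: the toy data, the nearest-centroid classifier, the outlier error type, the corruption levels, and all function names. None of it is the paper's implementation or the Dirtify API.

```python
import random

def make_data(n, seed):
    """Toy data: feature 0 separates the classes, feature 1 is pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """Nearest-centroid 'training': one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Return the label of the closest class centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)

def inject_outliers(X, feature, rate, rng):
    """Hypothetical error type: overwrite a fraction `rate` of one feature with +/-50."""
    corrupted = [row[:] for row in X]
    n_bad = int(round(rate * len(X)))
    for i in rng.sample(range(len(X)), n_bad):
        corrupted[i][feature] = rng.choice([-1, 1]) * 50.0
    return corrupted

def sensitivity_profile(X_train, y_train, X_test, y_test, feature, levels, seed=0):
    """Accuracy drop (clean baseline minus corrupted) for each corruption level."""
    baseline = accuracy(fit_centroids(X_train, y_train), X_test, y_test)
    rng = random.Random(seed)
    profile = []
    for rate in levels:
        dirty_train = inject_outliers(X_train, feature, rate, rng)
        acc = accuracy(fit_centroids(dirty_train, y_train), X_test, y_test)
        profile.append((rate, baseline - acc))
    return profile

# Usage: profile feature 0 at five illustrative corruption levels.
X_train, y_train = make_data(400, seed=1)
X_test, y_test = make_data(200, seed=2)
levels = [0.0, 0.1, 0.2, 0.3, 0.4]
profile = sensitivity_profile(X_train, y_train, X_test, y_test, feature=0, levels=levels)
for rate, drop in profile:
    print(f"rate={rate:.1f}  accuracy drop={drop:+.3f}")
```

Corrupting the training set and scoring on a clean test set is a design choice of this sketch; it keeps the measured drop attributable to what the model learned from dirty data rather than to noise in the evaluation inputs.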

Load-bearing premise

That the sensitivity rankings produced by ESP on the two tested datasets will generalize to other datasets and real-world data-cleaning decisions.

What would settle it

Running the same error-injection experiments on a third dataset and finding that cleaning low-ESP features improves accuracy more than cleaning high-ESP features would falsify the claim that ESP reliably identifies the most damaging errors.

Figures

Figures reproduced from arXiv: 2604.25765 by Andrea Maurino.

**Figure 1.** Canonical representation of ESP. It should be noted that, given the five corruption levels used in this study, slope estimates in short monotonic regions may be computed on as few as two observations, yielding exact fits with no residual degrees of freedom; such estimates should therefore be interpreted as directional indicators rather than statistically reliable regression coefficients, and increasing the…

**Figure 2.** Relevant scenario where AEPC is positive.

**Figure 3.** Relevant scenario where AEPC is negative. […] moderately lower prevalence of the effect, though a formal verification is left to future work. 6 Conclusion: It is widely recognized that the quality of training data is crucial to the success of machine learning models. In this paper, we introduce the Error Sensitivity Profile and present Dirtify, an all-encompassing tool suite for systematically evaluating the imp…
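The text captured with Figure 1 notes that a slope fitted on only two observations is an exact fit with no residual degrees of freedom. A small sketch of why, using an ordinary least-squares line; the function name and the accuracy values are hypothetical and chosen only to illustrate the arithmetic:

```python
def ols_line(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b, sse, residual_dof)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Two parameters are estimated, so n - 2 degrees of freedom remain.
    return intercept, slope, sse, n - 2

# Two corruption levels: the line passes through both points exactly.
a, b, sse, dof = ols_line([0.1, 0.2], [0.90, 0.85])

# Five corruption levels: three residual degrees of freedom remain.
a5, b5, sse5, dof5 = ols_line([0.1, 0.2, 0.3, 0.4, 0.5],
                              [0.90, 0.85, 0.86, 0.70, 0.65])
```

With `dof == 0` there is nothing left over to estimate the noise around the line, so no standard error or confidence interval can be attached to the slope. That is why such a slope is best read as a direction only.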

Original abstract

The quality of training data is critical to the performance of machine learning models. In this paper, the Error Sensitivity Profile (ESP) is proposed. It quantifies the sensitivity of model performance to errors in a single feature or in multiple features. By leveraging ESP, data-cleaning efforts can be prioritized based on error types and features most likely to affect model performance. To support the computation of this metric, an integrated suite of tools, called Dirtify, is created. We conduct an extensive experimental study on two widely used datasets using 14 classification models, revealing that performance degradation is not always predictable from simple correlations with the target variable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ESP gives a concrete way to rank feature-error pairs by their impact on classifier performance, but the supporting experiments stay narrow.

The paper's main contribution is the Error Sensitivity Profile, a metric that scores how much a model's accuracy or other performance measures degrade when errors hit one feature or a set of features. It comes with the Dirtify tool suite to compute the profile and runs the idea on two standard datasets across 14 classifiers. The key observation is that the features whose errors hurt performance most are not always the ones with the strongest correlation to the target, which is a practical point for anyone deciding what data to clean first.
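A contrived toy case shows how sensitivity can diverge from target correlation. It is not drawn from the paper's experiments, and it corrupts inputs to a fixed decision rule rather than retraining a model. The target is the XOR of two binary features `a` and `b`, so each has zero linear correlation with the target, while a third feature `c` is a noisy copy of the label that the rule never reads. All names and rates are hypothetical.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Balanced toy data: every (a, b) combination appears equally often.
rows = [(a, b) for a in (0, 1) for b in (0, 1)] * 250
labels = [a ^ b for a, b in rows]          # target is XOR of a and b
rng = random.Random(0)
# c agrees with the label about 90% of the time but is unused by the rule.
c = [y if rng.random() < 0.9 else 1 - y for y in labels]

def rule(a, b, c_value):
    """Fixed decision rule standing in for a trained model; it ignores c."""
    return a ^ b

def accuracy_with_flips(feature, rate, seed=1):
    """Accuracy after flipping `feature` in a fraction `rate` of the rows."""
    flip_rng = random.Random(seed)
    n_flip = int(round(rate * len(rows)))
    flipped = set(flip_rng.sample(range(len(rows)), n_flip))
    hits = 0
    for i, ((a, b), c_value, y) in enumerate(zip(rows, c, labels)):
        if i in flipped:
            if feature == "a":
                a = 1 - a
            elif feature == "c":
                c_value = 1 - c_value
        hits += rule(a, b, c_value) == y
    return hits / len(rows)

corr_a = pearson([a for a, _ in rows], labels)   # zero by construction
corr_c = pearson(c, labels)                      # strongly positive
drop_a = 1.0 - accuracy_with_flips("a", 0.3)     # every flipped row is misclassified
drop_c = 1.0 - accuracy_with_flips("c", 0.3)     # the rule never reads c
print(f"a: corr={corr_a:+.2f} drop={drop_a:.2f} | c: corr={corr_c:+.2f} drop={drop_c:.2f}")
```

Ranking by correlation would put `c` first and `a` last; ranking by measured degradation reverses that order. Real models and datasets are far less clean-cut, which is the reason to measure the effect rather than infer it.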

Referee Report

2 major / 2 minor

Summary. The paper proposes the Error Sensitivity Profile (ESP) metric to quantify how classification model performance degrades in response to errors in one or more input features. It introduces an accompanying software suite called Dirtify to compute the metric and reports an experimental study on two datasets using 14 classification models. The central empirical finding is that performance degradation is not always predictable from simple feature-target correlations, implying that ESP can be used to prioritize data-cleaning efforts toward the most impactful error types and features.

Significance. If the ESP metric and its associated findings hold under broader validation, the work could meaningfully improve data-quality workflows in machine learning by providing a systematic way to rank error sources by their effect on downstream performance. The release of the Dirtify tool suite is a concrete strength that supports reproducibility and immediate practical use. The observation that simple correlations fail to predict degradation challenges a common heuristic in data preprocessing and could shift how practitioners allocate cleaning resources.

major comments (2)

[Experimental study] Experimental study (two datasets, 14 models): the claim that ESP enables general prioritization of data-cleaning efforts rests on the observation that degradation is not predictable from target correlations. With validation confined to only two datasets, this non-predictability may be an artifact of the specific data distributions or model families tested; the manuscript should either expand the experimental suite or explicitly bound the scope of the prioritization guidance.
[ESP definition] ESP definition and error-injection procedure: the utility of ESP for real-world cleaning prioritization requires that the synthetic error model used to construct the profile corresponds to plausible data-quality issues. The manuscript should supply a clearer justification or sensitivity analysis showing that the injected error distributions align with observed real-world error patterns; otherwise the leap from controlled injections to actionable cleaning priorities remains under-supported.

minor comments (2)

[Abstract] The abstract states the proposal and key finding but omits any quantitative detail (e.g., number of error rates tested, magnitude of observed effects, or statistical significance), reducing its value as a standalone summary.
[Metric definition] Notation for the multi-feature ESP extension should be introduced with an explicit equation or pseudocode to avoid ambiguity when readers compare single-feature versus joint-feature sensitivity.
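To make the second minor comment concrete, here is one possible way the single-feature and joint-feature quantities could be written down. This is hypothetical notation offered as an illustration of what an explicit equation might look like; it is not taken from the paper, and the symbols $D$, $A$, $M$, $\operatorname{inj}$, and $I$ are assumptions.

```latex
% Hypothetical notation, not the paper's definition (requires amsmath).
% D: clean training set;  A: learning algorithm;  M: performance metric.
% \operatorname{inj}_{e,p}^{S}(D): D with error type e injected at rate p
% into the feature set S.
\[
  \Delta_{S,e}(p) \;=\; M\bigl(A(D)\bigr) \;-\; M\bigl(A(\operatorname{inj}_{e,p}^{S}(D))\bigr),
  \qquad p \in \{p_1 < p_2 < \dots < p_K\}
\]
% Single-feature case: S = \{j\}.  Joint case: S = \{i, j\}.
% One possible interaction term, zero when the two effects are additive:
\[
  I_{ij,e}(p) \;=\; \Delta_{\{i,j\},e}(p) \;-\; \Delta_{\{i\},e}(p) \;-\; \Delta_{\{j\},e}(p)
\]
```

Writing the joint case as the same expression with a larger feature set $S$ lets a reader compare single and joint sensitivity on the same scale, and the interaction term isolates what the joint injection adds beyond the two separate ones.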

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and applicability of our work. We address each major point below and indicate the planned revisions.

Referee: [Experimental study] Experimental study (two datasets, 14 models): the claim that ESP enables general prioritization of data-cleaning efforts rests on the observation that degradation is not predictable from target correlations. With validation confined to only two datasets, this non-predictability may be an artifact of the specific data distributions or model families tested; the manuscript should either expand the experimental suite or explicitly bound the scope of the prioritization guidance.

Authors: We agree that the experimental scope is limited to two datasets and that this constrains the generality of prioritization guidance. Rather than expanding the suite at this stage, we will explicitly bound our claims in the revised manuscript by adding a dedicated limitations paragraph stating that the observed non-predictability from target correlations holds for the examined datasets and model families, and that broader validation across additional domains would be required before treating ESP as a universal prioritization tool. This directly follows the referee's suggested alternative. Revision: yes.

Referee: [ESP definition] ESP definition and error-injection procedure: the utility of ESP for real-world cleaning prioritization requires that the synthetic error model used to construct the profile corresponds to plausible data-quality issues. The manuscript should supply a clearer justification or sensitivity analysis showing that the injected error distributions align with observed real-world error patterns; otherwise the leap from controlled injections to actionable cleaning priorities remains under-supported.

Authors: We acknowledge that the manuscript currently describes the error-injection procedure without sufficient linkage to real-world patterns. In revision we will add a new subsection under the ESP definition that (i) cites established data-quality literature on common error types (e.g., attribute noise, missingness, and label errors) and (ii) reports a sensitivity analysis in which we vary error rates and distributions while recomputing ESP profiles. The results of this analysis will be summarized to show that the relative ordering of feature sensitivities remains stable under moderate perturbations, thereby strengthening the bridge to practical cleaning priorities. Revision: yes.
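The promised stability of the "relative ordering of feature sensitivities" can be checked with a rank-agreement statistic. The sketch below uses Kendall's tau, which is an assumption of this example (the rebuttal does not name a statistic), and the feature names and sensitivity scores are invented for illustration.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by feature name.

    Pairs tied in either dict count as neither concordant nor discordant.
    """
    features = sorted(scores_a)
    concordant = discordant = 0
    for f, g in combinations(features, 2):
        sign = (scores_a[f] - scores_a[g]) * (scores_b[f] - scores_b[g])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(features) * (len(features) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical sensitivity scores under the baseline error model...
baseline = {"age": 0.12, "income": 0.30, "visits": 0.05, "region": 0.01}
# ...and under two perturbed error models.
same_order = {"age": 0.10, "income": 0.28, "visits": 0.07, "region": 0.02}
one_swap = {"age": 0.04, "income": 0.28, "visits": 0.07, "region": 0.02}

tau_same = kendall_tau(baseline, same_order)   # identical ordering
tau_swap = kendall_tau(baseline, one_swap)     # "age" and "visits" trade places
```

A tau of 1 means the ordering is unchanged, and each swapped pair lowers it. Reporting the statistic across the perturbed error models would give the stability claim a number that readers can inspect.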

Circularity Check

0 steps flagged

No circularity in ESP definition or experimental claims

full rationale

The paper introduces the Error Sensitivity Profile (ESP) as a new metric for quantifying model sensitivity to feature errors and supports its utility via controlled experiments on two datasets with 14 classification models. No equations, derivations, or self-citations appear in the provided text that would reduce any claimed result to its own inputs by construction. The central contribution is an empirical observation about non-predictability from target correlations, which rests on the experimental setup rather than any fitted parameter or self-referential definition. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable.

pith-pipeline@v0.9.0 · 5389 in / 924 out tokens · 42226 ms · 2026-05-07T16:28:27.112503+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 2 canonical work pages · 1 internal anchor

[1] Adnan, F.A., et al.: A review of the current publication trends on missing data imputation over three decades: direction and future research. Neural Comput. Appl. 34(21), 18325–18340 (Nov 2022)

[2] Agostini, A., Sphaiu, B., Maurino, A.: Pucktrick: A library for making synthetic data more realistic. In: SEBD (2025), https://arxiv.org/abs/2506.18499

[3] Ansari, S., et al.: Impact of outliers on regression and classification models: An empirical analysis. In: DeSE. pp. 211–218 (2024)

[4] Arocena, P.C., et al.: Messing up with bart: error generation for evaluating data-cleaning algorithms. Proc. VLDB Endow. 9(2), 36–47 (Oct 2015)

[5] Benjamini, Y., Yekutieli, D.: The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29(4), 1165–1188 (2001)

[6] Dix, M., et al.: Measuring the robustness of ML models against data quality issues in industrial time series data. In: INDIN. pp. 1–8. IEEE (2023)

[7] Frenay, B., Verleysen, M.: Classification in the presence of label noise: A survey. IEEE Trans. on Neural Networks and Learning Systems 25(5), 845–869 (2014)

[8] Li, P., et al.: Cleanml: A benchmark for joint data cleaning and machine learning [experiments and analysis]. arXiv preprint arXiv:1904.09483 75 (2019)

[9] Mohammed, S., et al.: The effects of data quality on machine learning performance on tabular data. Information Systems 132, 102549 (2025)

[10] Qi, Z., et al.: Impacts of Dirty Data on Classification and Clustering Models, pp. 7–37. Springer Nature Singapore (2024)

[11] Sakar, C., Kastro, Y.: Online shoppers purchasing intention dataset. UCI Machine Learning Repository (2018), licensed under CC BY 4.0

[12] Schelter, S., Rukat, T., Biessmann, F.: Jenga: a framework to study the impact of data errors on the predictions of machine learning models (2021)

[13] Shah, V., et al.: How do categorical duplicates affect ml? a new benchmark and empirical analyses. Proc. VLDB Endow. 17(6), 1391–1404 (2024)
