We Need to Rethink Benchmarking in Anomaly Detection

Daniel Schlör; Emmanuel Müller; Franz Rothlauf; Kevin Kammler; Philipp Röchner; Simon Klüttermann

arxiv: 2507.15584 · v2 · pith:DAZZRDN5new · submitted 2025-07-21 · 💻 cs.LG


keywords anomaly, detection, algorithms, applications, benchmarking, benchmarks, despite, need

Despite the continuous proposal of new anomaly detection algorithms and extensive benchmarking efforts, progress seems to stagnate, with only minor performance differences between established baselines and new algorithms. In this position paper, we argue that this stagnation is due to limitations in how we evaluate anomaly detection algorithms. In current benchmarks, a trivial algorithm that only checks for extreme values in individual features performs competitively with state-of-the-art deep learning methods, despite failing on simple cases such as anomalies within an annulus of normal points. Moreover, existing benchmarks do not adequately reflect the diversity of anomaly detection applications, making it difficult for practitioners to reliably select algorithms for their applications. Consequently, we need to rethink benchmarking in anomaly detection. In our opinion, anomaly detection should be studied using scenarios that group applications sharing relevant characteristics, defined through a common taxonomy. Benchmarking within scenarios enables scenario-specific choices for preprocessing, metrics, and model selection, clarifying which advances transfer across similar applications and providing practitioners with reliable guidance for their specific contexts.
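The annulus failure case mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact form of the trivial algorithm, so the code below assumes one plausible reading: score each point by its largest absolute per-feature z-score. The function names `fit_feature_stats` and `extreme_value_score`, the ring radii, and the sample size are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np


def fit_feature_stats(X_train):
    """Per-feature mean and standard deviation (assumed baseline, not the paper's code)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    return mean, std


def extreme_value_score(X, mean, std):
    """Anomaly score: largest absolute z-score over the individual features."""
    return np.abs((X - mean) / std).max(axis=1)


rng = np.random.default_rng(0)

# Normal data: points on a ring (annulus) around the origin.
angles = rng.uniform(0.0, 2.0 * np.pi, 1000)
radii = rng.uniform(4.5, 5.5, 1000)
normal = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

mean, std = fit_feature_stats(normal)
normal_scores = extreme_value_score(normal, mean, std)

# An anomaly in the hole of the ring has no extreme value in any single feature.
center_score = extreme_value_score(np.array([[0.0, 0.0]]), mean, std)[0]

# An anomaly far outside the ring is extreme in one feature.
far_score = extreme_value_score(np.array([[20.0, 0.0]]), mean, std)[0]

print("center anomaly scores below every normal point:", center_score < normal_scores.min())
print("far anomaly scores above every normal point:", far_score > normal_scores.max())
```

Under these assumptions, the point in the center of the ring receives a lower score than every normal point, because each of its coordinates sits near the per-feature mean. A detector of this kind only flags anomalies that are extreme in at least one individual feature.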

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
cs.LG 2026-02 accept novelty 8.0

MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.

Evaluating Tabular Representation Learning for Network Intrusion Detection
cs.LG 2026-05 unverdicted novelty 5.0

Tabular representation learning for network intrusion detection exhibits strong dataset-model dependency, with supervised methods outperforming unsupervised anomaly detection and limited but possible cross-dataset gen...