pith. machine review for the scientific record.

arxiv: 2602.09329 · v2 · submitted 2026-02-10 · 💻 cs.LG

Recognition: 2 Lean theorem links

MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 06:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords: tabular outlier detection · benchmark suite · anomaly detection · real-world datasets · synthetic datasets · machine learning evaluation · statistical robustness

The pith

MacrOData supplies 2,446 tabular datasets for statistically robust outlier detection tests

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces MacrOData, a benchmark suite that expands tabular outlier detection evaluation far beyond the 57-dataset ADBench standard. It consists of OddBench with 790 real-world datasets containing semantic anomalies, OvRBench with 856 real-world datasets containing statistical outliers, and SynBench with 800 synthetically generated datasets that cover diverse data distributions and outlier types. The scale and curation enable experiments that compare classical, deep, and foundation-model methods across many hyperparameter settings while providing standardized train/test splits and semantic metadata. These resources support reproducible comparisons and the release of detailed performance tables plus practical guidelines for method selection.
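As a concrete sketch of how a release like this might be consumed, the snippet below loads one benchmark dataset and its standardized splits via the Hugging Face datasets library. The repository id, configuration name, and column names are hypothetical placeholders inferred from the hosting URL, not the paper's documented API; check the actual release for the real layout.

    # Hypothetical sketch: loading one MacrOData dataset with its standardized splits.
    # The repo id, config name, and column names are assumptions, not the documented API.
    from datasets import load_dataset

    ds = load_dataset("MacrOData-CMU/OddBench",   # hypothetical repository id
                      name="oddbench-0001")       # hypothetical per-dataset config

    train, test = ds["train"], ds["test"]         # standardized splits per the paper
    X_train = [row["features"] for row in train]  # hypothetical feature column
    y_test = [row["is_outlier"] for row in test]  # labels are held out on the private partition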

Core claim

The authors establish that a benchmark of 2,446 datasets, split into OddBench for semantic anomalies, OvRBench for statistical outliers, and SynBench for synthetic cases, supplies the diversity and volume needed to conduct statistically robust evaluations of outlier detection methods on tabular data, complete with train/test splits, metadata annotations, and a held-out leaderboard partition.

What carries the argument

MacrOData benchmark suite, which curates and partitions thousands of tabular datasets into standardized splits with semantic annotations to support comprehensive method comparison

If this is right

  • Evaluations gain statistical power from the large number of datasets spanning varied data characteristics (see the paired permutation sketch after this list)
  • Experiments reveal performance differences among classical, deep, and foundation-model methods across hyperparameter ranges
  • Semantic metadata allows analysis of which methods perform best on specific anomaly types
  • Standardized splits and a public leaderboard with held-out labels enable ongoing, reproducible progress tracking
  • Practical guidelines for method choice emerge from the aggregated results across all three benchmark components
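The figures report paired permutation tests on AUPRC across datasets, which is the kind of statistical machinery the first two items point to. Below is a minimal, generic sketch of such a test over per-dataset scores for two methods; it illustrates the idea rather than the paper's exact protocol, and the method names in the usage comment are illustrative.

    # Minimal sketch of a paired permutation test across per-dataset AUPRC scores.
    # Generic illustration; not the paper's exact protocol.
    import numpy as np

    def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
        rng = np.random.default_rng(seed)
        diffs = np.asarray(scores_a) - np.asarray(scores_b)  # per-dataset score gaps
        observed = diffs.mean()
        # Null hypothesis: neither method is better, so each gap's sign is arbitrary.
        signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
        null = (signs * diffs).mean(axis=1)
        # Two-sided p-value: fraction of sign assignments at least as extreme.
        return float((np.abs(null) >= abs(observed)).mean())

    # With 2,446 paired scores the null distribution is well resolved, which is
    # where the claimed statistical power comes from, e.g.:
    # p = paired_permutation_test(auprc_iforest, auprc_lof)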

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The increased scale may push research toward methods that maintain performance across many data priors rather than overfitting to small benchmark sets
  • Future curation efforts could test whether adding time-series or graph-structured tabular data follows similar scaling benefits
  • Practitioners could use the released performance tables to select methods for specific application domains such as fraud or sensor monitoring
  • The availability of public and private partitions may reduce overfitting to leaderboard scores in future method papers

Load-bearing premise

The real-world datasets selected for OddBench and OvrBench contain cleanly identifiable outliers free of systematic selection bias or label noise that would favor some detection methods over others.

What would settle it

Re-auditing labels on a random sample of datasets from OddBench and OvrBench and finding substantial mislabeling rates or selection biases that alter relative method rankings would show the benchmark does not support fair comparisons.
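For scale, a standard binomial sample-size calculation suggests how large such a re-audit would need to be; the margin and confidence values below are illustrative assumptions, not figures from the paper.

    # Back-of-the-envelope audit sizing: how many randomly sampled datasets must be
    # re-labeled to estimate the mislabeling rate within a given margin.
    # Normal-approximation binomial interval; margin and confidence are illustrative.
    from math import ceil
    from statistics import NormalDist

    def audit_sample_size(margin=0.05, confidence=0.95, worst_case_p=0.5):
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        return ceil(z**2 * worst_case_p * (1 - worst_case_p) / margin**2)

    print(audit_sample_size())      # 385 datasets for a +/-5% margin at 95% confidence
    print(audit_sample_size(0.10))  # 97 datasets if a +/-10% margin suffices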

Figures

Figures reproduced from arXiv: 2602.09329 by Haomin Wen, Leman Akoglu, Simon Klüttermann, Xueying Ding, Yilong Chen.

Figure 1. t-SNE embedding of datasets…
Figure 2. Summary statistics of MacrOData vs. ADBench.
Figure 3. Overview of curation steps used to build…
Figure 4. Word clouds of metadata keywords across OddBench datasets showing (a) the semantic anomalies and (b) a broad spectrum of domains covered. The full version is shown in (a), while overly frequent keywords including "anomal", "status", "outlier", "classification", "data", "performance", "failure", "quality", "location", and "error" are dropped in (b).
Figure 6. Word clouds of metadata keywords across OvRBench datasets showing (a) it repurposes classification for OD and (b) a broad spectrum of domains covered. The full version is in (a), while overly frequent keywords "anomal", "classification", "data", "label", "feature", etc. are dropped in (b).
Figure 5. Composition of OvRBench, which exhibits significant data diversity by integrating a wide array of sources.
Figure 8. (left) Running time comparison of OD methods…
Figure 9. Performance variation across HP configurations…
Figure 7. Paired permutation tests on AUPRC across all…
Figure 10. Summary statistics of ADBench datasets.
Figure 13. Random Forest (supervised classification) performance…
Figure 12. Dataset origin (yellow) and situational tags (green)…
Figure 14. (Top-left) We construct an SCM and select a sub…
Figure 15. Paired permutation test results w.r.t. AUPRC across 690 public…
Figure 16. Paired permutation test results w.r.t. AUPRC across 756 public…
Figure 17. Paired permutation test results w.r.t. AUPRC across 800 public…
Figure 18. Paired permutation test results w.r.t. AUROC across 690 public…
Figure 19. Paired permutation test results w.r.t. AUROC across 756 public…
Figure 20. Paired permutation test results w.r.t. AUROC across 800 public…
Figure 21. Paired permutation test results w.r.t. AUROC across overall 2,446 datasets. Foundation models (last three rows)…
Figure 22. Hyperparameter sensitivity as measured by inter-quartile 25%–75% performance; the higher, the more HP-sensitive.
Figure 23. (left) Running time comparison of OD methods across all datasets. Feed-forward-only foundation models (FMs)…
read the original abstract

Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark ADBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvRBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: we provide standardized train/test splits for all datasets, public/private benchmark partitions with held-out test labels for the latter reserved toward an online leaderboard, and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, as well as individual performances as references for future research. All benchmarks, containing 2,446 datasets combined, are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MacrOData, a large-scale benchmark suite for tabular outlier detection comprising OddBench (790 real-world semantic anomaly datasets), OvRBench (856 real-world statistical outlier datasets), and SynBench (800 synthetically generated datasets). It addresses limitations of prior benchmarks such as ADBench by providing standardized train/test splits, public/private partitions with held-out labels, semantic metadata annotations, and extensive experiments evaluating classical, deep, and foundation-model OD methods across diverse hyperparameter settings, along with empirical findings and practical guidelines. All 2,446 datasets are open-sourced with a public leaderboard.

Significance. If the curation and labeling of the real-world components prove reliable, MacrOData would represent a substantial advance by scaling tabular OD evaluation far beyond the 57 datasets in ADBench, enabling statistically robust comparisons, diversity across data priors, and reproducible research via the open release and leaderboard. The provision of standardized splits and metadata is a concrete strength that supports extensibility.

major comments (2)
  1. [§3] §3 (dataset curation): The description of OddBench and OvRBench as 'carefully curated' real-world semantic anomalies and statistical outliers provides no quantitative validation of label quality, such as inter-annotator agreement, source-label consistency checks, or audits for selection bias and label noise. This is load-bearing because all downstream empirical comparisons, performance rankings, and practical guidelines rest on the assumption that the ground-truth outliers are clean and representative.
  2. [§4] §4 (experimental protocol): The paper reports results over 'diverse hyperparameter configurations' for a broad range of OD methods, but does not specify the exact search spaces, number of trials per method, or controls for computational budget; without these details it is unclear whether the reported rankings are robust to hyperparameter choice or could shift under different tuning regimes.
minor comments (2)
  1. [Abstract] The abstract states 2,446 datasets in total; confirm whether this figure accounts for any overlaps or deduplication steps between OddBench, OvRBench, and SynBench.
  2. Figure captions and table headers should explicitly state the number of datasets and methods included in each panel to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and for recognizing the potential of MacrOData to advance tabular outlier detection research. We address each major comment below and outline the corresponding revisions.

read point-by-point responses
  1. Referee: [§3] §3 (dataset curation): The description of OddBench and OvRBench as 'carefully curated' real-world semantic anomalies and statistical outliers provides no quantitative validation of label quality, such as inter-annotator agreement, source-label consistency checks, or audits for selection bias and label noise. This is load-bearing because all downstream empirical comparisons, performance rankings, and practical guidelines rest on the assumption that the ground-truth outliers are clean and representative.

    Authors: We agree that quantitative validation of label quality is essential. The curation process for OddBench and OvRBench selected datasets exclusively from public repositories where outlier labels were supplied by the original authors, followed by manual verification against dataset documentation to confirm consistency. Inter-annotator agreement metrics are not applicable because we did not introduce new labels. In the revised manuscript we will expand §3 with an explicit description of the source-label consistency checks performed, any detectable label noise rates, and a dedicated discussion of selection biases together with the mitigation steps taken during dataset inclusion. These additions will be supported by summary statistics on label provenance where available. revision: partial

  2. Referee: [§4] §4 (experimental protocol): The paper reports results over 'diverse hyperparameter configurations' for a broad range of OD methods, but does not specify the exact search spaces, number of trials per method, or controls for computational budget; without these details it is unclear whether the reported rankings are robust to hyperparameter choice or could shift under different tuning regimes.

    Authors: We agree that precise specification of the hyperparameter protocol is required for assessing result robustness. The experiments employed random search over method-specific ranges with a fixed trial budget per method to maintain computational tractability across 2,446 datasets. In the revised manuscript we will add a dedicated subsection (and supplementary tables) that enumerate the exact search spaces for every method, the number of trials executed (50 per method), and the computational budget controls applied (maximum wall-clock time and memory limits per run). This information will allow readers to evaluate sensitivity to alternative tuning regimes. revision: yes
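For concreteness, here is a minimal sketch of the tuning protocol the response describes: random search over a method-specific space with a fixed 50-trial budget and a per-run wall-clock cap. The search space, the evaluation callback, and the timeout handling are illustrative assumptions; only the random-search-with-fixed-budget shape comes from the rebuttal.

    # Sketch of random search with a fixed trial budget and a wall-clock cap per run.
    # The 50-trial budget mirrors the rebuttal; the space and method are assumptions.
    import random, time

    SEARCH_SPACE = {                    # hypothetical space for an iForest-like method
        "n_estimators": [50, 100, 200, 500],
        "max_samples": [0.25, 0.5, 1.0],
        "contamination": [0.01, 0.05, 0.1],
    }

    def random_search(evaluate, n_trials=50, max_seconds=600, seed=0):
        rng = random.Random(seed)
        best_score, best_cfg = float("-inf"), None
        for _ in range(n_trials):
            cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
            start = time.monotonic()
            score = evaluate(cfg)                 # e.g., AUPRC on a validation split
            if time.monotonic() - start > max_seconds:
                continue                          # discard runs that exceed the budget
            if score > best_score:
                best_score, best_cfg = score, cfg
        return best_score, best_cfg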

Circularity Check

0 steps flagged

No circularity in benchmark dataset release

full rationale

The paper introduces MacrOData as a collection of 2,446 curated datasets (OddBench, OvRBench, SynBench) plus evaluation protocols and a leaderboard. No derivation chain, equations, or predictions exist that could reduce to inputs by construction. Curation relies on external data sources and standard splits rather than fitted parameters or self-referential definitions. Self-citations, if present, are not load-bearing for any core claim. The contribution is an empirical resource release and comparative evaluation, which is self-contained against external benchmarks and contains no self-definitional, fitted-input, or uniqueness-imported steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central contribution rests on curation decisions rather than mathematical axioms; the main free parameters are the inclusion thresholds and labeling rules used to assemble the real-world collections.

free parameters (1)
  • Dataset inclusion and labeling criteria
    Choices of which tables qualify as containing semantic anomalies or statistical outliers and how ground-truth labels are assigned.
axioms (1)
  • domain assumption: Real tabular datasets contain identifiable outliers that can be labeled consistently
    The benchmark construction assumes that semantic and statistical outliers can be reliably identified and separated in the collected data.

pith-pipeline@v0.9.0 · 5594 in / 1319 out tokens · 47332 ms · 2026-05-16T06:06:18.970447+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvRBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes."

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem:

    "We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations."

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 4 internal anchors

  1. [1] Charu C. Aggarwal. 2013. Outlier Analysis. Springer.

  2. [2] Charu C. Aggarwal and Saket Sathe. 2015. Theoretical foundations and algorithms for outlier ensembles. ACM SIGKDD Explorations Newsletter 17, 1 (2015), 24–47.

  3. [3] Leman Akoglu. 2021. Anomaly mining: Past, present and future. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 1–2.

  4. [4] Liron Bergman and Yedid Hoshen. 2020. Classification-Based Anomaly Detection for General Data. In International Conference on Learning Representations.

  5. [5] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. 2017. OpenML benchmarking suites. arXiv preprint arXiv:1708.03731 (2017).

  6. [6] Stephan Bongers, Patrick Forré, Jonas Peters, and Joris M. Mooij. 2021. Foundations of structural causal models with cycles and latent variables. The Annals of Statistics 49, 5 (2021), 2885–2915.

  7. [7] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (Oct. 2001), 5–32. doi:10.1023/A:1010933404324.

  8. [8] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00). Association for Computing Machinery, New York, NY, USA, 93–104. doi:10.1145/342009.335388.

  9. [9] Guilherme O. Campos, Arthur Zimek, Jörg Sander, Ricardo J. G. B. Campello, Barbora Micenková, Erich Schubert, Ira Assent, and Michael E. Houle. 2016. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Mining and Knowledge Discovery 30 (2016), 891–927.

  10. [10] Miguel A. Carreira-Perpinan. 2002. Mode-finding for mixtures of Gaussian distributions. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 11 (2002), 1318–1323.

  11. [11] Raghavendra Chalapathy and Sanjay Chawla. 2019. Deep learning for anomaly detection: A survey. arXiv preprint arXiv:1901.03407 (2019).

  12. [12] Aaron Clauset, Cosma Rohilla Shalizi, and Mark E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Review 51, 4 (2009), 661–703.

  13. [13] Xueying Ding, Haomin Wen, Simon Klüttermann, and Leman Akoglu. 2026. From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection. arXiv preprint arXiv:2602.03018 (2026).

  14. [14] Xueying Ding, Lingxiao Zhao, and Leman Akoglu. 2022. Hyperparameter sensitivity in deep outlier detection: Analysis and a scalable hyper-ensemble solution. Advances in Neural Information Processing Systems 35 (2022), 9603–9616.

  15. [15] Xueying Ding, Yue Zhao, and Leman Akoglu. 2024. Fast Unsupervised Deep Outlier Model Selection with Hypernetworks. ACM SIGKDD (2024).

  16. [16] Rémi Domingues, Maurizio Filippone, Pietro Michiardi, and Jihane Zouaoui. 2018. A comparative evaluation of outlier detection algorithms: Experiments and analyses. Pattern Recognition 74 (2018), 406–421.

  17. [17] Gus Eggert, Kevin Huo, Mike Biven, and Justin Waugh. 2023. TabLib: A dataset of 627M tables with context. arXiv preprint arXiv:2310.07875 (2023).

  18. [18] Arpad E. Elo. 1967. The Proposed USCF Rating System, Its Development, Theory, and Applications. Chess Life 22 (1967), 242–247.

  19. [19] Andrew Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. 2015. A meta-analysis of the anomaly detection problem. arXiv preprint arXiv:1503.01158 (2015).

  20. [20] Andrew F. Emmott, Shubhomoy Das, Thomas Dietterich, Alan Fern, and Weng-Keen Wong. 2013. Systematic construction of anomaly detection benchmarks from real data. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description. 16–21.

  21. [21] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. 2025. TabArena: A living benchmark for machine learning on tabular data. arXiv preprint arXiv:2506.16791 (2025).

  22. [22] Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, and Frank Hutter. 2025. Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data. arXiv:2507.03971 [cs.LG]. https://arxiv.org/abs/2507.03971.

  23. [23] Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An open source AutoML benchmark. arXiv preprint arXiv:1907.00909 (2019).

  24. [24] Michael Glodek, Martin Schels, and Friedhelm Schwenker. 2013. Ensemble Gaussian mixture models for probability density estimation. Computational Statistics 28, 1 (Feb. 2013), 127–138. doi:10.1007/s00180-012-0374-5.

  25. [25] Markus Goldstein and Seiichi Uchida. 2016. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11, 4 (2016).

  26. [26] Songqiao Han, Xiyang Hu, Hailiang Huang, Minqi Jiang, and Yue Zhao. 2022. ADBench: Anomaly detection benchmark. Advances in Neural Information Processing Systems 35 (2022).

  27. [27] Zengyou He, Xiaofei Xu, and Shengchun Deng. 2003. Discovering Cluster-Based Local Outliers. Pattern Recognition Letters 24, 9–10 (June 2003), 1641–1650.

  28. [28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.

  29. [29] Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2 (2004), 85–126.

  30. [30] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. 2023. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In The Eleventh International Conference on Learning Representations.

  31. [31] Regis Houssou, Mihai-Cezar Augustin, Efstratios Rappos, Vivien Bonvin, and Stephan Robert-Nicoud. 2022. Generation and simulation of synthetic datasets with copulas. arXiv preprint arXiv:2203.17250 (2022).

  32. [32] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining. 413–422.

  33. [33] Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and Han-Jia Ye. 2024. TALENT: A tabular analytics and learning toolbox. arXiv preprint arXiv:2407.04057 (2024).

  34. [34] Victor Livernoche, Vineet Jain, Yashar Hezaveh, and Siamak Ravanbakhsh. 2024. On Diffusion Modeling for Anomaly Detection. In ICLR.

  35. [35] Martin Q. Ma, Yue Zhao, Xiaorong Zhang, and Leman Akoglu. 2023. The need for unsupervised outlier model selection: A review and evaluation of internal evaluation strategies. ACM SIGKDD Explorations Newsletter 25, 1 (2023), 19–35.

  36. [36] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. 2023. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems 36 (2023), 76336–76369.

  37. [37] Roger B. Nelsen. 2006. An Introduction to Copulas. Springer.

  38. [38] Guansong Pang, Chunhua Shen, Longbing Cao, and Anton Van Den Hengel. 2021. Deep learning for anomaly detection: A review. ACM Computing Surveys (CSUR) 54, 2 (2021), 1–38. https://github.com/mala-lab/ADBenchmarks-anomaly-detection-datasets.

  39. [39] Guansong Pang, Chunhua Shen, and Anton van den Hengel. 2019. Deep Anomaly Detection with Deviation Networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Association for Computing Machinery, New York, NY, USA, 353–362. doi:10.1145/3292500.3330871.

  40. [40] Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press.

  41. [41] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim. 2000. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, New York, NY, USA, 427–438. doi:10.1145/342009.335437.

  42. [42] Shebuti Rayana and Leman Akoglu. 2016. ODDS Library: Outlier Detection DataSets. https://shebuti.com/outlier-detection-datasets-odds/.

  43. [43] Shebuti Rayana, Wen Zhong, and Leman Akoglu. 2016. Sequential ensemble learning for outlier detection: A bias-variance perspective. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 1167–1172.

  44. [44] Philipp Röchner, Simon Klüttermann, Franz Rothlauf, and Daniel Schlör. 2025. We Need to Rethink Benchmarking in Anomaly Detection. arXiv:2507.15584 [cs.LG]. https://arxiv.org/abs/2507.15584.

  45. [45] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep One-Class Classification. In International Conference on Machine Learning. 4393–4402.

  46. [46] David Salinas and Nick Erickson. 2024. TabRepo: A Large Scale Repository of Tabular Model Evaluations and its AutoML Applications. In International Conference on Automated Machine Learning. PMLR, 19–1.

  47. [47] Carl-Erik Särndal, Bengt Swensson, and Jan Wretman. 1992. Model Assisted Survey Sampling. Springer-Verlag. doi:10.1007/978-1-4612-4378-6.

  48. [48] Bernhard Schölkopf. 2022. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl. 765–804.

  49. [49] Bernhard Schölkopf, Robert C. Williamson, Alex Smola, John Shawe-Taylor, and John Platt. 1999. Support Vector Method for Novelty Detection. In Advances in Neural Information Processing Systems, S. Solla, T. Leen, and K. Müller (Eds.), Vol. 12. MIT Press. https://proceedings.neurips.cc/paper_files/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf.

  50. [50] Yuchen Shen, Haomin Wen, and Leman Akoglu. 2025. FoMo-0D: A Foundation Model for Zero-shot Tabular Outlier Detection. Transactions on Machine Learning Research (2025). https://openreview.net/forum?id=XCQzwpR9jE.

  51. [51] Tom Shenkar and Lior Wolf. 2022. Anomaly Detection for Tabular Data with Internal Contrastive Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=_hszZbt46bT.

  52. [52] M. Sklar. 1959. Fonctions de répartition à n dimensions et leurs marges. In Annales de l'ISUP, Vol. 8. 229–231.

  53. [53] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning. PMLR, 2256–2265.

  54. [54] Georg Steinbuss and Klemens Böhm. 2021. Benchmarking unsupervised outlier detection with realistic synthetic data. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 4 (2021), 1–20.

  55. [55] Hugo Thimonier, Fabrice Popineau, Arpad Rimmel, and Bich-Liên Doan. 2024. Beyond Individual Input for Deep Anomaly Detection on Tabular Data. In Proceedings of the 41st International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 235).

  56. [56] Kai Ming Ting, Sunil Aryal, and Takashi Washio. 2018. Which Outlier Detector Should I Use? In 2018 IEEE International Conference on Data Mining (ICDM). 8–8. doi:10.1109/ICDM.2018.00015.

  57. [57] Joaquin Vanschoren. 2018. Meta-Learning: A Survey. In Automated Machine Learning. https://api.semanticscholar.org/CorpusID:52938664.

  58. [58] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  59. [59] Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. 2024. Generalized Out-of-Distribution Detection: A Survey. International Journal of Computer Vision 132, 12 (Dec. 2024), 5635–5662. https://doi.org/10.1007/s11263-024-02117-4.

  60. [60] Jaemin Yoo, Tiancheng Zhao, and Leman Akoglu. 2023. Data Augmentation is a Hyperparameter: Cherry-picked Self-Supervision for Unsupervised Anomaly Detection is Creating the Illusion of Success. Transactions on Machine Learning Research 2023 (2023).

  61. [61] Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, and Bernie Wang. Mitra: Mixed Synthetic Priors for Enhancing Tabular Foundation Models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  62. [62] Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, Li Mao, Mingchao Hao, et al. 2025. LimiX: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505 (2025).

  63. [63] Yue Zhao, Ryan Rossi, and Leman Akoglu. 2021. Automatic unsupervised outlier model selection. Advances in Neural Information Processing Systems 34 (2021), 4489–4502.

  64. [64] Yue Zhao, Sean Zhang, and Leman Akoglu. 2022. Toward unsupervised outlier model selection. In 2022 IEEE International Conference on Data Mining (ICDM). IEEE, 773–782.