Collective Outlier Detection and Enumeration with Conformalized Closed Testing
Pith reviewed 2026-05-24 07:33 UTC · model grok-4.3
The pith
A distribution-free method detects and counts collective outliers by automatically selecting the best classifier and test for the data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that conformalized closed testing, paired with a data-driven rule for selecting the classifier and two-sample test, yields a flexible procedure that detects the presence of outliers and enumerates them with valid error control without assuming a specific data distribution.
What carries the argument
The automatic selection rule inside a conformalized closed testing procedure that produces valid p-values for outlier counts.
If this is right
- The method produces an estimate of the total number of outliers rather than only a detection decision.
- Error control holds after the classifier and test are chosen from the data.
- It applies to settings with sparse or weak outlier signals, as shown on the LHCO particle collision data.
- It integrates conformal p-values with classical multiple-testing ideas for enumeration.
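The enumeration idea in the points above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes scores where larger means more outlying, and it swaps in the Simes test as the local test inside closed testing, whereas the paper selects the local test from the data. The function names `conformal_pvalues`, `simes_rejects`, and `outlier_count_lower_bound` are hypothetical.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Split-conformal p-values; larger score = more outlying (assumption)."""
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    n = cal.size
    # Number of calibration scores that are >= each test score.
    n_ge = n - np.searchsorted(cal, test, side="left")
    return (1.0 + n_ge) / (n + 1.0)

def simes_rejects(pvals, alpha):
    """Simes test of the intersection null for the given p-values."""
    p = np.sort(pvals)
    k = p.size
    return bool(np.any(p <= alpha * np.arange(1, k + 1) / k))

def outlier_count_lower_bound(pvals, alpha=0.1):
    """Closed-testing lower bound on the number of outliers (Simes local tests).

    For a fixed size k, the subset hardest to reject is the one holding the
    k largest p-values, so only those subsets need to be checked.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    for k in range(m, 0, -1):
        if not simes_rejects(p[m - k:], alpha):
            return m - k  # largest non-rejected subset has size k
    return m
```

The bound is one-sided: it reports how many test points can be declared outliers collectively, without saying which ones they are.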
Where Pith is reading between the lines
- The same selection mechanism might be reused for other conformal tasks that require choosing among several tests or models.
- The approach could be tested on streaming or sequential data to see whether the selection rule remains reliable over time.
- Connections to adaptive nonparametric testing might allow extensions that further improve power while keeping validity.
Load-bearing premise
The observations must satisfy the exchangeability conditions that make conformal p-values valid, and the automatic selection step must not introduce bias that invalidates the closed testing.
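For reference, the standard split-conformal p-value that this premise refers to can be written as follows. The notation is assumed here rather than taken from the paper: $X_1,\dots,X_n$ are calibration inliers, $X_{n+j}$ is a test point, and $s$ is a score function fitted on data disjoint from both.

```latex
\hat{p}_j \;=\; \frac{1 + \#\bigl\{\, i \in \{1,\dots,n\} : s(X_i) \ge s(X_{n+j}) \,\bigr\}}{n+1},
\qquad
\mathbb{P}\bigl(\hat{p}_j \le t\bigr) \;\le\; t \quad \text{for all } t \in [0,1].
```

The inequality holds when $X_{n+j}$ is an inlier exchangeable with $X_1,\dots,X_n$. If $s$ is chosen by looking at the same calibration or test points in a non-symmetric way, this exchangeability of the scores can fail, which is the concern stated in the premise.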
What would settle it
Run the procedure on simulated data with a known true number of outliers and check whether the reported intervals for the outlier count achieve the claimed coverage; systematic under- or over-coverage would falsify the validity claim.
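A small simulation along these lines might look like the sketch below. It is an illustration under stated assumptions, not the paper's experiment: the data are one-dimensional Gaussians, the score is the raw observation, the local test is Simes, and the quantity checked is a one-sided lower bound on the outlier count rather than a two-sided interval. All function names and default parameters are hypothetical.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    # Larger score = more outlying (assumption of this sketch).
    cal = np.sort(cal_scores)
    n = cal.size
    n_ge = n - np.searchsorted(cal, test_scores, side="left")
    return (1.0 + n_ge) / (n + 1.0)

def count_lower_bound(pvals, alpha):
    # Closed testing with Simes local tests, checked on the k largest p-values.
    p = np.sort(pvals)
    m = p.size
    for k in range(m, 0, -1):
        tail = p[m - k:]
        if not np.any(tail <= alpha * np.arange(1, k + 1) / k):
            return m - k
    return m

def coverage(n_cal=200, m=50, n_out=5, shift=3.0, alpha=0.1, reps=300, seed=0):
    """Fraction of replications in which the bound does not exceed the truth."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        cal = rng.normal(size=n_cal)   # calibration inliers
        test = rng.normal(size=m)      # test batch
        test[:n_out] += shift          # plant n_out outliers
        bound = count_lower_bound(conformal_pvalues(cal, test), alpha)
        hits += int(bound <= n_out)
    return hits / reps

if __name__ == "__main__":
    print("empirical coverage:", coverage())
```

Coverage well below `1 - alpha` across many seeds would point to a validity problem; coverage near 1 with bounds stuck at zero would instead point to low power.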
Figures

Original abstract
This paper develops a flexible distribution-free method for collective outlier detection and enumeration, designed for situations in which the presence of outliers can be detected powerfully even though their precise identification may be challenging due to the sparsity, weakness, or elusiveness of their signals. This method builds upon recent developments in conformal inference and integrates classical ideas from other areas, including multiple testing, locally most powerful and adaptive rank tests, and non-parametric large-sample asymptotics. The key innovation lies in developing a principled and effective approach for automatically choosing the most appropriate machine learning classifier and two-sample testing procedure for a given data set. The performance of our method is investigated through extensive empirical demonstrations, including an analysis of the LHCO high-energy particle collision data set.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a distribution-free procedure for collective outlier detection and enumeration that combines conformal p-values with closed testing. It introduces a data-driven rule for automatically selecting both an ML classifier and a two-sample test statistic for a given dataset, and reports empirical performance on synthetic examples and the LHCO particle-collision data.
Significance. If the validity of the automatic selection step can be established, the method would supply a practical, adaptive framework that inherits error control from conformal inference and closed testing while avoiding manual tuning of the classifier-test pair. The empirical demonstrations on high-energy physics data illustrate potential utility in sparse-signal settings.
major comments (2)
- [Section describing the automatic selection procedure (and any accompanying validity theorem)] The central validity claim rests on conformal exchangeability and closed-testing error control, yet the automatic, data-dependent selection of classifier and test is itself a function of the observed sample. The manuscript must supply an explicit argument (or additional conformal layer) showing that this selection preserves marginal validity of the resulting p-values under the global null; without it, the FDR or FWER guarantees for enumeration are not guaranteed to hold.
- [Theoretical results section] The abstract states that the method is 'distribution-free,' but the performance of the selected classifier-test pair is evaluated only empirically. A formal result establishing that the overall procedure controls the target error rate uniformly over all distributions (or at least under the stated exchangeability conditions) is needed to support the distribution-free claim.
minor comments (2)
- [Notation and algorithm sections] Notation for the conformal p-values and the closed-testing combination rule should be introduced with explicit definitions before their use in the main algorithm.
- [Empirical evaluation] The empirical section would benefit from reporting the frequency with which each classifier-test pair is selected across replications, to illustrate stability of the automatic rule.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address the two major points below and will revise the manuscript accordingly to strengthen the theoretical justification for the automatic selection procedure and the distribution-free guarantees.
- Referee: [Section describing the automatic selection procedure (and any accompanying validity theorem)] The central validity claim rests on conformal exchangeability and closed-testing error control, yet the automatic, data-dependent selection of classifier and test is itself a function of the observed sample. The manuscript must supply an explicit argument (or additional conformal layer) showing that this selection preserves marginal validity of the resulting p-values under the global null; without it, the FDR or FWER guarantees for enumeration are not guaranteed to hold.
  Authors: We agree that an explicit argument is required. The selection rule is a measurable function of the calibration sample only and, under the global null and exchangeability, does not depend on the test points in a way that breaks conformal exchangeability. In the revision we will add a short lemma establishing that the selected p-values remain marginally valid (super-uniform on [0,1] under the null) because the selection is performed symmetrically on the calibration set before any test-point information is used. If the referee prefers, we can also wrap the selection inside an outer conformal layer at the cost of a modest power reduction.
  Revision: yes
- Referee: [Theoretical results section] The abstract states that the method is 'distribution-free,' but the performance of the selected classifier-test pair is evaluated only empirically. A formal result establishing that the overall procedure controls the target error rate uniformly over all distributions (or at least under the stated exchangeability conditions) is needed to support the distribution-free claim.
  Authors: The distribution-free claim is intended to mean that the error control holds for any distribution satisfying the exchangeability assumption used by the underlying conformal p-values and closed testing procedure; it does not claim uniform power or optimality. We will add a theorem (or corollary) in the theoretical results section stating that the overall procedure controls the target error rate (FWER or FDR) under exchangeability, with the automatic selection treated as a data-dependent choice that is fixed before any test point is examined and whose validity follows from the lemma above. The empirical results will be presented as illustrations of power rather than as the sole support for validity.
  Revision: yes
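One conservative way to make a selection step harmless is to choose the score on a tuning split that is disjoint from both the calibration and the test points. The sketch below illustrates that idea only; it is not the selection rule described in the rebuttal, which reuses the calibration sample. The candidate scores, the AUC-style criterion, and the names `auc`, `select_score`, and `CANDIDATES` are all hypothetical.

```python
import numpy as np

def auc(ref_scores, batch_scores):
    """Mann-Whitney estimate of P(batch score > reference score); ties count 1/2."""
    diff = np.asarray(batch_scores)[:, None] - np.asarray(ref_scores)[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def select_score(candidates, tune_inliers, tune_batch):
    """Pick the candidate score that best separates a tuning batch from inliers.

    Only the tuning split is used, so the chosen score is fixed before any
    calibration or test point is seen.
    """
    aucs = {name: auc(f(tune_inliers), f(tune_batch)) for name, f in candidates.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs

# Hypothetical candidate scores for one-dimensional data.
CANDIDATES = {
    "upper_tail": lambda x: np.asarray(x, dtype=float),
    "two_sided": lambda x: np.abs(np.asarray(x, dtype=float)),
}
```

For example, with tuning inliers `[-1, -0.5, 0, 0.5, 1]` and a tuning batch `[-3, 3, -2.5, 2.5]`, the upper-tail score cannot separate the two samples while the two-sided score separates them fully, so `select_score` picks `"two_sided"`. The price of this design is that the tuning points are no longer available for calibration, which is the kind of power cost the rebuttal alludes to.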
Circularity Check
No significant circularity in derivation chain
The paper's central method extends established conformal inference and closed testing procedures to collective outlier detection, with the automatic selection of classifier and test presented as an empirical innovation rather than a fitted quantity renamed as a prediction. No equations or steps in the abstract or description reduce the performance claims or validity guarantees to self-definitional inputs, fitted parameters, or load-bearing self-citations by construction. The approach remains self-contained against external benchmarks from prior conformal and multiple-testing literature.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: conformal prediction yields valid finite-sample p-values under exchangeability of the calibration and test points.
- Standard math: closed testing controls the family-wise error rate over all intersection hypotheses, provided each local test is valid at level α.
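The second axiom can be stated compactly. In the notation assumed here, $H_I = \bigcap_{j \in I} H_j$ is the intersection hypothesis that every test point in $I$ is an inlier, and $\phi_J \in \{0,1\}$ is a local level-$\alpha$ test of $H_J$; the symbols are illustrative and may differ from the paper's.

```latex
\text{reject } H_I \iff \phi_J = 1 \ \text{ for every } J \supseteq I,
\qquad
d(S) \;=\; \min\bigl\{\, |S \setminus I| \;:\; I \subseteq \{1,\dots,m\},\ H_I \text{ not rejected} \,\bigr\}.
```

With probability at least $1-\alpha$, $d(S)$ is no larger than the true number of outliers in $S$, simultaneously for every subset $S$ of the $m$ test points. Taking $S = \{1,\dots,m\}$ gives the lower bound on the total outlier count that the enumeration step reports.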