Collective Outlier Detection and Enumeration with Conformalized Closed Testing

Aldo Solari; Chiara G. Magnani; Matteo Sesia

arxiv: 2308.05534 · v4 · pith:S3IUU5JGnew · submitted 2023-08-10 · 📊 stat.ME


Pith reviewed 2026-05-24 07:33 UTC · model grok-4.3

classification 📊 stat.ME

keywords collective outlier detection · conformal inference · closed testing · multiple testing · distribution-free methods · classifier selection · two-sample tests


The pith

A distribution-free method detects and counts collective outliers by automatically selecting the best classifier and test for the data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method for detecting groups of outliers even when individual identification is difficult due to weak or sparse signals. It combines conformal inference with closed testing to deliver valid inference on the number of outliers, reported in the paper's figures as lower confidence bounds. The central advance is an automatic rule that chooses the most suitable machine learning classifier and two-sample test for each dataset while preserving distribution-free guarantees. Readers would care because many practical settings, such as particle collision data, involve elusive outliers where fixed procedures lose power or validity.

Core claim

The authors claim that conformalized closed testing, paired with a data-driven rule for selecting the classifier and two-sample test, yields a flexible procedure that detects the presence of outliers and enumerates them with valid error control without assuming a specific data distribution.

What carries the argument

The automatic selection rule inside a conformalized closed testing procedure that produces valid p-values for outlier counts.

If this is right

The method produces an estimate of the total number of outliers rather than only a detection decision.
Error control holds after the classifier and test are chosen from the data.
It applies to settings with sparse or weak outlier signals, as shown on the LHCO particle collision data.
It integrates conformal p-values with classical multiple-testing ideas for enumeration.
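
The enumeration idea in the list above can be illustrated with a minimal sketch. Given one p-value per test point (small values suggest an outlier), closed testing with a Simes local test gives a lower confidence bound on the number of outliers. The Simes test is an assumption made here for illustration: it is commonly justified for conformal p-values through their positive dependence, whereas the paper selects its local test adaptively. The function names are hypothetical and not taken from the paper.

```python
import numpy as np

def simes_rejects(pvals_sorted, alpha):
    """Simes local test on ascending p-values: reject if any p_(j) <= j * alpha / k."""
    k = len(pvals_sorted)
    if k == 0:
        return False
    thresholds = alpha * np.arange(1, k + 1) / k
    return bool(np.any(pvals_sorted <= thresholds))

def outlier_count_lower_bound(pvals, alpha=0.1):
    """(1 - alpha) lower confidence bound on the number of outliers.

    Closed testing shortcut: among all subsets of size k, the one made of the
    k largest p-values is the hardest for Simes to reject. The bound is
    n minus the largest k for which that subset is not rejected.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    n = len(p)
    for k in range(n, 0, -1):
        if not simes_rejects(p[n - k:], alpha):
            return n - k
    return n  # every non-empty subset is rejected

# Illustrative p-values (made up for this example): two small, three large.
example = [0.001, 0.002, 0.8, 0.9, 0.95]
print(outlier_count_lower_bound(example, alpha=0.1))
```

The bound is a statement about how many outliers are present, not about which points they are, which matches the collective framing of the paper.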

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selection mechanism might be reused for other conformal tasks that require choosing among several tests or models.
The approach could be tested on streaming or sequential data to see whether the selection rule remains reliable over time.
Connections to adaptive nonparametric testing might allow extensions that further improve power while keeping validity.

Load-bearing premise

The observations must satisfy the exchangeability conditions that make conformal p-values valid, and the automatic selection step must not introduce bias that invalidates the closed testing.
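
For readers unfamiliar with the premise, here is a minimal sketch of how a standard split-conformal p-value is computed from outlyingness scores. It assumes that larger scores mean "more outlying" and that the calibration scores come from inliers exchangeable with the inlier test points; the function name is hypothetical, and the scores would in practice come from a fitted classifier that is not shown here.

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """Split-conformal p-values: (1 + #{calibration scores >= test score}) / (m + 1)."""
    cal = np.sort(np.asarray(cal_scores, dtype=float))
    test = np.asarray(test_scores, dtype=float)
    m = len(cal)
    # searchsorted with side="left" counts calibration scores strictly below each test score
    n_below = np.searchsorted(cal, test, side="left")
    n_ge = m - n_below
    return (1.0 + n_ge) / (m + 1.0)

# Toy scores (made up): four calibration inliers, four test points.
print(conformal_pvalues([1, 2, 3, 4], [0, 2.5, 5, 3]))
```

Under exchangeability each inlier p-value is super-uniform, meaning P(p <= t) <= t for every t in [0, 1]. If the scoring function were tuned using the same test points, that property could fail, which is the concern raised about the selection step.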

What would settle it

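Run the procedure on simulated data with a known true number of outliers and check whether the reported lower confidence bounds on the outlier count stay at or below the true count at the claimed rate (for example, 90% of the time). Systematic under-coverage would falsify the validity claim; over-coverage would indicate a conservative, less powerful procedure rather than an invalid one.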
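
A check of this kind can be sketched on a toy problem. The setup below is an assumption made for illustration and is not the paper's experiment: scores are one-dimensional Gaussian draws used directly as outlyingness scores, outliers are shifted by a fixed amount, and a Simes local test stands in for the paper's adaptive choice. All names and parameter values are hypothetical.

```python
import numpy as np

def conformal_pvalues(cal, test):
    """Split-conformal p-values; larger score means more outlying."""
    cal = np.sort(cal)
    n_ge = len(cal) - np.searchsorted(cal, test, side="left")
    return (1.0 + n_ge) / (len(cal) + 1.0)

def simes_lower_bound(pvals, alpha):
    """Closed-testing lower bound on the outlier count with Simes local tests."""
    p = np.sort(pvals)
    n = len(p)
    for k in range(n, 0, -1):
        top = p[n - k:]  # the k largest p-values, ascending
        if not np.any(top <= alpha * np.arange(1, k + 1) / k):
            return n - k
    return n

def estimate_coverage(n_reps=500, m=500, n=20, n_out=5, shift=3.0, alpha=0.1, seed=0):
    """Fraction of replications in which the bound does not exceed the true count."""
    rng = np.random.default_rng(seed)
    bounds = np.empty(n_reps, dtype=int)
    for r in range(n_reps):
        cal = rng.normal(size=m)    # inlier calibration scores
        test = rng.normal(size=n)   # test scores, inliers by default
        test[:n_out] += shift       # the first n_out test points become outliers
        bounds[r] = simes_lower_bound(conformal_pvalues(cal, test), alpha)
    return float(np.mean(bounds <= n_out)), bounds

coverage, bounds = estimate_coverage()
print(f"empirical coverage: {coverage:.3f}, median bound: {np.median(bounds):.0f}")
```

If the procedure is valid, the empirical coverage should sit at or above 1 - alpha up to Monte Carlo error. To probe the selection step specifically, the same loop would need to include the data-driven choice of classifier and test inside each replication.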

Figures

Figures reproduced from arXiv: 2308.05534 by Aldo Solari, Chiara G. Magnani, Matteo Sesia.

**Figure 2.** Median values for a 90% lower confidence bound on the number of outliers in a test set, computed by …

**Figure 3.** Median values for a 90% lower confidence bound on the number of outliers within an adaptively …

**Figure 4.** Performance of ACODE on synthetic data with adversarially hidden outliers exhibiting underdispersed …

**Figure 5.** Median values for a 90% lower confidence bound on the number of outliers within an adaptively …

Original abstract

This paper develops a flexible distribution-free method for collective outlier detection and enumeration, designed for situations in which the presence of outliers can be detected powerfully even though their precise identification may be challenging due to the sparsity, weakness, or elusiveness of their signals. This method builds upon recent developments in conformal inference and integrates classical ideas from other areas, including multiple testing, locally most powerful and adaptive rank tests, and non-parametric large-sample asymptotics. The key innovation lies in developing a principled and effective approach for automatically choosing the most appropriate machine learning classifier and two-sample testing procedure for a given data set. The performance of our method is investigated through extensive empirical demonstrations, including an analysis of the LHCO high-energy particle collision data set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper's real contribution is a conformal-plus-closed-testing procedure for enumerating collective outliers together with a claimed-principled rule for picking the classifier and test from the data; the open question is whether that rule keeps the distribution-free guarantees intact.


The paper combines conformal p-values with closed testing to detect and count groups of outliers when individual signals are weak or sparse. The automatic, data-driven choice of both the ML classifier and the two-sample test is presented as the main new piece, and the authors apply it to the LHCO particle-collision data. That combination is not a routine extension of the cited conformal and multiple-testing literature, and the real-data example is a concrete plus for anyone who works with scientific measurements that have this structure.

If the selection rule is shown to preserve marginal validity, the method gives practitioners a way to avoid hand-tuning the detector without losing error control. The stress-test concern about selection bias is worth checking directly in the proofs. The abstract calls the selection “principled,” but the validity argument for the chosen classifier and test must be explicit; otherwise the FDR or FWER control claimed for enumeration does not automatically follow. No other load-bearing gaps are visible from the description.

The work is aimed at statisticians and physicists who need distribution-free tools for collective outlier problems. It is coherent on its own terms and engages the relevant literature, so it deserves a serious referee who can verify the selection step and the empirical controls. I would send it out for review rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper develops a distribution-free procedure for collective outlier detection and enumeration that combines conformal p-values with closed testing. It introduces a data-driven rule for automatically selecting both an ML classifier and a two-sample test statistic for a given dataset, and reports empirical performance on synthetic examples and the LHCO particle-collision data.

Significance. If the validity of the automatic selection step can be established, the method would supply a practical, adaptive framework that inherits error control from conformal inference and closed testing while avoiding manual tuning of the classifier-test pair. The empirical demonstrations on high-energy physics data illustrate potential utility in sparse-signal settings.

major comments (2)

[Section describing the automatic selection procedure (and any accompanying validity theorem)] The central validity claim rests on conformal exchangeability and closed-testing error control, yet the automatic, data-dependent selection of classifier and test is itself a function of the observed sample. The manuscript must supply an explicit argument (or additional conformal layer) showing that this selection preserves marginal validity of the resulting p-values under the global null; without it, the FDR or FWER guarantees for enumeration are not guaranteed to hold.
[Theoretical results section] The abstract states that the method is 'distribution-free,' but the performance of the selected classifier-test pair is evaluated only empirically. A formal result establishing that the overall procedure controls the target error rate uniformly over all distributions (or at least under the stated exchangeability conditions) is needed to support the distribution-free claim.

minor comments (2)

[Notation and algorithm sections] Notation for the conformal p-values and the closed-testing combination rule should be introduced with explicit definitions before their use in the main algorithm.
[Empirical evaluation] The empirical section would benefit from reporting the frequency with which each classifier-test pair is selected across replications, to illustrate stability of the automatic rule.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address the two major points below and will revise the manuscript accordingly to strengthen the theoretical justification for the automatic selection procedure and the distribution-free guarantees.


Referee: [Section describing the automatic selection procedure (and any accompanying validity theorem)] The central validity claim rests on conformal exchangeability and closed-testing error control, yet the automatic, data-dependent selection of classifier and test is itself a function of the observed sample. The manuscript must supply an explicit argument (or additional conformal layer) showing that this selection preserves marginal validity of the resulting p-values under the global null; without it, the FDR or FWER guarantees for enumeration are not guaranteed to hold.

Authors: We agree that an explicit argument is required. The selection rule is a measurable function of the calibration sample only and, under the global null and exchangeability, does not depend on the test points in a way that breaks the conformal exchangeability. In the revision we will add a short lemma establishing that the selected p-values remain marginally valid (super-uniform on [0,1] under the null, that is, P(p ≤ t) ≤ t) because the selection is performed symmetrically on the calibration set before any test-point information is used. If the referee prefers, we can also wrap the selection inside an outer conformal layer at the cost of a modest power reduction.
revision: yes

Referee: [Theoretical results section] The abstract states that the method is 'distribution-free,' but the performance of the selected classifier-test pair is evaluated only empirically. A formal result establishing that the overall procedure controls the target error rate uniformly over all distributions (or at least under the stated exchangeability conditions) is needed to support the distribution-free claim.

Authors: The distribution-free claim is intended to mean that the error control holds for any distribution satisfying the exchangeability assumption used by the underlying conformal p-values and closed testing procedure; it does not claim uniform power or optimality. We will add a theorem (or corollary) in the theoretical results section that states the overall procedure controls the target error rate (FWER or FDR) under exchangeability, with the automatic selection treated as a fixed but data-dependent choice whose validity follows from the lemma above. The empirical results will be presented as illustrations of power rather than as the sole support for validity.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's central method extends established conformal inference and closed testing procedures to collective outlier detection, with the automatic selection of classifier and test presented as an empirical innovation rather than a fitted quantity renamed as a prediction. No equations or steps in the abstract or description reduce the performance claims or validity guarantees to self-definitional inputs, fitted parameters, or load-bearing self-citations by construction. The approach remains self-contained against external benchmarks from prior conformal and multiple-testing literature.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard assumptions from conformal prediction and multiple testing; no new free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Conformal prediction yields valid finite-sample p-values or scores under exchangeability of the data points.
Core requirement invoked for the distribution-free property.

standard math: Closed testing procedures control the family-wise error rate when testing multiple hypotheses simultaneously.
Classical result used to support enumeration without excessive false positives.


Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages

[1]

Ahmed, M. and A. N. Mahmood (2014). Network traffic analysis based on collective anomaly detection. In 9th IEEE Conference on Industrial Electronics and Applications , pp.\ 1141--1146. IEEE

work page 2014
[2]

Hemerik, L

Andreella, A., J. Hemerik, L. Finos, W. Weeda, and J. Goeman (2023). Permutation-based true discovery proportions for functional magnetic resonance imaging cluster analysis. Statistics in Medicine\/ 42\/ (14), 2311--2340

work page 2023
[3]

Barber, R. F., E. Cand \`e s, A. Ramdas, and R. J. Tibshirani (2021). Predictive inference with the jackknife+. Ann. Stat.\/ 49\/ (1), 486--507

work page 2021
[4]

Barber, R. F., E. J. Cand \`e s, A. Ramdas, and R. J. Tibshirani (2023). Conformal prediction beyond exchangeability. Ann. Stat.\/ 51\/ (2), 816--845

work page 2023
[5]

Cand \`e s, L

Bates, S., E. Cand \`e s, L. Lei, Y. Romano, and M. Sesia (2023). Testing for outliers with conformal p-values. Ann. Stat.\/ 51\/ (1), 149--178

work page 2023
[6]

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B\/ 57\/ (1), 289--300

work page 1995
[7]

Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Stat.\/ 29\/ (4), 1165--1188

work page 2001
[8]

Birnbaum, A. (1954). Combining independent tests of significance. J. Am. Stat. Assoc.\/ 49\/ (267), 559--574

work page 1954
[9]

Thirion, and P

Blain, A., B. Thirion, and P. Neuvial (2022). Notip: Non-parametric true discovery proportion control for brain imaging. NeuroImage\/ 260 , 119492

work page 2022
[10]

Neuvial, and E

Blanchard, G., P. Neuvial, and E. Roquain (2020). Post hoc confidence bounds on false positives using reference families. Ann. Stat.\/ 48\/ (3), 1281--1303

work page 2020
[11]

Bogomolov, M. (2023). Testing partial conjunction hypotheses under dependency, with applications to meta-analysis. Electron. J. Stat.\/ 17\/ (1), 102--155

work page 2023
[12]

Kraft, Charles, and C

Buckle, N., H. Kraft, Charles, and C. van Eeden (1969). An approximation to the wilcoxon-mann-whitney distribution. J. Am. Stat. Assoc.\/ 64\/ (326), 225--251

work page 1969
[13]

Cai, T. T. and W. Sun (2017). Large-scale global and simultaneous inference: Estimation and testing in very high dimensions. Annu. Rev. Econ.\/ 9 , 411--439

work page 2017
[14]

Chen, Y., P. Liu, K. S. Tan, and R. Wang (2023). Trade-off between validity and efficiency of merging p-values under arbitrary dependence. Stat. Sin.\/ 33\/ (2), 851--872

work page 2023
[15]

Choi, W. and I. Kim (2023). Averaging p-values under exchangeability. Statist. Probab. Lett.\/ 194 , 109748

work page 2023
[16]

Cox, D. R. and D. V. Hinkley (1979). Theoretical statistics . CRC Press

work page 1979
[17]

Amsterdam Library of Object Images (ALOI) Data Set

Dataset. Amsterdam Library of Object Images (ALOI) Data Set . https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/literature/ALOI. Not normalized, without duplicates. Accessed: January, 2021

work page 2021
[18]

Covertype Data Set

Dataset. Covertype Data Set . http://odds.cs.stonybrook.edu/forestcovercovertype-dataset. Accessed: January, 2021

work page 2021
[19]

Credit Card Fraud Detection Data Set

Dataset. Credit Card Fraud Detection Data Set . https://www.kaggle.com/mlg-ulb/creditcardfraud. Accessed: January, 2021

work page 2021
[20]

Mammography Data Set

Dataset. Mammography Data Set . http://odds.cs.stonybrook.edu/mammography-dataset/. Accessed: January, 2021

work page 2021
[21]

Pen-Based Recognition of Handwritten Digits Data Set

Dataset. Pen-Based Recognition of Handwritten Digits Data Set . http://odds.cs.stonybrook.edu/pendigits-dataset. Accessed: January, 2021

work page 2021
[22]

Statlog (Shuttle) Data Set

Dataset. Statlog (Shuttle) Data Set . http://odds.cs.stonybrook.edu/shuttle-dataset. Accessed: January, 2021

work page 2021
[23]

Dobriban, E. (2020). Fast closed testing for exchangeable local tests. Biometrika\/ 107\/ (3), 761--768

work page 2020
[24]

Donoho, D. and J. Jin (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat.\/ 32\/ (3), 962--994

work page 2004
[25]

Donoho, D. and J. Jin (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statist. Sci.\/ 30\/ (1), 1--25

work page 2015
[26]

Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc.\/ 50\/ (272), 1096--1121

work page 1955
[27]

Spitali, K

Ebrahimpoor, M., P. Spitali, K. Hettne, R. Tsonaka, and J. Goeman (2020). Simultaneous enrichment analysis of all possible gene-sets: unifying self-contained and competitive methods. Brief. Bioinform.\/ 21\/ (4), 1302--1312

work page 2020
[28]

Edgington, E. S. (1972). An additive method for combining probability values from independent experiments. J. Clin. Psychol.\/ 80\/ (2), 351--363

work page 1972
[29]

Feroze, A., A. Daud, T. Amjad, and M. K. Hayat (2021). Group anomaly detection: past notions, present insights, and future prospects. SN Computer Science\/ 2 , 1--27

work page 2021
[30]

Fisher, R. A. (1925). Statistical methods for research workers. In Breakthroughs in statistics: Methodology and distribution , pp.\ 66--70. Springer

work page 1925
[31]

Fix, E. and L. J. Hodges, Joseph (1955). Significance probabilities of the wilcoxon test. Ann. Math. Stat.\/ 26\/ (2), 301--312

work page 1955
[32]

Genovese, C. R. and L. Wasserman (2006). Exceedance control of the false discovery proportion. J. Am. Stat. Assoc.\/ 101\/ (476), 1408--1417

work page 2006
[33]

Goeman, J. J., P. G \'o recki, R. Monajemi, X. Chen, T. E. Nichols, and W. Weeda (2023). Cluster extent inference revisited: quantification and localisation of brain activity. J. R. Stat. Soc. B\/ 85\/ (4), 1128--1153

work page 2023
[34]

Goeman, J. J., J. Hemerik, and A. Solari (2021). Only closed testing procedures are admissible for controlling false discovery proportions. Ann. Stat.\/ 49\/ (2), 1218--1238

work page 2021
[35]

Goeman, J. J., R. J. Meijer, T. J. Krebs, and A. Solari (2019). Simultaneous control of all false discovery proportions in large-scale multiple hypothesis testing. Biometrika\/ 106\/ (4), 841--856

work page 2019
[36]

Goeman, J. J. and A. Solari (2011). Multiple testing for exploratory research. Stat. Sci.\/ 26\/ (4), 584–597

work page 2011
[37]

Guan, L. and R. Tibshirani (2022). Prediction and outlier detection in classification problems. J. R. Stat. Soc. B\/ 84\/ (2), 524--546

work page 2022
[38]

Heard, N. A. and P. Rubin-Delanchy (2018). Choosing between methods of combining-values. Biometrika\/ 105\/ (1), 239--246

work page 2018
[39]

Heller, R. and A. Solari (2023). Simultaneous directional inference. J. R. Stat. Soc. B\/ , qkad137

work page 2023
[40]

Hemerik, J. and J. Goeman (2018). Exact testing with random permutations. Test\/ 27\/ (4), 811--825

work page 2018
[41]

Hemerik, J. and J. J. Goeman (2021). Another look at the lady tasting tea and differences between permutation tests and randomisation tests. Int. Stat. Rev.\/ 89\/ (2), 367--381

work page 2021
[42]

Solari, and J

Hemerik, J., A. Solari, and J. J. Goeman (2019). Permutation-based simultaneous confidence bounds for the false discovery proportion. Biometrika\/ 106\/ (3), 635--649

work page 2019
[43]

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Stat.\/ 19\/ (3), 293--325

work page 1948
[44]

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified bonferroni test. Biometrika\/ 75\/ (2), 383--386

work page 1988
[45]

Hu, X. and J. Lei (2023). A two-sample conditional distribution test using conformal prediction and weighted rank sum. J. Am. Stat. Assoc.\/ (just-accepted), 1--43

work page 2023
[46]

Nachman, D

Kasieczka, G., B. Nachman, D. Shih, O. Amram, A. Andreassen, K. Benkendorfer, B. Bortolato, G. Brooijmans, F. Canelli, J. H. Collins, et al. (2021). The LHC Olympics 2020 a community challenge for anomaly detection in high energy physics. Reports on progress in physics\/ 84\/ (12), 124201

work page 2021
[47]

Katsevich, E. and A. Ramdas (2020). Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann. Stat.\/ 48\/ (6), 3465--3487

work page 2020
[48]

Kuchibhotla, A. K. (2020). Exchangeability, conformal prediction, and rank tests. arXiv preprint arXiv:2005.06095\/

work page arXiv 2020
[49]

Laxhammar, R. and G. Falkman (2015). Inductive conformal anomaly detection for sequential detection of anomalous sub-trajectories. Ann. Math. Artif. Intell.\/ 74 , 67--94

work page 2015
[50]

Lehmann, E. L. (1953). The power of rank tests. Ann. Math. Stat.\/ 24 , 23--42

work page 1953
[51]

Lehmann, E. L. and J. P. Romano (2005). Testing Statistical Hypotheses\/ (3 ed.). Springer Texts in Statistics. Springer New York, NY

work page 2005
[52]

Li, J., M. H. Maathuis, and J. J. Goeman (2024). Simultaneous false discovery proportion bounds via knockoffs and closed testing. J. R. Stat. Soc. B\/ , qkae012

work page 2024
[53]

Sesia, and W

Liang, Z., M. Sesia, and W. Sun (2024). Integrative conformal p-values for out-of-distribution testing with labelled outliers. J. R. Stat. Soc. B\/ , qkad138

work page 2024
[54]

Mann, H. B. and D. R. Whitney (1947). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat.\/ 18\/ (1), 50 -- 60

work page 1947
[55]

Marandon, A., L. Lei, D. Mary, and E. Roquain (2024). Adaptive novelty detection with false discovery rate guarantee. Ann. Stat.\/ 52\/ (1), 157--183

work page 2024
[56]

Peritz, and K

Marcus, R., E. Peritz, and K. R. Gabriel (1976). Closed testing procedures with special reference to ordered analysis of variance. Biometrika\/ 1\/ (63), 655–660

work page 1976
[57]

Mary, D. and E. Roquain (2022). Semi-supervised multiple testing. Electron. J. Stat.\/ 16\/ (2), 4926--4981

work page 2022
[58]

Meinshausen, N. and J. Rice (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Stat.\/ 34\/ (1), 373--393

work page 2006
[59]

Meng, X.-L. (1994). Posterior predictive p -values. Ann. Stat.\/ 22\/ (3), 1142--1160

work page 1994
[60]

Owen, A. B. (2009). Karl Pearson’s meta-analysis revisited . Ann. Stat.\/ 37\/ (6B), 3867 -- 3892

work page 2009
[61]

Patra, R. K. and B. Sen (2016). Estimation of a two-component mixture model with applications to multiple testing. J. R. Statist. Soc. B\/ 78\/ (4), 869--–893

work page 2016
[62]

Pesarin, F. and L. Salmaso (2010). Permutation tests for complex data: theory, applications and software . John Wiley & Sons

work page 2010
[63]

Rosenblatt, J. D., L. Finos, W. D. Weeda, A. Solari, and J. J. Goeman (2018). All-resolutions inference for brain imaging. Neuroimage\/ 181 , 786--796

work page 2018
[64]

R \"u schendorf, L. (1982). Random variables with maximum sums. Adv. in Appl. Probab.\/ 14\/ (3), 623--632

work page 1982
[65]

Sarkar, S. K. (2008). On the Simes inequality and its generalization. In Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen , Volume 1, pp.\ 231--243. Institute of Mathematical Statistics

work page 2008
[66]

Schweder, T. and E. Spj tvoll (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika\/ 69\/ (3), 493--502

work page 1982
[67]

Shiraishi, T. (1985). Local powers of two-sample and multi-sample rank tests for lehmann's contaminated alternative. Ann. Inst. Stat. Math.\/ 37 , 519--527

work page 1985
[68]

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika\/ 73\/ (3), 751--754

work page 1986
[69]

Stoepker, I. V., R. M. Castro, E. Arias-Castro, and E. van den Heuvel (2024). Anomaly detection for a large number of streams: A permutation-based higher criticism approach. J. Am. Stat. Assoc.\/ 119\/ (545), 461--474

work page 2024
[70]

Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. B\/ 64\/ (3), 479–498

work page 2002
[71]

Storey, J. D., J. E. Taylor, and D. Siegmund (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. R. Stat. Soc. B\/ 66\/ (1), 187–205

work page 2004
[72]

Tian, J., X. Chen, E. Katsevich, J. J. Goeman, and A. Ramdas (2023). Large-scale simultaneous inference under dependence. Scand. J. Stat.\/ 50\/ (2), 750–796

work page 2023
[73]

Tibshirani, R. J., R. Foygel Barber, E. Cand \`e s, and A. Ramdas (2019). Conformal prediction under covariate shift. In Adv. Neural Inf. Process. Syst. , Volume 32

work page 2019
[74]

Kuusela, E

Vatanen, T., M. Kuusela, E. Malmi, T. Raiko, T. Aaltonen, and Y. Nagai (2012). Semi-supervised detection of collective anomalies with an application in high energy particle physics. In International Joint Conference on Neural Networks , pp.\ 1--8. IEEE

work page 2012
[75]

Finos, and J

Vesely, A., L. Finos, and J. J. Goeman (2023). Permutation-based true discovery guarantee by sum tests. J. R. Stat. Soc. B\/ 85\/ (3), 664--683

work page 2023
[76]

Gammerman, and G

Vovk, V., A. Gammerman, and G. Shafer (2005). Algorithmic learning in a random world , Volume 29. Springer

work page 2005
[77]

Wang, and R

Vovk, V., B. Wang, and R. Wang (2022). Admissible ways of merging p-values under arbitrary dependence. Ann. Stat.\/ 50\/ (1), 351--375

work page 2022
[78]

Vovk, V. and R. Wang (2020). Combining p-values via averaging. Biometrika\/ 107\/ (4), 791--808

work page 2020
[79]

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin\/ 1\/ (6), 80–83

work page 1945

[1] [1]

Ahmed, M. and A. N. Mahmood (2014). Network traffic analysis based on collective anomaly detection. In 9th IEEE Conference on Industrial Electronics and Applications , pp.\ 1141--1146. IEEE

work page 2014

[2] [2]

Hemerik, L

Andreella, A., J. Hemerik, L. Finos, W. Weeda, and J. Goeman (2023). Permutation-based true discovery proportions for functional magnetic resonance imaging cluster analysis. Statistics in Medicine\/ 42\/ (14), 2311--2340

work page 2023

[3] [3]

Barber, R. F., E. Cand \`e s, A. Ramdas, and R. J. Tibshirani (2021). Predictive inference with the jackknife+. Ann. Stat.\/ 49\/ (1), 486--507

work page 2021

[4] [4]

Barber, R. F., E. J. Cand \`e s, A. Ramdas, and R. J. Tibshirani (2023). Conformal prediction beyond exchangeability. Ann. Stat.\/ 51\/ (2), 816--845

work page 2023

[5] [5]

Cand \`e s, L

Bates, S., E. Cand \`e s, L. Lei, Y. Romano, and M. Sesia (2023). Testing for outliers with conformal p-values. Ann. Stat.\/ 51\/ (1), 149--178

work page 2023

[6] [6]

Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B\/ 57\/ (1), 289--300

work page 1995

[7] [7]

Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Stat.\/ 29\/ (4), 1165--1188

work page 2001

[8] [8]

Birnbaum, A. (1954). Combining independent tests of significance. J. Am. Stat. Assoc.\/ 49\/ (267), 559--574

work page 1954

[9] [9]

Thirion, and P

Blain, A., B. Thirion, and P. Neuvial (2022). Notip: Non-parametric true discovery proportion control for brain imaging. NeuroImage\/ 260 , 119492

work page 2022

[10] [10]

Neuvial, and E

Blanchard, G., P. Neuvial, and E. Roquain (2020). Post hoc confidence bounds on false positives using reference families. Ann. Stat.\/ 48\/ (3), 1281--1303

work page 2020

[11] [11]

Bogomolov, M. (2023). Testing partial conjunction hypotheses under dependency, with applications to meta-analysis. Electron. J. Stat.\/ 17\/ (1), 102--155

work page 2023

[12] [12]

Kraft, Charles, and C

Buckle, N., H. Kraft, Charles, and C. van Eeden (1969). An approximation to the wilcoxon-mann-whitney distribution. J. Am. Stat. Assoc.\/ 64\/ (326), 225--251

work page 1969

[13] [13]

Cai, T. T. and W. Sun (2017). Large-scale global and simultaneous inference: Estimation and testing in very high dimensions. Annu. Rev. Econ.\/ 9 , 411--439

work page 2017

[14] [14]

Chen, Y., P. Liu, K. S. Tan, and R. Wang (2023). Trade-off between validity and efficiency of merging p-values under arbitrary dependence. Stat. Sin.\/ 33\/ (2), 851--872

work page 2023

[15] [15]

Choi, W. and I. Kim (2023). Averaging p-values under exchangeability. Statist. Probab. Lett.\/ 194 , 109748

work page 2023

[16] [16]

Cox, D. R. and D. V. Hinkley (1979). Theoretical statistics . CRC Press

work page 1979

[17] [17]

Amsterdam Library of Object Images (ALOI) Data Set

Dataset. Amsterdam Library of Object Images (ALOI) Data Set . https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/literature/ALOI. Not normalized, without duplicates. Accessed: January, 2021

work page 2021

[18] [18]

Covertype Data Set

Dataset. Covertype Data Set . http://odds.cs.stonybrook.edu/forestcovercovertype-dataset. Accessed: January, 2021

work page 2021

[19] [19]

Credit Card Fraud Detection Data Set

Dataset. Credit Card Fraud Detection Data Set . https://www.kaggle.com/mlg-ulb/creditcardfraud. Accessed: January, 2021

work page 2021

[20] [20]

Mammography Data Set

Dataset. Mammography Data Set . http://odds.cs.stonybrook.edu/mammography-dataset/. Accessed: January, 2021

work page 2021

[21] [21]

Pen-Based Recognition of Handwritten Digits Data Set

Dataset. Pen-Based Recognition of Handwritten Digits Data Set . http://odds.cs.stonybrook.edu/pendigits-dataset. Accessed: January, 2021

work page 2021

[22] [22]

Statlog (Shuttle) Data Set

Dataset. Statlog (Shuttle) Data Set . http://odds.cs.stonybrook.edu/shuttle-dataset. Accessed: January, 2021

work page 2021

[23] [23]

Dobriban, E. (2020). Fast closed testing for exchangeable local tests. Biometrika\/ 107\/ (3), 761--768

work page 2020

[24] [24]

Donoho, D. and J. Jin (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat.\/ 32\/ (3), 962--994

work page 2004

[25] [25]

Donoho, D. and J. Jin (2015). Higher criticism for large-scale inference, especially for rare and weak effects. Statist. Sci.\/ 30\/ (1), 1--25

work page 2015

[26] [26]

Dunnett, C. W. (1955). A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 50(272), 1096–1121

[27]

Ebrahimpoor, M., P. Spitali, K. Hettne, R. Tsonaka, and J. Goeman (2020). Simultaneous enrichment analysis of all possible gene-sets: unifying self-contained and competitive methods. Brief. Bioinform. 21(4), 1302–1312

[28]

Edgington, E. S. (1972). An additive method for combining probability values from independent experiments. J. Clin. Psychol. 80(2), 351–363

[29]

Feroze, A., A. Daud, T. Amjad, and M. K. Hayat (2021). Group anomaly detection: past notions, present insights, and future prospects. SN Computer Science 2, 1–27

[30]

Fisher, R. A. (1925). Statistical methods for research workers. In Breakthroughs in statistics: Methodology and distribution, pp. 66–70. Springer

[31]

Fix, E. and J. L. Hodges, Jr. (1955). Significance probabilities of the Wilcoxon test. Ann. Math. Stat. 26(2), 301–312

[32]

Genovese, C. R. and L. Wasserman (2006). Exceedance control of the false discovery proportion. J. Am. Stat. Assoc. 101(476), 1408–1417

[33]

Goeman, J. J., P. Górecki, R. Monajemi, X. Chen, T. E. Nichols, and W. Weeda (2023). Cluster extent inference revisited: quantification and localisation of brain activity. J. R. Stat. Soc. B 85(4), 1128–1153

[34]

Goeman, J. J., J. Hemerik, and A. Solari (2021). Only closed testing procedures are admissible for controlling false discovery proportions. Ann. Stat. 49(2), 1218–1238

[35]

Goeman, J. J., R. J. Meijer, T. J. Krebs, and A. Solari (2019). Simultaneous control of all false discovery proportions in large-scale multiple hypothesis testing. Biometrika 106(4), 841–856

[36]

Goeman, J. J. and A. Solari (2011). Multiple testing for exploratory research. Stat. Sci. 26(4), 584–597

[37]

Guan, L. and R. Tibshirani (2022). Prediction and outlier detection in classification problems. J. R. Stat. Soc. B 84(2), 524–546

[38]

Heard, N. A. and P. Rubin-Delanchy (2018). Choosing between methods of combining p-values. Biometrika 105(1), 239–246

[39]

Heller, R. and A. Solari (2023). Simultaneous directional inference. J. R. Stat. Soc. B, qkad137

[40]

Hemerik, J. and J. Goeman (2018). Exact testing with random permutations. Test 27(4), 811–825

[41]

Hemerik, J. and J. J. Goeman (2021). Another look at the lady tasting tea and differences between permutation tests and randomisation tests. Int. Stat. Rev. 89(2), 367–381

[42]

Hemerik, J., A. Solari, and J. J. Goeman (2019). Permutation-based simultaneous confidence bounds for the false discovery proportion. Biometrika 106(3), 635–649

[43]

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19(3), 293–325

[44]

Hommel, G. (1988). A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75(2), 383–386

[45]

Hu, X. and J. Lei (2023). A two-sample conditional distribution test using conformal prediction and weighted rank sum. J. Am. Stat. Assoc. (just-accepted), 1–43

[46]

Kasieczka, G., B. Nachman, D. Shih, O. Amram, A. Andreassen, K. Benkendorfer, B. Bortolato, G. Brooijmans, F. Canelli, J. H. Collins, et al. (2021). The LHC Olympics 2020: a community challenge for anomaly detection in high energy physics. Reports on Progress in Physics 84(12), 124201

[47]

Katsevich, E. and A. Ramdas (2020). Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann. Stat. 48(6), 3465–3487

[48]

Kuchibhotla, A. K. (2020). Exchangeability, conformal prediction, and rank tests. arXiv preprint arXiv:2005.06095

[49]

Laxhammar, R. and G. Falkman (2015). Inductive conformal anomaly detection for sequential detection of anomalous sub-trajectories. Ann. Math. Artif. Intell. 74, 67–94

[50]

Lehmann, E. L. (1953). The power of rank tests. Ann. Math. Stat. 24, 23–42

[51]

Lehmann, E. L. and J. P. Romano (2005). Testing Statistical Hypotheses (3 ed.). Springer Texts in Statistics. Springer New York, NY

[52]

Li, J., M. H. Maathuis, and J. J. Goeman (2024). Simultaneous false discovery proportion bounds via knockoffs and closed testing. J. R. Stat. Soc. B, qkae012

[53]

Liang, Z., M. Sesia, and W. Sun (2024). Integrative conformal p-values for out-of-distribution testing with labelled outliers. J. R. Stat. Soc. B, qkad138

[54]

Mann, H. B. and D. R. Whitney (1947). On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18(1), 50–60

[55]

Marandon, A., L. Lei, D. Mary, and E. Roquain (2024). Adaptive novelty detection with false discovery rate guarantee. Ann. Stat. 52(1), 157–183

[56]

Marcus, R., E. Peritz, and K. R. Gabriel (1976). Closed testing procedures with special reference to ordered analysis of variance. Biometrika 63(3), 655–660

[57]

Mary, D. and E. Roquain (2022). Semi-supervised multiple testing. Electron. J. Stat. 16(2), 4926–4981

[58]

Meinshausen, N. and J. Rice (2006). Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses. Ann. Stat. 34(1), 373–393

[59]

Meng, X.-L. (1994). Posterior predictive p-values. Ann. Stat. 22(3), 1142–1160

[60]

Owen, A. B. (2009). Karl Pearson’s meta-analysis revisited. Ann. Stat. 37(6B), 3867–3892

[61]

Patra, R. K. and B. Sen (2016). Estimation of a two-component mixture model with applications to multiple testing. J. R. Stat. Soc. B 78(4), 869–893

[62]

Pesarin, F. and L. Salmaso (2010). Permutation tests for complex data: theory, applications and software. John Wiley & Sons

[63]

Rosenblatt, J. D., L. Finos, W. D. Weeda, A. Solari, and J. J. Goeman (2018). All-resolutions inference for brain imaging. Neuroimage 181, 786–796

[64]

Rüschendorf, L. (1982). Random variables with maximum sums. Adv. in Appl. Probab. 14(3), 623–632

[65]

Sarkar, S. K. (2008). On the Simes inequality and its generalization. In Beyond parametrics in interdisciplinary research: Festschrift in honor of Professor Pranab K. Sen, Volume 1, pp. 231–243. Institute of Mathematical Statistics

[66]

Schweder, T. and E. Spjøtvoll (1982). Plots of p-values to evaluate many tests simultaneously. Biometrika 69(3), 493–502

[67]

Shiraishi, T. (1985). Local powers of two-sample and multi-sample rank tests for Lehmann's contaminated alternative. Ann. Inst. Stat. Math. 37, 519–527

[68]

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73(3), 751–754

[69]

Stoepker, I. V., R. M. Castro, E. Arias-Castro, and E. van den Heuvel (2024). Anomaly detection for a large number of streams: A permutation-based higher criticism approach. J. Am. Stat. Assoc. 119(545), 461–474

[70]

Storey, J. D. (2002). A direct approach to false discovery rates. J. R. Stat. Soc. B 64(3), 479–498

[71]

Storey, J. D., J. E. Taylor, and D. Siegmund (2004). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: A unified approach. J. R. Stat. Soc. B 66(1), 187–205

[72]

Tian, J., X. Chen, E. Katsevich, J. J. Goeman, and A. Ramdas (2023). Large-scale simultaneous inference under dependence. Scand. J. Stat. 50(2), 750–796

[73]

Tibshirani, R. J., R. Foygel Barber, E. Candès, and A. Ramdas (2019). Conformal prediction under covariate shift. In Adv. Neural Inf. Process. Syst., Volume 32

[74]

Vatanen, T., M. Kuusela, E. Malmi, T. Raiko, T. Aaltonen, and Y. Nagai (2012). Semi-supervised detection of collective anomalies with an application in high energy particle physics. In International Joint Conference on Neural Networks, pp. 1–8. IEEE

[75]

Vesely, A., L. Finos, and J. J. Goeman (2023). Permutation-based true discovery guarantee by sum tests. J. R. Stat. Soc. B 85(3), 664–683

[76]

Vovk, V., A. Gammerman, and G. Shafer (2005). Algorithmic learning in a random world, Volume 29. Springer

[77]

Vovk, V., B. Wang, and R. Wang (2022). Admissible ways of merging p-values under arbitrary dependence. Ann. Stat. 50(1), 351–375

[78]

Vovk, V. and R. Wang (2020). Combining p-values via averaging. Biometrika 107(4), 791–808

[79]

Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin 1(6), 80–83