Extreme bandits

Alexandra Carpentier; Michal Valko

arxiv: 2604.24545 · v1 · submitted 2026-04-27 · 📊 stat.ML · cs.LG

Extreme bandits

Alexandra Carpentier , Michal Valko This is my paper

Pith reviewed 2026-05-08 01:14 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords extreme regretheaviest taillimited feedbacksequential allocationExtremeHunter algorithmbandit algorithmsoutlier detectionheavy-tailed distributions

0 comments

The pith

The ExtremeHunter algorithm minimizes extreme regret to detect the source with the heaviest tail under limited feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies sequential allocation of limited resources across sources to identify extreme outputs, as needed in medicine, security, and related fields. Standard bandit methods optimize regret against the best average reward, but here the target is extreme regret measured against an oracle that always picks the source with the heaviest tail. The work introduces the ExtremeHunter algorithm, gives its theoretical analysis, and shows empirical results on both synthetic and real-world data.

Core claim

The paper establishes that extreme regret, defined as the performance gap to the oracle policy that selects the source with the heaviest tail, can be minimized in a limited-feedback sequential setting through the proposed ExtremeHunter algorithm and its accompanying analysis.

What carries the argument

The ExtremeHunter algorithm, an adaptive selection rule that sequentially queries sources to approximate the heaviest-tail oracle and thereby reduce extreme regret.

If this is right

Extreme regret serves as a well-defined performance criterion for extreme-value detection tasks.
The algorithm yields theoretical bounds on extreme regret under the limited-feedback model.
Empirical tests confirm practical gains on synthetic distributions and real intrusion-detection data.
The approach applies directly to resource allocation in medicine, security, and life sciences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same limited-feedback setup could support other non-mean objectives such as tail-risk minimization.
Integration with classical extreme-value theory tools might improve the oracle approximation step.
The method suggests that bandit algorithms can be redesigned around distribution shape rather than location parameters.

Load-bearing premise

The underlying distributions admit a heaviest tail that can be identified and exploited through sequential limited observations so that extreme regret becomes a minimizable quantity.

What would settle it

If ExtremeHunter fails to produce lower extreme regret than uniform random sampling on a dataset containing one clearly heaviest-tailed source, the central claim would not hold.

Figures

Figures reproduced from arXiv: 2604.24545 by Alexandra Carpentier, Michal Valko.

**Figure 1.** Figure 1: Extreme regret as a function of time for the exact Pareto distributions (left), approximate view at source ↗

read the original abstract

In many areas of medicine, security, and life sciences, we want to allocate limited resources to different sources in order to detect extreme values. In this paper, we study an efficient way to allocate these resources sequentially under limited feedback. While sequential design of experiments is well studied in bandit theory, the most commonly optimized property is the regret with respect to the maximum mean reward. However, in other problems such as network intrusion detection, we are interested in detecting the most extreme value output by the sources. Therefore, in our work we study extreme regret which measures the efficiency of an algorithm compared to the oracle policy selecting the source with the heaviest tail. We propose the ExtremeHunter algorithm, provide its analysis, and evaluate it empirically on synthetic and real-world experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines extreme regret to target heaviest tails instead of means and gives ExtremeHunter, but the analysis may not hold without tail assumptions that limited feedback cannot guarantee.

read the letter

The main thing to know is that this paper replaces standard mean regret with extreme regret, which measures how well you find the arm with the heaviest tail rather than the highest average reward, and proposes the ExtremeHunter algorithm to do it under bandit feedback. The motivation from security and life-science settings where extremes matter is clear and practical. They also run synthetic and real-world experiments, which gives some evidence the approach works in those cases. That is the actual new piece: a tail-focused objective and a policy aimed at it, instead of another mean-regret variant. The analysis is claimed to exist, but the abstract gives no bounds or proof steps, so the strength depends on what is shown in the full text. The stress-test concern is worth checking: if the distributions are only in the domain of attraction of an extreme-value law with no known index or regular variation, consistent identification of the heaviest tail from single samples per round could require more observations than the horizon allows. If the paper supplies only parametric or oracle-based bounds, the nonparametric claim would not hold up. This is for readers who already work on non-standard bandit objectives or tail risks in applied domains. A serious referee should see it because the objective is well-motivated and the empirical side is present, even if the theory needs explicit handling of the estimation difficulty under limited feedback.

Referee Report

2 major / 1 minor

Summary. The paper studies a variant of the multi-armed bandit problem focused on extreme values rather than means. It defines 'extreme regret' as the performance gap relative to an oracle policy that always selects the arm with the heaviest tail. The authors introduce the ExtremeHunter algorithm, claim to provide its theoretical analysis (including regret bounds), and report empirical results on both synthetic and real-world data.

Significance. If the analysis establishes sublinear extreme regret under limited feedback without strong parametric assumptions on the tails, the work would meaningfully extend bandit theory to tail-focused objectives relevant to intrusion detection, medicine, and security applications. The inclusion of real-world experiments is a positive step toward practicality. However, the significance is currently limited by the need to verify that the guarantees hold for general distributions in the domain of attraction of extreme-value laws.

major comments (2)

[Theoretical analysis section] The analysis of ExtremeHunter (presumably in the theoretical section following the algorithm definition) claims sublinear extreme regret by sequentially approximating the heaviest-tail oracle. This appears to require consistent identification of the heaviest tail from single draws per round. For distributions belonging only to the domain of attraction without known tail index or second-order regular variation, the sample complexity for reliable distinction may exceed the horizon, rendering the bound non-sublinear. Please state the precise regret bound, all assumptions (including any tail regularity conditions), and whether the result is fully nonparametric.
[Problem formulation / extreme regret definition] The definition of extreme regret and the oracle policy (likely in the problem formulation section) assumes the heaviest tail is uniquely identifiable. When multiple arms have comparable tails, the oracle may not be unique, which could make the regret measure ill-defined or the comparison to the oracle ambiguous. Clarify how the regret is computed in such cases and whether the analysis accounts for it.

minor comments (1)

[Experiments] The experimental section would benefit from explicit details on how synthetic data are generated (e.g., specific tail indices or distributions used) and the exact real-world datasets, including preprocessing steps for measuring extremes, to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. We address each major comment in turn below and will prepare a revised manuscript that incorporates the requested clarifications.

read point-by-point responses

Referee: [Theoretical analysis section] The analysis of ExtremeHunter (presumably in the theoretical section following the algorithm definition) claims sublinear extreme regret by sequentially approximating the heaviest-tail oracle. This appears to require consistent identification of the heaviest tail from single draws per round. For distributions belonging only to the domain of attraction without known tail index or second-order regular variation, the sample complexity for reliable distinction may exceed the horizon, rendering the bound non-sublinear. Please state the precise regret bound, all assumptions (including any tail regularity conditions), and whether the result is fully nonparametric.

Authors: We agree that the current presentation would benefit from greater explicitness. The analysis of ExtremeHunter (Section 3) establishes a sublinear extreme-regret bound under the assumption that each arm distribution lies in the domain of attraction of an extreme-value law and satisfies a second-order regular-variation condition with a positive tail index. These conditions ensure that the sequential samples collected by the algorithm suffice for consistent ordering of the tails within the horizon. The result is therefore not fully nonparametric. In the revision we will (i) restate the exact regret bound from the main theorem, (ii) list every assumption in a dedicated paragraph immediately preceding the theorem, and (iii) add an explicit remark that the guarantees do not extend to distributions lacking the second-order regular-variation condition. revision: yes
Referee: [Problem formulation / extreme regret definition] The definition of extreme regret and the oracle policy (likely in the problem formulation section) assumes the heaviest tail is uniquely identifiable. When multiple arms have comparable tails, the oracle may not be unique, which could make the regret measure ill-defined or the comparison to the oracle ambiguous. Clarify how the regret is computed in such cases and whether the analysis accounts for it.

Authors: We acknowledge the potential ambiguity. The current definition takes the oracle to be any arm whose tail index equals the maximal index; when several arms share this maximal index the extreme regret is measured against the best attainable extreme value among those tied arms. The analysis in Section 3 already covers this case by using the minimal possible extreme value over the set of maximal-tail arms. Nevertheless, the manuscript does not spell out the tie-breaking rule explicitly. We will add a short paragraph in the problem-formulation section that defines the oracle set and states that regret is computed with respect to the best member of that set. revision: yes

Circularity Check

0 steps flagged

No circularity: extreme regret and ExtremeHunter derived independently from oracle definition

full rationale

The paper defines extreme regret directly as the performance gap to an oracle that always selects the heaviest-tailed source. It then introduces ExtremeHunter as a new algorithm for sequentially approximating this oracle under bandit feedback and analyzes its regret. No equation or step reduces the claimed sublinear bound or algorithm to a fitted parameter, self-citation, or renamed input by construction. The derivation chain is self-contained against the new regret definition and does not rely on prior fitted quantities from the same authors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review prevents exhaustive extraction; the central claim rests on the existence of a heaviest-tail oracle and on the ability to bound extreme regret under limited feedback, both treated as domain assumptions rather than derived.

axioms (2)

domain assumption There exists an oracle policy that always selects the source with the heaviest tail.
Invoked when defining extreme regret relative to the oracle.
domain assumption Extreme regret is a meaningful and analyzable performance measure under sequential limited feedback.
Required for the algorithm analysis claimed in the abstract.

pith-pipeline@v0.9.0 · 5410 in / 1237 out tokens · 48225 ms · 2026-05-08T01:14:30.998728+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

Outlier Detection by Active Learning

Naoki Abe, Bianca Zadrozny, and John Langford. Outlier Detection by Active Learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 504–509, 2006

work page 2006
[2]

Mix- ture models of endhost network traffic

John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft, and Daniel Ting. Mix- ture models of endhost network traffic. InIEEE International Conference on Computer Com- munications (INFOCOM),, pages 225–229

work page
[3]

Finite-time Analysis of the Multiarmed Bandit Problem.Machine Learning, 47(2-3):235–256, 2002

Peter Auer, Nicol `o Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem.Machine Learning, 47(2-3):235–256, 2002

work page 2002
[4]

Bandits With Heavy Tail.IEEE Transactions on Information Theory, 59(11):7711–7717, 2013

S ´ebastien Bubeck, Nicol `o Cesa-Bianchi, and G ´abor Lugosi. Bandits With Heavy Tail.IEEE Transactions on Information Theory, 59(11):7711–7717, 2013

work page 2013
[5]

Pure Exploration in Multi-armed Bandits Problems.Algorithmic Learning Theory, pages 23–37, 2009

S ´ebastien Bubeck, R ´emi Munos, and Gilles Stoltz. Pure Exploration in Multi-armed Bandits Problems.Algorithmic Learning Theory, pages 23–37, 2009

work page 2009
[6]

Alexandra Carpentier and Arlene K. H. Kim. Adaptive and minimax optimal estimation of the tail coefficient.Statistica Sinica, 2014

work page 2014
[7]

Alexandra Carpentier and Arlene K. H. Kim. Honest and adaptive confidence interval for the tail coefficient in the Pareto model.Electronic Journal of Statistics, 2014

work page 2014
[8]

Anomaly detection: A survey.ACM Computing Surveys, 41(3):15:1–15:58, July 2009

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey.ACM Computing Surveys, 41(3):15:1–15:58, July 2009

work page 2009
[9]

Cicirello and Stephen F

Vincent A. Cicirello and Stephen F. Smith. The max k-armed bandit: A new model of explo- ration applied to search heuristic selection.AAAI Conference on Artificial Intelligence (AAAI), 2005

work page 2005
[10]

Springer Series in Operations Research and Financial Engineering

Laurens de Haan and Ana Ferreira.Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering. Springer, 2006

work page 2006
[11]

Limiting forms of the frequency distribution of the largest or smallest member of a sample.Mathematical Proceedings of the Cambridge Philosophical Society, 24:180, 1928

Ronald Aylmer Fisher and Leonard Henry Caleb Tippett. Limiting forms of the frequency distribution of the largest or smallest member of a sample.Mathematical Proceedings of the Cambridge Philosophical Society, 24:180, 1928

work page 1928
[12]

Sur la distribution limite du terme maximum d’une s ´erie al ´eatoire.The Annals of Mathematics, 44(3):423–453, 1943

Boris Gnedenko. Sur la distribution limite du terme maximum d’une s ´erie al ´eatoire.The Annals of Mathematics, 44(3):423–453, 1943

work page 1943
[13]

Peter Hall and Alan H. Welsh. Best Attainable Rates of Convergence for Estimates of Param- eters of Regular Variation.The Annals of Statistics, 12(3):1079–1084, 1984

work page 1984
[14]

Lai and Herbert Robbins

Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.Advances in Applied Mathematics, 6(1):4–22, 1985

work page 1985
[15]

Dynamic Intrusion Detection in Resource-Constrained Cyber Net- works

Keqin Liu and Qing Zhao. Dynamic Intrusion Detection in Resource-Constrained Cyber Net- works. InIEEE International Symposium on Information Theory (ISIT), 2012

work page 2012
[16]

Novelty detection: a review, part 1: statistical approaches

Markos Markou and Sameer Singh. Novelty detection: a review, part 1: statistical approaches. Signal Processing, 83(12):2481–2497, 2003

work page 2003
[17]

Neill and Gregory F

Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization.Machine Learning, 79:261–282, 2010

work page 2010
[18]

Priebe, John M

Carey E. Priebe, John M. Conroy, David J. Marchette, and Youngser Park. Scan Statistics on Enron Graphs. InComputational and Mathematical Organization Theory, volume 11, pages 229–247, 2005

work page 2005
[19]

A Classification Framework for Anomaly Detec- tion.Journal of Machine Learning Research, 6:211–232, 2005

Ingo Steinwart, Don Hush, and Clint Scovel. A Classification Framework for Anomaly Detec- tion.Journal of Machine Learning Research, 6:211–232, 2005

work page 2005
[20]

Streeter and Stephen F

Matthew J. Streeter and Stephen F. Smith. A Simple Distribution-Free Approach to the Max k- Armed Bandit Problem. InInternational Conference on Principles and Practice of Constraint Programming (CP), volume 4204, pages 560–574, 2006

work page 2006
[21]

Streeter and Stephen F

Matthew J. Streeter and Stephen F. Smith. An Asymptotically Optimal Algorithm for the Max k-Armed Bandit Problem. InAAAI Conference on Artificial Intelligence (AAAI), pages 135–142, 2006

work page 2006
[22]

Fast online anomaly detection using scan statistics.IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2010

Ryan Turner, Zoubin Ghahramani, and Steven Bottone. Fast online anomaly detection using scan statistics.IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2010. 9 A Proof of Lemma 3 Lemma 3.Assume thatX 1, . . . , XT areTi.i.d. samples drawn according to(α, β, C, C ′)-second order Pareto distribution, then for anyx≥B: P max i Xi ≤x −exp −T Cx...

work page 2010

[1] [1]

Outlier Detection by Active Learning

Naoki Abe, Bianca Zadrozny, and John Langford. Outlier Detection by Active Learning. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 504–509, 2006

work page 2006

[2] [2]

Mix- ture models of endhost network traffic

John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft, and Daniel Ting. Mix- ture models of endhost network traffic. InIEEE International Conference on Computer Com- munications (INFOCOM),, pages 225–229

work page

[3] [3]

Finite-time Analysis of the Multiarmed Bandit Problem.Machine Learning, 47(2-3):235–256, 2002

Peter Auer, Nicol `o Cesa-Bianchi, and Paul Fischer. Finite-time Analysis of the Multiarmed Bandit Problem.Machine Learning, 47(2-3):235–256, 2002

work page 2002

[4] [4]

Bandits With Heavy Tail.IEEE Transactions on Information Theory, 59(11):7711–7717, 2013

S ´ebastien Bubeck, Nicol `o Cesa-Bianchi, and G ´abor Lugosi. Bandits With Heavy Tail.IEEE Transactions on Information Theory, 59(11):7711–7717, 2013

work page 2013

[5] [5]

Pure Exploration in Multi-armed Bandits Problems.Algorithmic Learning Theory, pages 23–37, 2009

S ´ebastien Bubeck, R ´emi Munos, and Gilles Stoltz. Pure Exploration in Multi-armed Bandits Problems.Algorithmic Learning Theory, pages 23–37, 2009

work page 2009

[6] [6]

Alexandra Carpentier and Arlene K. H. Kim. Adaptive and minimax optimal estimation of the tail coefficient.Statistica Sinica, 2014

work page 2014

[7] [7]

Alexandra Carpentier and Arlene K. H. Kim. Honest and adaptive confidence interval for the tail coefficient in the Pareto model.Electronic Journal of Statistics, 2014

work page 2014

[8] [8]

Anomaly detection: A survey.ACM Computing Surveys, 41(3):15:1–15:58, July 2009

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey.ACM Computing Surveys, 41(3):15:1–15:58, July 2009

work page 2009

[9] [9]

Cicirello and Stephen F

Vincent A. Cicirello and Stephen F. Smith. The max k-armed bandit: A new model of explo- ration applied to search heuristic selection.AAAI Conference on Artificial Intelligence (AAAI), 2005

work page 2005

[10] [10]

Springer Series in Operations Research and Financial Engineering

Laurens de Haan and Ana Ferreira.Extreme Value Theory: An Introduction. Springer Series in Operations Research and Financial Engineering. Springer, 2006

work page 2006

[11] [11]

Limiting forms of the frequency distribution of the largest or smallest member of a sample.Mathematical Proceedings of the Cambridge Philosophical Society, 24:180, 1928

Ronald Aylmer Fisher and Leonard Henry Caleb Tippett. Limiting forms of the frequency distribution of the largest or smallest member of a sample.Mathematical Proceedings of the Cambridge Philosophical Society, 24:180, 1928

work page 1928

[12] [12]

Sur la distribution limite du terme maximum d’une s ´erie al ´eatoire.The Annals of Mathematics, 44(3):423–453, 1943

Boris Gnedenko. Sur la distribution limite du terme maximum d’une s ´erie al ´eatoire.The Annals of Mathematics, 44(3):423–453, 1943

work page 1943

[13] [13]

Peter Hall and Alan H. Welsh. Best Attainable Rates of Convergence for Estimates of Param- eters of Regular Variation.The Annals of Statistics, 12(3):1079–1084, 1984

work page 1984

[14] [14]

Lai and Herbert Robbins

Tze L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules.Advances in Applied Mathematics, 6(1):4–22, 1985

work page 1985

[15] [15]

Dynamic Intrusion Detection in Resource-Constrained Cyber Net- works

Keqin Liu and Qing Zhao. Dynamic Intrusion Detection in Resource-Constrained Cyber Net- works. InIEEE International Symposium on Information Theory (ISIT), 2012

work page 2012

[16] [16]

Novelty detection: a review, part 1: statistical approaches

Markos Markou and Sameer Singh. Novelty detection: a review, part 1: statistical approaches. Signal Processing, 83(12):2481–2497, 2003

work page 2003

[17] [17]

Neill and Gregory F

Daniel B. Neill and Gregory F. Cooper. A multivariate Bayesian scan statistic for early event detection and characterization.Machine Learning, 79:261–282, 2010

work page 2010

[18] [18]

Priebe, John M

Carey E. Priebe, John M. Conroy, David J. Marchette, and Youngser Park. Scan Statistics on Enron Graphs. InComputational and Mathematical Organization Theory, volume 11, pages 229–247, 2005

work page 2005

[19] [19]

A Classification Framework for Anomaly Detec- tion.Journal of Machine Learning Research, 6:211–232, 2005

Ingo Steinwart, Don Hush, and Clint Scovel. A Classification Framework for Anomaly Detec- tion.Journal of Machine Learning Research, 6:211–232, 2005

work page 2005

[20] [20]

Streeter and Stephen F

Matthew J. Streeter and Stephen F. Smith. A Simple Distribution-Free Approach to the Max k- Armed Bandit Problem. InInternational Conference on Principles and Practice of Constraint Programming (CP), volume 4204, pages 560–574, 2006

work page 2006

[21] [21]

Streeter and Stephen F

Matthew J. Streeter and Stephen F. Smith. An Asymptotically Optimal Algorithm for the Max k-Armed Bandit Problem. InAAAI Conference on Artificial Intelligence (AAAI), pages 135–142, 2006

work page 2006

[22] [22]

Fast online anomaly detection using scan statistics.IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2010

Ryan Turner, Zoubin Ghahramani, and Steven Bottone. Fast online anomaly detection using scan statistics.IEEE Workshop on Machine Learning for Signal Processing (MLSP), 2010. 9 A Proof of Lemma 3 Lemma 3.Assume thatX 1, . . . , XT areTi.i.d. samples drawn according to(α, β, C, C ′)-second order Pareto distribution, then for anyx≥B: P max i Xi ≤x −exp −T Cx...

work page 2010