pith. sign in

arxiv: 2605.07115 · v1 · submitted 2026-05-08 · 💻 cs.LG · stat.ML

Conformal-Style Quantile Analyses for Stochastic Bandits

Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords stochastic banditsconformal predictionupper-quantile regretadaptive conformal estimationUCB algorithmregret analysisprediction intervals
0
0 comments X

The pith

ACP-UCB1 achieves logarithmic regret for upper-quantile performance in stochastic bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most stochastic bandit algorithms minimize regret with respect to mean rewards, yet many applications instead favor arms with strong upper-tail behavior. The paper introduces ACP-UCB1, a policy that pairs an adaptive conformal estimate of the upper endpoint of a central prediction interval with a UCB-style optimism bonus. It proves that the policy attains logarithmic upper-quantile regret, with each arm contributing a term of order log n divided by the gap in the quantile metric. The analysis controls the adaptive conformity scores through reward-quantile concentration, a perturbation argument, and deterministic localization of the level. The work also supplies metric-specific regret decompositions that compare ACP-UCB1 directly to classical UCB1 and validates the approach with numerical experiments.

Core claim

We propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint F_j^{-1}(1-α/2) with a UCB-type optimism bonus. The technical challenge that conformity scores are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level is met by reward-quantile concentration, a perturbation argument for the recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution O(log n / Δ_j^ACP). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1.

What carries the argument

ACP-UCB1, the policy that uses an adaptive conformal estimate of the upper prediction-interval endpoint together with a UCB optimism bonus, with the recomputed scores controlled by concentration, perturbation, and localization arguments.

If this is right

  • Each suboptimal arm contributes at most O(log n / Δ_j^ACP) to the upper-quantile regret.
  • Metric-specific regret decompositions allow direct comparison of ACP-UCB1 performance against UCB1 under both mean and upper-quantile criteria.
  • When the upper-tail target induces an arm ranking different from the mean ranking, ACP-UCB1 selects and exploits a different optimal arm than mean-based policies.
  • Numerical experiments confirm performance gains precisely in the regime where tail and mean rankings diverge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conformal adaptation mechanism could be applied to lower-tail or other fixed-quantile targets by changing the miscoverage parameter α.
  • The perturbation argument for recomputed conformity scores may transfer to other online problems that repeatedly update quantile estimates inside a decision loop.
  • The approach opens a route to risk-sensitive bandits that optimize distributional properties beyond a single quantile level.

Load-bearing premise

Conformity scores recomputed from evolving empirical quantile estimates can be controlled via reward-quantile concentration, a perturbation argument, and deterministic localization of the adaptive level.

What would settle it

An instance in which upper-quantile regret grows faster than logarithmically, for example linearly in n, under reward distributions that violate the quantile concentration bounds used to control the adaptive scores.

Figures

Figures reproduced from arXiv: 2605.07115 by Chengyu Du, Mengfan Xu.

Figure 1
Figure 1. Figure 1: Metric comparison on a Gaussian objective-mismatch instance. [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Final regret as the nominal level α varies. 0 2000 4000 6000 8000 10000 Round 0 200 400 600 800 Cumulative ACP-Regret Gaussian ACP-UCB1 UCB1 0 2000 4000 6000 8000 10000 Round 0 500 1000 1500 Cumulative ACP-Regret Student-t 0 2000 4000 6000 8000 10000 Round 0 200 400 600 800 1000 1200 Cumulative ACP-Regret Skewed Student-t Distributional robustness: ACP-regret over time [PITH_FULL_IMAGE:figures/full_fig_p0… view at source ↗
Figure 3
Figure 3. Figure 3: Distributional robustness under cumulative ACP-regret. [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1}(1-\alpha/2)\) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution \(O(\nicefrac{\log n}{\Delta_j^{\mathrm{ACP}}})\). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes ACP-UCB1, a stochastic bandit algorithm targeting upper-quantile performance rather than mean rewards. It employs an adaptive conformal prediction estimate of the upper endpoint F_j^{-1}(1-α/2) combined with a UCB-style optimism bonus on conformity scores that are recomputed each round from evolving empirical quantile estimates and evaluated at an adaptive miscoverage level. The analysis controls this endpoint via reward-quantile concentration, a perturbation argument on the recomputed score quantiles, and deterministic localization of the adaptive level. The central result is a logarithmic upper-quantile regret bound with per-arm contribution O(log n / Δ_j^ACP), accompanied by metric-specific regret decompositions versus UCB1 and numerical experiments.

Significance. If the control arguments hold, the work provides a principled extension of conformal methods to bandit settings with quantile objectives, which are relevant for risk-sensitive or tail-focused applications where mean-based ranking fails. The achievement of classical logarithmic regret rates under this metric, together with explicit decompositions comparing ACP-UCB1 to UCB1, offers a clean theoretical comparison. The adaptive-level mechanism is a distinctive technical feature.

major comments (2)
  1. [Abstract] Abstract (technical challenge paragraph): the perturbation argument for recomputed conformity-score quantiles must hold uniformly over all histories, including those with o(log t) pulls per arm. The manuscript outline does not specify the continuity modulus or Lipschitz constant used, nor does it verify that the modulus remains controlled before the estimator stabilizes; if it degrades, the accumulated error in the UCB bonus can exceed the claimed O(log n / Δ_j^ACP) per-arm term.
  2. [Proof of the main regret bound] Proof of the main regret bound (technical section following the abstract): the deterministic localization of the adaptive level is invoked only after estimator stabilization, but the early-round contribution to regret is not separately bounded. Without an explicit high-probability control on the perturbation error for t ≪ n when sample counts are still small, the overall logarithmic bound does not follow from the three listed ingredients.
minor comments (2)
  1. [Abstract] The notation Δ_j^ACP is introduced without an explicit definition in the abstract; a one-line reminder of its meaning (gap between the target quantile and the next-best arm) would improve readability.
  2. [Abstract] The abstract states that 'metric-specific regret decompositions' are provided, yet no indication is given of whether these are additive or multiplicative and how they differ from the standard UCB1 decomposition; a brief comparison table or equation reference would clarify the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about uniformity and early-round control that require explicit clarification in the manuscript. We address each major comment below and will make the corresponding revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract (technical challenge paragraph): the perturbation argument for recomputed conformity-score quantiles must hold uniformly over all histories, including those with o(log t) pulls per arm. The manuscript outline does not specify the continuity modulus or Lipschitz constant used, nor does it verify that the modulus remains controlled before the estimator stabilizes; if it degrades, the accumulated error in the UCB bonus can exceed the claimed O(log n / Δ_j^ACP) per-arm term.

    Authors: We agree that uniformity of the perturbation argument over all histories must be stated explicitly, including regimes with few pulls per arm. Under the bounded-reward assumption the quantile map is Lipschitz with constant 1/α (which we will record in the revision). We will add a dedicated lemma establishing a uniform high-probability bound on the recomputed-score perturbation that holds for every t ≥ 1 and every history; the bound is obtained by applying the same reward-quantile concentration inequality used later in the proof, together with the deterministic localization of the adaptive level that is active from round 1. The resulting error term is absorbed into the O(log n / Δ_j^ACP) per-arm contribution without inflation. revision: yes

  2. Referee: [Proof of the main regret bound] Proof of the main regret bound (technical section following the abstract): the deterministic localization of the adaptive level is invoked only after estimator stabilization, but the early-round contribution to regret is not separately bounded. Without an explicit high-probability control on the perturbation error for t ≪ n when sample counts are still small, the overall logarithmic bound does not follow from the three listed ingredients.

    Authors: The referee is correct that the early-round contribution must be controlled separately before invoking stabilization. In the revised proof we will insert an explicit burn-in analysis: for t ≤ T_0 (where T_0 is a polynomial in log n chosen so that each arm has been pulled Ω(log n) times with high probability), we bound the per-round perturbation error crudely by a constant (using only boundedness of rewards and the fact that the adaptive level is localized deterministically). The total contribution of this finite-length phase is O(log n) and is absorbed into the leading logarithmic term. After T_0 the three ingredients (quantile concentration, perturbation control, and deterministic localization) combine exactly as stated to yield the claimed per-arm bound. revision: yes

Circularity Check

0 steps flagged

No circularity: regret bound derived from external concentration inequalities and adaptive arguments

full rationale

The paper's central claim is that ACP-UCB1 achieves O(log n / Δ_j^ACP) upper-quantile regret by controlling recomputed conformity scores via reward-quantile concentration inequalities, a perturbation argument, and deterministic localization of the adaptive level. These are standard external tools (Hoeffding-type bounds, continuity moduli, and deterministic pinning) that do not reduce the target regret expression to a fitted parameter or self-citation by construction. No equations in the provided abstract or description equate the final bound to an input fit; the derivation chain remains independent of the result itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard concentration properties for stochastic rewards and the validity of adaptive conformal estimation under recomputed scores; no free parameters or invented entities are explicitly introduced.

axioms (2)
  • domain assumption Reward distributions admit suitable quantile concentration inequalities
    Invoked to control the upper-endpoint estimates and the adaptive conformal scores.
  • domain assumption Perturbation and deterministic localization arguments suffice for recomputed conformity scores
    Used to handle the technical challenge of evolving empirical quantiles and adaptive levels.

pith-pipeline@v0.9.0 · 5508 in / 1120 out tokens · 37843 ms · 2026-05-11T00:57:32.430183+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages

  1. [1]

    Angelopoulos and Stephen Bates

    Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. Foundations and Trends in Machine Learning, 16 0 (4): 0 494--591, 2023

  2. [2]

    Angelopoulos, Rina Foygel Barber, and Stephen Bates

    Anastasios N. Angelopoulos, Rina Foygel Barber, and Stephen Bates. Online conformal prediction with decaying step sizes. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024

  3. [3]

    Exploration--exploitation tradeoff using variance estimates in multi-armed bandits

    Jean-Yves Audibert, R \'e mi Munos, and Csaba Szepesv \'a ri. Exploration--exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410 0 (19): 0 1876--1902, 2009

  4. [4]

    Finite-time analysis of the multiarmed bandit problem

    Peter Auer, Nicol \`o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47 0 (2--3): 0 235--256, 2002

  5. [5]

    Cand \`e s, Aaditya Ramdas, and Ryan J

    Rina Foygel Barber, Emmanuel J. Cand \`e s, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51 0 (2): 0 816--845, 2023

  6. [6]

    Conformal bandits: Bringing statistical validity and reward efficiency to the small-gap regime, 2025

    Simone Cuonzo and Nina Deliu. Conformal bandits: Bringing statistical validity and reward efficiency to the small-gap regime, 2025. arXiv:2512.09850

  7. [7]

    Pure exploration for max-quantile bandits

    Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Machine Learning and Knowledge Discovery in Databases, volume 9851 of Lecture Notes in Computer Science, pages 556--571. Springer, 2016

  8. [8]

    Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator

    Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27 0 (3): 0 642--669, 1956

  9. [9]

    Exploration vs exploitation vs safety: Risk-aware multi-armed bandits

    Nicolas Galichet, Mich \`e le Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Proceedings of the 5th Asian Conference on Machine Learning, volume 29 of Proceedings of Machine Learning Research, pages 245--260. PMLR, 2013

  10. [10]

    The KL-UCB algorithm for bounded stochastic bandits and beyond

    Aur \'e lien Garivier and Olivier Capp \'e . The KL-UCB algorithm for bounded stochastic bandits and beyond. Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011

  11. [11]

    Stochastic online conformal prediction with semi-bandit feedback

    Haosen Ge, Hamsa Bastani, and Osbert Bastani. Stochastic online conformal prediction with semi-bandit feedback. In Proceedings of the 42nd International Conference on Machine Learning, 2025. ICML 2025 poster

  12. [12]

    Cand \`e s

    Isaac Gibbs and Emmanuel J. Cand \`e s. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, volume 34, 2021

  13. [13]

    Probability inequalities for sums of bounded random variables

    Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58 0 (301): 0 13--30, 1963

  14. [14]

    Asymptotically efficient adaptive allocation rules

    Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6 0 (1): 0 4--22, 1985

  15. [15]

    Bandit Algorithms

    Tor Lattimore and Csaba Szepesv \'a ri. Bandit Algorithms. Cambridge University Press, 2020

  16. [16]

    Tibshirani, and Larry Wasserman

    Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113 0 (523): 0 1094--1111, 2018

  17. [17]

    The tight constant in the D voretzky-- K iefer-- W olfowitz inequality

    Pascal Massart. The tight constant in the D voretzky-- K iefer-- W olfowitz inequality. The Annals of Probability, 18 0 (3): 0 1269--1283, 1990

  18. [18]

    Nikolakakis, Dionysios S

    Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Or Sheffet, and Anand D. Sarwate. Quantile multi-armed bandits: Optimal best-arm identification and a differentially private scheme. IEEE Journal on Selected Areas in Information Theory, 2 0 (2): 0 534--548, 2021. doi:10.1109/JSAIT.2021.3081525

  19. [19]

    Cand \`e s

    Yaniv Romano, Evan Patterson, and Emmanuel J. Cand \`e s. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32, 2019

  20. [20]

    Risk-aversion in multi-armed bandits

    Amir Sani, Alessandro Lazaric, and R \'e mi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, volume 25, 2012

  21. [21]

    Qualitative multi-armed bandits: A quantile-based approach

    Balazs Szorenyi, R \'o bert Busa-Fekete, Paul Weng, and Eyke H \"u llermeier. Qualitative multi-armed bandits: A quantile-based approach. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1660--1668. PMLR, 2015

  22. [22]

    Thompson

    William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25 0 (3--4): 0 285--294, 1933

  23. [23]

    Algorithmic Learning in a Random World

    Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005

  24. [24]

    Adaptive conformal predictions for time series

    Margaux Zaffran, Aymeric Dieuleveut, Olivier F \'e ron, Yannig Goude, and Julie Josse. Adaptive conformal predictions for time series. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022

  25. [25]

    2025 , eprint =

    Cuonzo, Simone and Deliu, Nina , title =. 2025 , eprint =

  26. [26]

    Finite-time Analysis of the Multiarmed Bandit Problem , journal =

    Auer, Peter and Cesa-Bianchi, Nicol. Finite-time Analysis of the Multiarmed Bandit Problem , journal =

  27. [27]

    Adaptive Conformal Inference under Distribution Shift , booktitle =

    Gibbs, Isaac and Cand. Adaptive Conformal Inference under Distribution Shift , booktitle =

  28. [28]

    Bandit Algorithms , publisher =

    Lattimore, Tor and Szepesv. Bandit Algorithms , publisher =

  29. [29]

    Advances in Applied Mathematics , volume =

    Lai, Tze Leung and Robbins, Herbert , title =. Advances in Applied Mathematics , volume =

  30. [30]

    The Annals of Probability , volume =

    Massart, Pascal , title =. The Annals of Probability , volume =

  31. [31]

    and Barber, Rina Foygel and Bates, Stephen , title =

    Angelopoulos, Anastasios N. and Barber, Rina Foygel and Bates, Stephen , title =. Proceedings of the 41st International Conference on Machine Learning (ICML) , year =

  32. [32]

    Garivier, Aur. The. Proceedings of the 24th Annual Conference on Learning Theory (COLT) , year =

  33. [33]

    Exploration--Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits , journal =

    Audibert, Jean-Yves and Munos, R. Exploration--Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits , journal =

  34. [34]

    , title =

    Thompson, William R. , title =. Biometrika , volume =

  35. [35]

    Vovk, Vladimir and Gammerman, Alex and Shafer, Glenn , title =

  36. [36]

    and Wasserman, Larry , title =

    Lei, Jing and G'Sell, Max and Rinaldo, Alessandro and Tibshirani, Ryan J. and Wasserman, Larry , title =. Journal of the American Statistical Association , volume =

  37. [37]

    and Bates, Stephen , title =

    Angelopoulos, Anastasios N. and Bates, Stephen , title =. Foundations and Trends in Machine Learning , volume =

  38. [38]

    Adaptive Conformal Predictions for Time Series , booktitle =

    Zaffran, Margaux and Dieuleveut, Aymeric and F. Adaptive Conformal Predictions for Time Series , booktitle =

  39. [39]

    Conformal Prediction Beyond Exchangeability , journal =

    Barber, Rina Foygel and Cand. Conformal Prediction Beyond Exchangeability , journal =

  40. [40]

    Bandits with Heavy Tail , journal =

    Bubeck, S. Bandits with Heavy Tail , journal =

  41. [41]

    Risk-Aversion in Multi-armed Bandits , booktitle =

    Sani, Amir and Lazaric, Alessandro and Munos, R. Risk-Aversion in Multi-armed Bandits , booktitle =

  42. [42]

    Qualitative Multi-Armed Bandits: A Quantile-Based Approach , booktitle =

    Szorenyi, Balazs and Busa-Fekete, R. Qualitative Multi-Armed Bandits: A Quantile-Based Approach , booktitle =

  43. [43]

    and Kalogerias, Dionysios S

    Nikolakakis, Konstantinos E. and Kalogerias, Dionysios S. and Sheffet, Or and Sarwate, Anand D. , title =. IEEE Journal on Selected Areas in Information Theory , volume =. 2021 , doi =

  44. [44]

    Conformalized Quantile Regression , booktitle =

    Romano, Yaniv and Patterson, Evan and Cand. Conformalized Quantile Regression , booktitle =

  45. [45]

    Proceedings of the 42nd International Conference on Machine Learning , year =

    Ge, Haosen and Bastani, Hamsa and Bastani, Osbert , title =. Proceedings of the 42nd International Conference on Machine Learning , year =

  46. [46]

    Journal of the American Statistical Association , volume=

    Probability inequalities for sums of bounded random variables , author=. Journal of the American Statistical Association , volume=

  47. [47]

    The Annals of Mathematical Statistics , volume=

    Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator , author=. The Annals of Mathematical Statistics , volume=

  48. [48]

    Machine Learning and Knowledge Discovery in Databases , series =

    David, Yahel and Shimkin, Nahum , title =. Machine Learning and Knowledge Discovery in Databases , series =

  49. [49]

    Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits , booktitle =

    Galichet, Nicolas and Sebag, Mich. Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits , booktitle =