Conformal-Style Quantile Analyses for Stochastic Bandits
Pith reviewed 2026-05-11 00:57 UTC · model grok-4.3
The pith
ACP-UCB1 achieves logarithmic regret for upper-quantile performance in stochastic bandits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint F_j^{-1}(1-α/2) with a UCB-type optimism bonus. The technical challenge that conformity scores are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level is met by reward-quantile concentration, a perturbation argument for the recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution O(log n / Δ_j^ACP). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1.
What carries the argument
ACP-UCB1, the policy that uses an adaptive conformal estimate of the upper prediction-interval endpoint together with a UCB optimism bonus, with the recomputed scores controlled by concentration, perturbation, and localization arguments.
If this is right
- Each suboptimal arm contributes at most O(log n / Δ_j^ACP) to the upper-quantile regret.
- Metric-specific regret decompositions allow direct comparison of ACP-UCB1 performance against UCB1 under both mean and upper-quantile criteria.
- When the upper-tail target induces an arm ranking different from the mean ranking, ACP-UCB1 selects and exploits a different optimal arm than mean-based policies.
- Numerical experiments confirm performance gains precisely in the regime where tail and mean rankings diverge.
Where Pith is reading between the lines
- The same conformal adaptation mechanism could be applied to lower-tail or other fixed-quantile targets by changing the miscoverage parameter α.
- The perturbation argument for recomputed conformity scores may transfer to other online problems that repeatedly update quantile estimates inside a decision loop.
- The approach opens a route to risk-sensitive bandits that optimize distributional properties beyond a single quantile level.
Load-bearing premise
Conformity scores recomputed from evolving empirical quantile estimates can be controlled via reward-quantile concentration, a perturbation argument, and deterministic localization of the adaptive level.
What would settle it
An instance in which upper-quantile regret grows faster than logarithmically, for example linearly in n, under reward distributions that violate the quantile concentration bounds used to control the adaptive scores.
Figures
read the original abstract
Stochastic bandit algorithms are usually analyzed under a mean-reward criterion, yet many problems favor arms with strong upper-tail performance, which we study herein. For a fixed miscoverage level \(\alpha\), the natural upper-tail target of arm \(j\) is the upper endpoint \(F_j^{-1}(1-\alpha/2)\) of a central prediction interval. This target can rank arms differently from their means, creating a central mismatch with the classical bandit objective. To this end, we propose ACP-UCB1, a conformal-style policy that combines an adaptive conformal estimate of the upper endpoint with a UCB-type optimism bonus. The technical challenge is that the conformity scores used by ACP-UCB1 are recomputed from evolving empirical quantile estimates and evaluated at an adaptive level. We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution \(O(\nicefrac{\log n}{\Delta_j^{\mathrm{ACP}}})\). We also provide metric-specific regret decompositions comparing ACP-UCB1 with UCB1 and use numerical experiments to validate performance and improvement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ACP-UCB1, a stochastic bandit algorithm targeting upper-quantile performance rather than mean rewards. It employs an adaptive conformal prediction estimate of the upper endpoint F_j^{-1}(1-α/2) combined with a UCB-style optimism bonus on conformity scores that are recomputed each round from evolving empirical quantile estimates and evaluated at an adaptive miscoverage level. The analysis controls this endpoint via reward-quantile concentration, a perturbation argument on the recomputed score quantiles, and deterministic localization of the adaptive level. The central result is a logarithmic upper-quantile regret bound with per-arm contribution O(log n / Δ_j^ACP), accompanied by metric-specific regret decompositions versus UCB1 and numerical experiments.
Significance. If the control arguments hold, the work provides a principled extension of conformal methods to bandit settings with quantile objectives, which are relevant for risk-sensitive or tail-focused applications where mean-based ranking fails. The achievement of classical logarithmic regret rates under this metric, together with explicit decompositions comparing ACP-UCB1 to UCB1, offers a clean theoretical comparison. The adaptive-level mechanism is a distinctive technical feature.
major comments (2)
- [Abstract] Abstract (technical challenge paragraph): the perturbation argument for recomputed conformity-score quantiles must hold uniformly over all histories, including those with o(log t) pulls per arm. The manuscript outline does not specify the continuity modulus or Lipschitz constant used, nor does it verify that the modulus remains controlled before the estimator stabilizes; if it degrades, the accumulated error in the UCB bonus can exceed the claimed O(log n / Δ_j^ACP) per-arm term.
- [Proof of the main regret bound] Proof of the main regret bound (technical section following the abstract): the deterministic localization of the adaptive level is invoked only after estimator stabilization, but the early-round contribution to regret is not separately bounded. Without an explicit high-probability control on the perturbation error for t ≪ n when sample counts are still small, the overall logarithmic bound does not follow from the three listed ingredients.
minor comments (2)
- [Abstract] The notation Δ_j^ACP is introduced without an explicit definition in the abstract; a one-line reminder of its meaning (gap between the target quantile and the next-best arm) would improve readability.
- [Abstract] The abstract states that 'metric-specific regret decompositions' are provided, yet no indication is given of whether these are additive or multiplicative and how they differ from the standard UCB1 decomposition; a brief comparison table or equation reference would clarify the contribution.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight important points about uniformity and early-round control that require explicit clarification in the manuscript. We address each major comment below and will make the corresponding revisions.
read point-by-point responses
-
Referee: [Abstract] Abstract (technical challenge paragraph): the perturbation argument for recomputed conformity-score quantiles must hold uniformly over all histories, including those with o(log t) pulls per arm. The manuscript outline does not specify the continuity modulus or Lipschitz constant used, nor does it verify that the modulus remains controlled before the estimator stabilizes; if it degrades, the accumulated error in the UCB bonus can exceed the claimed O(log n / Δ_j^ACP) per-arm term.
Authors: We agree that uniformity of the perturbation argument over all histories must be stated explicitly, including regimes with few pulls per arm. Under the bounded-reward assumption the quantile map is Lipschitz with constant 1/α (which we will record in the revision). We will add a dedicated lemma establishing a uniform high-probability bound on the recomputed-score perturbation that holds for every t ≥ 1 and every history; the bound is obtained by applying the same reward-quantile concentration inequality used later in the proof, together with the deterministic localization of the adaptive level that is active from round 1. The resulting error term is absorbed into the O(log n / Δ_j^ACP) per-arm contribution without inflation. revision: yes
-
Referee: [Proof of the main regret bound] Proof of the main regret bound (technical section following the abstract): the deterministic localization of the adaptive level is invoked only after estimator stabilization, but the early-round contribution to regret is not separately bounded. Without an explicit high-probability control on the perturbation error for t ≪ n when sample counts are still small, the overall logarithmic bound does not follow from the three listed ingredients.
Authors: The referee is correct that the early-round contribution must be controlled separately before invoking stabilization. In the revised proof we will insert an explicit burn-in analysis: for t ≤ T_0 (where T_0 is a polynomial in log n chosen so that each arm has been pulled Ω(log n) times with high probability), we bound the per-round perturbation error crudely by a constant (using only boundedness of rewards and the fact that the adaptive level is localized deterministically). The total contribution of this finite-length phase is O(log n) and is absorbed into the leading logarithmic term. After T_0 the three ingredients (quantile concentration, perturbation control, and deterministic localization) combine exactly as stated to yield the claimed per-arm bound. revision: yes
Circularity Check
No circularity: regret bound derived from external concentration inequalities and adaptive arguments
full rationale
The paper's central claim is that ACP-UCB1 achieves O(log n / Δ_j^ACP) upper-quantile regret by controlling recomputed conformity scores via reward-quantile concentration inequalities, a perturbation argument, and deterministic localization of the adaptive level. These are standard external tools (Hoeffding-type bounds, continuity moduli, and deterministic pinning) that do not reduce the target regret expression to a fitted parameter or self-citation by construction. No equations in the provided abstract or description equate the final bound to an input fit; the derivation chain remains independent of the result itself.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Reward distributions admit suitable quantile concentration inequalities
- domain assumption Perturbation and deterministic localization arguments suffice for recomputed conformity scores
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We control this endpoint through reward-quantile concentration, a perturbation argument for recomputed score quantiles, and deterministic localization of the adaptive level. ACP-UCB1 achieves logarithmic upper-quantile regret with per-arm contribution O(log n / Δ_j^ACP).
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Assumption 1... local density regularity around the relevant reward and score quantiles
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Angelopoulos and Stephen Bates
Anastasios N. Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. Foundations and Trends in Machine Learning, 16 0 (4): 0 494--591, 2023
work page 2023
-
[2]
Angelopoulos, Rina Foygel Barber, and Stephen Bates
Anastasios N. Angelopoulos, Rina Foygel Barber, and Stephen Bates. Online conformal prediction with decaying step sizes. In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024
work page 2024
-
[3]
Exploration--exploitation tradeoff using variance estimates in multi-armed bandits
Jean-Yves Audibert, R \'e mi Munos, and Csaba Szepesv \'a ri. Exploration--exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410 0 (19): 0 1876--1902, 2009
work page 1902
-
[4]
Finite-time analysis of the multiarmed bandit problem
Peter Auer, Nicol \`o Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47 0 (2--3): 0 235--256, 2002
work page 2002
-
[5]
Cand \`e s, Aaditya Ramdas, and Ryan J
Rina Foygel Barber, Emmanuel J. Cand \`e s, Aaditya Ramdas, and Ryan J. Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51 0 (2): 0 816--845, 2023
work page 2023
-
[6]
Conformal bandits: Bringing statistical validity and reward efficiency to the small-gap regime, 2025
Simone Cuonzo and Nina Deliu. Conformal bandits: Bringing statistical validity and reward efficiency to the small-gap regime, 2025. arXiv:2512.09850
-
[7]
Pure exploration for max-quantile bandits
Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Machine Learning and Knowledge Discovery in Databases, volume 9851 of Lecture Notes in Computer Science, pages 556--571. Springer, 2016
work page 2016
-
[8]
Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27 0 (3): 0 642--669, 1956
work page 1956
-
[9]
Exploration vs exploitation vs safety: Risk-aware multi-armed bandits
Nicolas Galichet, Mich \`e le Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Proceedings of the 5th Asian Conference on Machine Learning, volume 29 of Proceedings of Machine Learning Research, pages 245--260. PMLR, 2013
work page 2013
-
[10]
The KL-UCB algorithm for bounded stochastic bandits and beyond
Aur \'e lien Garivier and Olivier Capp \'e . The KL-UCB algorithm for bounded stochastic bandits and beyond. Proceedings of the 24th Annual Conference on Learning Theory (COLT), 2011
work page 2011
-
[11]
Stochastic online conformal prediction with semi-bandit feedback
Haosen Ge, Hamsa Bastani, and Osbert Bastani. Stochastic online conformal prediction with semi-bandit feedback. In Proceedings of the 42nd International Conference on Machine Learning, 2025. ICML 2025 poster
work page 2025
-
[12]
Isaac Gibbs and Emmanuel J. Cand \`e s. Adaptive conformal inference under distribution shift. In Advances in Neural Information Processing Systems, volume 34, 2021
work page 2021
-
[13]
Probability inequalities for sums of bounded random variables
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58 0 (301): 0 13--30, 1963
work page 1963
-
[14]
Asymptotically efficient adaptive allocation rules
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6 0 (1): 0 4--22, 1985
work page 1985
-
[15]
Tor Lattimore and Csaba Szepesv \'a ri. Bandit Algorithms. Cambridge University Press, 2020
work page 2020
-
[16]
Tibshirani, and Larry Wasserman
Jing Lei, Max G'Sell, Alessandro Rinaldo, Ryan J. Tibshirani, and Larry Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113 0 (523): 0 1094--1111, 2018
work page 2018
-
[17]
The tight constant in the D voretzky-- K iefer-- W olfowitz inequality
Pascal Massart. The tight constant in the D voretzky-- K iefer-- W olfowitz inequality. The Annals of Probability, 18 0 (3): 0 1269--1283, 1990
work page 1990
-
[18]
Konstantinos E. Nikolakakis, Dionysios S. Kalogerias, Or Sheffet, and Anand D. Sarwate. Quantile multi-armed bandits: Optimal best-arm identification and a differentially private scheme. IEEE Journal on Selected Areas in Information Theory, 2 0 (2): 0 534--548, 2021. doi:10.1109/JSAIT.2021.3081525
-
[19]
Yaniv Romano, Evan Patterson, and Emmanuel J. Cand \`e s. Conformalized quantile regression. In Advances in Neural Information Processing Systems, volume 32, 2019
work page 2019
-
[20]
Risk-aversion in multi-armed bandits
Amir Sani, Alessandro Lazaric, and R \'e mi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, volume 25, 2012
work page 2012
-
[21]
Qualitative multi-armed bandits: A quantile-based approach
Balazs Szorenyi, R \'o bert Busa-Fekete, Paul Weng, and Eyke H \"u llermeier. Qualitative multi-armed bandits: A quantile-based approach. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1660--1668. PMLR, 2015
work page 2015
- [22]
-
[23]
Algorithmic Learning in a Random World
Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005
work page 2005
-
[24]
Adaptive conformal predictions for time series
Margaux Zaffran, Aymeric Dieuleveut, Olivier F \'e ron, Yannig Goude, and Julie Josse. Adaptive conformal predictions for time series. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022
work page 2022
- [25]
-
[26]
Finite-time Analysis of the Multiarmed Bandit Problem , journal =
Auer, Peter and Cesa-Bianchi, Nicol. Finite-time Analysis of the Multiarmed Bandit Problem , journal =
-
[27]
Adaptive Conformal Inference under Distribution Shift , booktitle =
Gibbs, Isaac and Cand. Adaptive Conformal Inference under Distribution Shift , booktitle =
-
[28]
Bandit Algorithms , publisher =
Lattimore, Tor and Szepesv. Bandit Algorithms , publisher =
-
[29]
Advances in Applied Mathematics , volume =
Lai, Tze Leung and Robbins, Herbert , title =. Advances in Applied Mathematics , volume =
-
[30]
The Annals of Probability , volume =
Massart, Pascal , title =. The Annals of Probability , volume =
-
[31]
and Barber, Rina Foygel and Bates, Stephen , title =
Angelopoulos, Anastasios N. and Barber, Rina Foygel and Bates, Stephen , title =. Proceedings of the 41st International Conference on Machine Learning (ICML) , year =
-
[32]
Garivier, Aur. The. Proceedings of the 24th Annual Conference on Learning Theory (COLT) , year =
-
[33]
Exploration--Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits , journal =
Audibert, Jean-Yves and Munos, R. Exploration--Exploitation Tradeoff Using Variance Estimates in Multi-armed Bandits , journal =
- [34]
-
[35]
Vovk, Vladimir and Gammerman, Alex and Shafer, Glenn , title =
-
[36]
and Wasserman, Larry , title =
Lei, Jing and G'Sell, Max and Rinaldo, Alessandro and Tibshirani, Ryan J. and Wasserman, Larry , title =. Journal of the American Statistical Association , volume =
-
[37]
Angelopoulos, Anastasios N. and Bates, Stephen , title =. Foundations and Trends in Machine Learning , volume =
-
[38]
Adaptive Conformal Predictions for Time Series , booktitle =
Zaffran, Margaux and Dieuleveut, Aymeric and F. Adaptive Conformal Predictions for Time Series , booktitle =
-
[39]
Conformal Prediction Beyond Exchangeability , journal =
Barber, Rina Foygel and Cand. Conformal Prediction Beyond Exchangeability , journal =
- [40]
-
[41]
Risk-Aversion in Multi-armed Bandits , booktitle =
Sani, Amir and Lazaric, Alessandro and Munos, R. Risk-Aversion in Multi-armed Bandits , booktitle =
-
[42]
Qualitative Multi-Armed Bandits: A Quantile-Based Approach , booktitle =
Szorenyi, Balazs and Busa-Fekete, R. Qualitative Multi-Armed Bandits: A Quantile-Based Approach , booktitle =
-
[43]
Nikolakakis, Konstantinos E. and Kalogerias, Dionysios S. and Sheffet, Or and Sarwate, Anand D. , title =. IEEE Journal on Selected Areas in Information Theory , volume =. 2021 , doi =
work page 2021
-
[44]
Conformalized Quantile Regression , booktitle =
Romano, Yaniv and Patterson, Evan and Cand. Conformalized Quantile Regression , booktitle =
-
[45]
Proceedings of the 42nd International Conference on Machine Learning , year =
Ge, Haosen and Bastani, Hamsa and Bastani, Osbert , title =. Proceedings of the 42nd International Conference on Machine Learning , year =
-
[46]
Journal of the American Statistical Association , volume=
Probability inequalities for sums of bounded random variables , author=. Journal of the American Statistical Association , volume=
-
[47]
The Annals of Mathematical Statistics , volume=
Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator , author=. The Annals of Mathematical Statistics , volume=
-
[48]
Machine Learning and Knowledge Discovery in Databases , series =
David, Yahel and Shimkin, Nahum , title =. Machine Learning and Knowledge Discovery in Databases , series =
-
[49]
Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits , booktitle =
Galichet, Nicolas and Sebag, Mich. Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits , booktitle =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.