The Good, the Bad, and the Sampled: a No-Regret Approach to Safe Online Classification

Aldo Pacchiano; Spyros Dragazis; Tavor Z. Baharav

arxiv: 2510.01020 · v2 · submitted 2025-10-01 · 💻 cs.LG · cs.AI· math.ST· stat.ML· stat.TH

The Good, the Bad, and the Sampled: a No-Regret Approach to Safe Online Classification

Tavor Z. Baharav , Spyros Dragazis , Aldo Pacchiano This is my paper

Pith reviewed 2026-05-18 10:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.STstat.MLstat.TH

keywords safe online classificationno-regret learninglogistic regressionsequential testingerror-constrained classificationmedical screeninghigh probability bounds

0 comments

The pith

A method for online logistic classification keeps misclassification below alpha with high probability using only tilde O of square root T extra tests beyond an oracle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines sequential decisions for binary outcomes such as disease status when the underlying risk follows an unknown logistic model. At each step a decision maker can pay for a revealing test or instead predict from observed features, with the aim of using as few tests as possible while ensuring the error rate stays below a preset alpha with probability at least one minus delta. The proposed procedure estimates both the logistic parameter and the feature distribution online and applies a conservative threshold to the logistic score to decide when prediction is safe. A sympathetic reader would care because the guarantee allows medical screening programs to operate with bounded risk and only modest extra testing cost relative to perfect prior knowledge.

Core claim

We study sequential testing for a binary disease outcome when risk follows an unknown logistic model. At each round the decision maker may either pay for a test revealing the true label or predict the outcome based on patient features and past data. The goal is to minimize costly tests while ensuring the misclassification rate stays below alpha with probability at least one minus delta. We propose a method that jointly estimates the logistic parameter theta star and the feature distribution, using a conservative threshold on the logistic score to decide when to test. We prove our procedure achieves the target error with high probability and requires only tilde O of square root T more tests.

What carries the argument

The conservative threshold applied to the online-estimated logistic score, which decides whether to predict or to pay for a label-revealing test while the model parameters and feature distribution are updated jointly.

If this is right

The misclassification rate remains below alpha with probability at least one minus delta throughout the sequence.
The total number of tests exceeds that of an oracle knowing the true model by only tilde O of square root T.
The logistic parameter is estimated accurately enough to support the safe threshold decisions.
The procedure applies directly to medical screening tasks that require both safety and efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same online estimation-plus-threshold idea could be tested on other link functions beyond logistic for risk modeling.
In deployment the approach might reduce overall screening costs by limiting tests to those near the decision boundary.
The high-probability bound could be checked empirically on public medical datasets by measuring excess tests over long horizons.

Load-bearing premise

The risk follows an unknown logistic model and the feature distribution can be jointly estimated online while a conservative threshold on the logistic score suffices to control the misclassification rate.

What would settle it

A long run of patient classifications in which the observed misclassification rate exceeds alpha more often than the allowed delta or in which the number of excess tests grows faster than order square root of T would show the claimed guarantee does not hold.

Figures

Figures reproduced from arXiv: 2510.01020 by Aldo Pacchiano, Spyros Dragazis, Tavor Z. Baharav.

**Figure 1.** Figure 1: Threshold-based testing policy. When P and θ ⋆ are known, a threshold decision rule is optimal when the safety constraint is imposed only in expectation, as we show in the following proposition. 3 [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Pessimistic choice of |τt|. Our testing threshold τt is designed to be systematically pessimistic. We begin with a plug-in estimator of the optimal threshold as τ ⋆ (θ L t , Pˆ t, α). To guarantee safety, we inflate our threshold to account for estimation errors. First, we reduce α to αt = max(0, α − p log(2t 2/δ′)/2t) (discussed in Section D.4) to guarantee (α, δ)-safety, if the true θ ⋆ and P were known.… view at source ↗

**Figure 3.** Figure 3: Simulation results. First and second row correspond to [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: B(0, 1) ∩ {x : |⟨x, θ⋆ ⟩| ≤ τ ⋆} Before proving Lemma 12 we will first prove an auxiliary lemma that allows us to work with a more convenient vector in the surface of the unit ball instead of θ ⋆ . For more details about orthogonal transformations we refer the reader to [25]. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_4.png] view at source ↗

**Figure 5.** Figure 5: The CDF of Beta( d+1 2 , 1 2 ) for various values of d. f(z) = z − 1 2 (1 − z) d−1 2 , log (f(z)) = − 1 2 log z + d − 1 2 log(1 − z), log (f(z))′ = − 1 2z − d − 1 2(1 − z) < 0. Then, for all 0 < z < 1, f ′ (z) = log (f(z))′ f(z) < 0, and so F is concave [PITH_FULL_IMAGE:figures/full_fig_p027_5.png] view at source ↗

read the original abstract

We study sequential testing for a binary disease outcome when risk follows an unknown logistic model. At each round, the decision maker may either pay for a test revealing the true label or predict the outcome based on patient features and past data. The goal is to minimize costly tests while ensuring the misclassification rate stays below $\alpha$ with probability at least $1-\delta$. We propose a method that jointly estimates the logistic parameter $\theta^{\star}$ and the feature distribution, using a conservative threshold on the logistic score to decide when to test. We prove our procedure achieves the target error with high probability and requires only $\widetilde O(\sqrt{T})$ more tests than an oracle with full knowledge. This is the first no-regret guarantee for error-constrained logistic testing, with direct applications to medical screening. Simulations corroborate our theoretical results, showing safe classification of patients and efficient estimation of $\theta^{\star}$ with few excess tests.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first no-regret bound for keeping misclassification under alpha in online logistic testing via joint estimation and a conservative threshold, but the high-probability claim rests on whether the analysis handles adaptive sampling dependence.

read the letter

Colleague, the core result here is a procedure that jointly estimates the logistic parameter and feature distribution online, then uses a time-varying conservative threshold on the score to decide testing versus prediction. It claims to keep the misclassification rate below alpha with high probability while using only O(sqrt(T)) extra tests compared to an oracle that knows everything. This appears to be the first such no-regret guarantee in the error-constrained logistic setting, which is the main new piece relative to prior online classification and safe testing work. The simulations show the method controls error in practice and learns the parameter with limited excess tests, which is useful for sequential medical screening or similar costly-test scenarios. The joint estimation plus conservative threshold is a reasonable way to operationalize the safety constraint without knowing the distribution in advance. On the soft spots, the stress-test note flags a real issue worth checking in the proofs. Because the decision to obtain the next label depends on the current estimate, the samples are not independent of the estimation error, so standard Hoeffding-style bounds or a plain union bound over T rounds may not deliver a uniform 1-delta guarantee. A martingale or self-normalized concentration argument that accounts for the induced stopping times would be needed; if the paper relies on something weaker, the claimed high-probability control could fail even when the logistic model is correct. The citation pattern looks standard and there is no obvious circularity in how the excess-test quantity is defined. This paper is aimed at people working on online learning with hard safety constraints. A reader who cares about theoretical guarantees for sequential classification with limited labels will find the bound and the method worth examining. It deserves a serious referee to verify the concentration details and the threshold construction, even if revisions are likely needed on the adaptive analysis.

Referee Report

2 major / 2 minor

Summary. The manuscript studies sequential testing for binary disease outcomes under an unknown logistic risk model. At each round the decision maker may test (paying for the true label) or predict from features and past data. The goal is to keep the misclassification rate at most α with probability at least 1-δ while minimizing tests. The proposed algorithm jointly estimates the logistic parameter θ* from tested labels and the feature distribution from all observed x_t, then applies a time-varying conservative threshold on the logistic score to decide testing versus prediction. The central claims are that the procedure meets the target error with high probability and incurs only Õ(√T) excess tests relative to an oracle with full knowledge; this is presented as the first no-regret guarantee for error-constrained logistic testing, with supporting simulations.

Significance. If the high-probability guarantee and sublinear excess-query bound hold, the result would constitute a meaningful advance in safe online classification, supplying the first explicit no-regret analysis under a uniform error constraint for logistic models. The medical-screening motivation is concrete and the combination of joint online estimation with an adaptive threshold is technically interesting. Simulations are reported to corroborate both safety and efficient estimation.

major comments (2)

[§4] §4 (Proof of the main high-probability bound, presumably Theorem 1): the argument must control the supremum deviation of the estimated risk under the adaptive sampling rule induced by the conservative threshold. Standard Hoeffding or Bernstein inequalities on fixed samples do not directly apply because the decision to obtain the next label depends on the current estimate, creating dependence between the sampling process and the estimation error. Without an explicit martingale or self-normalized concentration argument that accounts for the stopping times, the claimed uniform 1-δ guarantee over all rounds is not yet established.
[§3] §3 (Algorithm and threshold definition): the precise functional form of the conservative threshold (how it is computed from the current estimates of θ* and the feature distribution) is not stated with sufficient explicitness to verify that it indeed enforces the misclassification rate ≤ α uniformly. The paper should supply the exact expression and the derivation showing why this choice yields the claimed high-probability control.

minor comments (2)

[Notation] The definition of the excess-test quantity relative to the oracle should be stated formally (e.g., as an expectation or high-probability statement) and its relationship to the fitted parameters clarified.
[Experiments] In the experimental section, report the concrete parameter values (α, δ, T, dimension, and how the logistic model and feature distribution are generated) so that the simulation results can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and valuable feedback on our manuscript. We address each of the major comments below and have made revisions to improve the clarity and rigor of the presentation.

read point-by-point responses

Referee: [§4] §4 (Proof of the main high-probability bound, presumably Theorem 1): the argument must control the supremum deviation of the estimated risk under the adaptive sampling rule induced by the conservative threshold. Standard Hoeffding or Bernstein inequalities on fixed samples do not directly apply because the decision to obtain the next label depends on the current estimate, creating dependence between the sampling process and the estimation error. Without an explicit martingale or self-normalized concentration argument that accounts for the stopping times, the claimed uniform 1-δ guarantee over all rounds is not yet established.

Authors: We appreciate the referee's point regarding the need for careful handling of the adaptive sampling in the concentration bounds. Upon reflection, the original proof sketch relied on standard concentration but indeed requires extension to account for the dependence. In the revised version, we will provide a complete proof using self-normalized martingale inequalities (such as those based on Freedman's inequality) to bound the supremum deviation of the estimated risk under the adaptive threshold rule. This will rigorously establish the uniform high-probability control. revision: yes
Referee: [§3] §3 (Algorithm and threshold definition): the precise functional form of the conservative threshold (how it is computed from the current estimates of θ* and the feature distribution) is not stated with sufficient explicitness to verify that it indeed enforces the misclassification rate ≤ α uniformly. The paper should supply the exact expression and the derivation showing why this choice yields the claimed high-probability control.

Authors: We agree that more explicit details on the threshold are necessary for verification. The conservative threshold at time t is defined as the smallest value τ such that the integral over the feature distribution of the logistic probability adjusted by the estimation error bound is at most α. We will include the precise formula for the threshold in terms of the current estimates ˆθ_t and the empirical feature distribution, along with the derivation from the high-probability bound on the logistic score deviation, in the updated Section 3. revision: yes

Circularity Check

0 steps flagged

Derivation self-contained; no circular reductions to inputs or self-citations

full rationale

The paper presents a joint online estimation procedure for θ* and the feature distribution, combined with a conservative logistic-score threshold for deciding when to test. It claims a high-probability guarantee that misclassification rate ≤ α and Õ(√T) excess tests versus an oracle. No equations or steps in the provided abstract or description reduce the excess-test bound or error control to a fitted parameter by construction, nor do they rely on load-bearing self-citations whose content is unverified. The central no-regret result is derived from the adaptive procedure and concentration arguments, which constitute independent mathematical content rather than tautological renaming or self-definition. This is the expected outcome for a theoretical proof paper whose bounds are stated relative to external oracles and standard tools.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the logistic risk model and the feasibility of online joint estimation of parameters and feature distribution; these are domain assumptions rather than derived quantities.

axioms (2)

domain assumption Risk follows an unknown logistic model
Explicitly stated as the generative model for the binary outcome in the abstract.
domain assumption Feature distribution can be estimated jointly with the logistic parameter
Used as the basis for the conservative threshold decision rule.

pith-pipeline@v0.9.0 · 5712 in / 1264 out tokens · 34314 ms · 2026-05-18T10:24:00.548117+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a method that jointly estimates the logistic parameter θ⋆ and the feature distribution, using a conservative threshold on the logistic score to decide when to test. ... τt ≜ τ⋆(θLt, P̂t, αt − ζt − 2Bt/√λtmin − εQ) + 3Bt/√λtmin + εQ
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Definition 1. An algorithm ... satisfies (α,δ)-safety if P(∀T̄, (1/T̄)∑1{Ŷt≠Yt}≤α)≥1−δ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1]

Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011

Yasin Abbasi-Yadkori, D´ avid P´ al, and Csaba Szepesv´ ari. Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011. 6

work page 2011
[2]

Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022. 1

work page 2022
[3]

Interpretable operations research for high-stakes decisions: Designing the greek covid-19 testing system.INFORMS Journal on Applied Analytics, 52(5):398–411, 2022

Hamsa Bastani, Kimon Drakopoulos, Vishal Gupta, Jon Vlachogiannis, Christos Had- jichristodoulou, Pagona Lagiou, Gkikas Magiorkinis, Dimitrios Paraskevis, and Sotirios Tsiodras. Interpretable operations research for high-stakes decisions: Designing the greek covid-19 testing system.INFORMS Journal on Applied Analytics, 52(5):398–411, 2022. 2

work page 2022
[4]

Probably approximately correct labels

Emmanuel J Cand` es, Andrew Ilyas, and Tijana Zrnic. Probably approximately correct labels. arXiv preprint arXiv:2506.10908, 2025. 15

work page arXiv 2025
[5]

Cambridge university press, 2006

Nicolo Cesa-Bianchi and G´ abor Lugosi.Prediction, learning, and games. Cambridge university press, 2006. 6

work page 2006
[6]

Worst-case analysis of selective sampling for linear classification.Journal of Machine Learning Research, 7(7), 2006

Nicolo Cesa-Bianchi, Claudio Gentile, Luca Zaniboni, and Manfred Warmuth. Worst-case analysis of selective sampling for linear classification.Journal of Machine Learning Research, 7(7), 2006. 2

work page 2006
[7]

Concentration inequalities and martingale inequalities: a survey

Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006. 40

work page 2006
[8]

Machine learning in drug discovery: a review.Artificial intelligence review, 55(3):1947–1999,

Suresh Dara, Swetha Dhamercherla, Surender Singh Jadav, CH Madhu Babu, and Mohamed Jawed Ahsan. Machine learning in drug discovery: a review.Artificial intelligence review, 55(3):1947–1999,

work page 1947
[9]

Analysis of perceptron-based active learning

Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. InInternational conference on computational learning theory, pages 249–263. Springer, 2005. 15

work page 2005
[10]

Selective sampling and active learning from single and multiple teachers.The Journal of Machine Learning Research, 13(1):2655–2697, 2012

Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers.The Journal of Machine Learning Research, 13(1):2655–2697, 2012. 2

work page 2012
[11]

Efficiently learning halfspaces with tsybakov noise

Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Efficiently learning halfspaces with tsybakov noise. InProceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 88–101, 2021. 15

work page 2021
[12]

Online learning of halfspaces with massart noise.arXiv preprint arXiv:2405.12958, 2024

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Online learning of halfspaces with massart noise.arXiv preprint arXiv:2405.12958, 2024. 15

work page arXiv 2024
[13]

Towards semi- supervised learning with non-random missing labels

Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, and Yinghuan Shi. Towards semi- supervised learning with non-random missing labels. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16121–16131, 2023. 2

work page 2023
[14]

Computation of the regularized incomplete beta function

Vera Egorova, Amparo Gil, Javier Segura, NM Temme, et al. Computation of the regularized incomplete beta function. 2023. 24 11

work page 2023
[15]

Improved optimistic algorithms for logistic bandits

Louis Faury, Marc Abeille, Cl´ ement Calauz` enes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. InInternational Conference on Machine Learning, pages 3052–3060. PMLR,

work page
[16]

John Wiley & Sons,

Gerald B Folland.Real analysis: modern techniques and their applications. John Wiley & Sons,

work page
[17]

Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997

Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997. 2

work page 1997
[18]

Online selective classifica- tion with limited feedback.Advances in Neural Information Processing Systems, 34:14529–14541,

Aditya Gangrade, Anil Kag, Ashok Cutkosky, and Venkatesh Saligrama. Online selective classifica- tion with limited feedback.Advances in Neural Information Processing Systems, 34:14529–14541,

work page
[19]

Selective classification via one-sided prediction

Aditya Gangrade, Anil Kag, and Venkatesh Saligrama. Selective classification via one-sided prediction. InInternational Conference on Artificial Intelligence and Statistics, pages 2179–2187. PMLR, 2021. 2 and 15

work page 2021
[20]

Safe linear bandits over unknown polytopes

Aditya Gangrade, Tianrui Chen, and Venkatesh Saligrama. Safe linear bandits over unknown polytopes. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1755–1795. PMLR, 2024. 2

work page 2024
[21]

Safe machine learning.Statistics, 58(3):473–477, 2024

Paolo Giudici. Safe machine learning.Statistics, 58(3):473–477, 2024. 1

work page 2024
[22]

Adversarial resilience in sequential prediction via abstention.Advances in Neural Information Processing Systems, 36:8027–8047,

Surbhi Goel, Steve Hanneke, Shay Moran, and Abhishek Shetty. Adversarial resilience in sequential prediction via abstention.Advances in Neural Information Processing Systems, 36:8027–8047,

work page
[23]

arXiv preprint arXiv:2205.10330 , year =

Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theory and applications.arXiv preprint arXiv:2205.10330, 2022. 1

work page arXiv 2022
[24]

Toward a general theory of online selective sampling: Trading off mistakes and queries

Steve Hanneke and Liu Yang. Toward a general theory of online selective sampling: Trading off mistakes and queries. InInternational Conference on Artificial Intelligence and Statistics, pages 3997–4005. PMLR, 2021. 2

work page 2021
[25]

Cambridge university press, 2012

Roger A Horn and Charles R Johnson.Matrix analysis. Cambridge university press, 2012. 24

work page 2012
[26]

Volumes of n-dimensional spheres and ellipsoids, 2014

Michael Jorgensen. Volumes of n-dimensional spheres and ellipsoids, 2014. 25 and 26

work page 2014
[27]

Conservative contextual linear bandits.Advances in Neural Information Processing Systems, 30,

Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi Yadkori, and Benjamin Van Roy. Conservative contextual linear bandits.Advances in Neural Information Processing Systems, 30,

work page
[28]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesv´ ari.Bandit algorithms. Cambridge University Press, 2020. 1, 21, and 39

work page 2020
[29]

A unified confidence sequence for generalized linear models, with applications to bandits.Advances in Neural Information Processing Systems, 37:124640–124685, 2025

Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits.Advances in Neural Information Processing Systems, 37:124640–124685, 2025. 5 and 9

work page 2025
[30]

Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010

Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010. 25

work page 2010
[31]

Better algorithms for selective sampling

Francesco Orabona, Nicolo Cesa-Bianchi, et al. Better algorithms for selective sampling. In Proceedings of the 28th international conference on machine learning: Bellevue, Washington, USA, june 28. july 2, 2011, pages 433–440. Omnipress, 2011. 2 and 5 12

work page 2011
[32]

Stochastic bandits with linear constraints

Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. InInternational conference on artificial intelligence and statistics, pages 2827–2835. PMLR, 2021. 2

work page 2021
[33]

Machine learning portfolio allocation.The Journal of Finance and Data Science, 8:35–54, 2022

Michael Pinelis and David Ruppert. Machine learning portfolio allocation.The Journal of Finance and Data Science, 8:35–54, 2022. 1

work page 2022
[34]

Selective sampling and imitation learning via online regression.Advances in Neural Information Processing Systems, 36:67213–67268,

Ayush Sekhari, Karthik Sridharan, Wen Sun, and Runzhe Wu. Selective sampling and imitation learning via online regression.Advances in Neural Information Processing Systems, 36:67213–67268,

work page
[35]

Active learning literature survey

Burr Settles. Active learning literature survey. 2009. 2

work page 2009
[36]

Query by committee

H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. InProceedings of the fifth annual workshop on Computational learning theory, pages 287–294, 1992. 2

work page 1992
[37]

Dynamic Ad Allocation: Bandits with Budgets

Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets.arXiv preprint arXiv:1306.0155,

work page internal anchor Pith review Pith/arXiv arXiv
[38]

Reinforcement learning.Journal of Cognitive Neuroscience, 11(1):126–134, 1999

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning.Journal of Cognitive Neuroscience, 11(1):126–134, 1999. 1

work page 1999
[39]

Optimal aggregation of classifiers in statistical learning.The Annals of Statistics, 32(1):135–166, 2004

Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning.The Annals of Statistics, 32(1):135–166, 2004. 15

work page 2004
[40]

Applications of machine learning in drug discovery and development.Nature reviews Drug discovery, 18(6):463–477, 2019

Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development.Nature reviews Drug discovery, 18(6):463–477, 2019. 1

work page 2019
[41]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. 16, 19, 20, and 41

work page 2018
[42]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. 18 and 19

work page 2019
[43]

Power con- strained bandits

Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, and Finale Doshi-Velez. Power con- strained bandits. InMachine Learning for Healthcare Conference, pages 209–259. PMLR, 2021. 2

work page 2021
[44]

expert” model also releases an uncertainty level Ui about its prediction. The algorithmic challenge is to leverage the uncertainty levels to produce “PAC labels

Zheqing Zhu and Benjamin Van Roy. Scalable neural contextual bandit for recommender sys- tems. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3636–3646, 2023. 1 13 Appendix Contents 1 Introduction 1 2 Preliminaries 2 2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page 2023
[45]

If ˆYt = 1 thenE(1{ ˆYt ̸=Y t} | ˆYt = 1) = 1−p t

work page
[46]

sandwich

Else if ˆYt = 0 thenE(1{ ˆYt ̸=Y t} | ˆYt = 0) =p t. The optimal policy then is to output the prediction with the smallest error. The expected error then is equal to E(1{ ˆYt ̸=Y t})≜min{1−p t, pt}. We denote P(Zt = 0) = ηt. The optimal policy choice is reduced to the following optimization problem. 15 min {ηt} TX t=1 1−η t s.t. 1 T TX t=1 min{1−p t, pt}η...

work page
[47]

f(z) =z − 1 2 (1−z) d−1 2 , log (f(z)) =− 1 2 logz+ d−1 2 log(1−z), log (f(z))′ =− 1 2z − d−1 2(1−z) <0

for various values ofd. f(z) =z − 1 2 (1−z) d−1 2 , log (f(z)) =− 1 2 logz+ d−1 2 log(1−z), log (f(z))′ =− 1 2z − d−1 2(1−z) <0. Then, for all 0 < z < 1, f ′(z) = log (f(z))′ f(z) < 0, and so F is concave. Figure 5 illustrates the CDF across various values of parameterd >1. To continue in our analysis, we will need to show that the τ for which we are eval...

work page

[1] [1]

Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011

Yasin Abbasi-Yadkori, D´ avid P´ al, and Csaba Szepesv´ ari. Improved algorithms for linear stochastic bandits.Advances in neural information processing systems, 24, 2011. 6

work page 2011

[2] [2]

Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022

M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey.ACM Computing Surveys, 55(7):1–38, 2022. 1

work page 2022

[3] [3]

Interpretable operations research for high-stakes decisions: Designing the greek covid-19 testing system.INFORMS Journal on Applied Analytics, 52(5):398–411, 2022

Hamsa Bastani, Kimon Drakopoulos, Vishal Gupta, Jon Vlachogiannis, Christos Had- jichristodoulou, Pagona Lagiou, Gkikas Magiorkinis, Dimitrios Paraskevis, and Sotirios Tsiodras. Interpretable operations research for high-stakes decisions: Designing the greek covid-19 testing system.INFORMS Journal on Applied Analytics, 52(5):398–411, 2022. 2

work page 2022

[4] [4]

Probably approximately correct labels

Emmanuel J Cand` es, Andrew Ilyas, and Tijana Zrnic. Probably approximately correct labels. arXiv preprint arXiv:2506.10908, 2025. 15

work page arXiv 2025

[5] [5]

Cambridge university press, 2006

Nicolo Cesa-Bianchi and G´ abor Lugosi.Prediction, learning, and games. Cambridge university press, 2006. 6

work page 2006

[6] [6]

Worst-case analysis of selective sampling for linear classification.Journal of Machine Learning Research, 7(7), 2006

Nicolo Cesa-Bianchi, Claudio Gentile, Luca Zaniboni, and Manfred Warmuth. Worst-case analysis of selective sampling for linear classification.Journal of Machine Learning Research, 7(7), 2006. 2

work page 2006

[7] [7]

Concentration inequalities and martingale inequalities: a survey

Fan Chung and Linyuan Lu. Concentration inequalities and martingale inequalities: a survey. Internet mathematics, 3(1):79–127, 2006. 40

work page 2006

[8] [8]

Machine learning in drug discovery: a review.Artificial intelligence review, 55(3):1947–1999,

Suresh Dara, Swetha Dhamercherla, Surender Singh Jadav, CH Madhu Babu, and Mohamed Jawed Ahsan. Machine learning in drug discovery: a review.Artificial intelligence review, 55(3):1947–1999,

work page 1947

[9] [9]

Analysis of perceptron-based active learning

Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. InInternational conference on computational learning theory, pages 249–263. Springer, 2005. 15

work page 2005

[10] [10]

Selective sampling and active learning from single and multiple teachers.The Journal of Machine Learning Research, 13(1):2655–2697, 2012

Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers.The Journal of Machine Learning Research, 13(1):2655–2697, 2012. 2

work page 2012

[11] [11]

Efficiently learning halfspaces with tsybakov noise

Ilias Diakonikolas, Daniel M Kane, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Efficiently learning halfspaces with tsybakov noise. InProceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 88–101, 2021. 15

work page 2021

[12] [12]

Online learning of halfspaces with massart noise.arXiv preprint arXiv:2405.12958, 2024

Ilias Diakonikolas, Vasilis Kontonis, Christos Tzamos, and Nikos Zarifis. Online learning of halfspaces with massart noise.arXiv preprint arXiv:2405.12958, 2024. 15

work page arXiv 2024

[13] [13]

Towards semi- supervised learning with non-random missing labels

Yue Duan, Zhen Zhao, Lei Qi, Luping Zhou, Lei Wang, and Yinghuan Shi. Towards semi- supervised learning with non-random missing labels. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16121–16131, 2023. 2

work page 2023

[14] [14]

Computation of the regularized incomplete beta function

Vera Egorova, Amparo Gil, Javier Segura, NM Temme, et al. Computation of the regularized incomplete beta function. 2023. 24 11

work page 2023

[15] [15]

Improved optimistic algorithms for logistic bandits

Louis Faury, Marc Abeille, Cl´ ement Calauz` enes, and Olivier Fercoq. Improved optimistic algorithms for logistic bandits. InInternational Conference on Machine Learning, pages 3052–3060. PMLR,

work page

[16] [16]

John Wiley & Sons,

Gerald B Folland.Real analysis: modern techniques and their applications. John Wiley & Sons,

work page

[17] [17]

Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997

Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm.Machine learning, 28:133–168, 1997. 2

work page 1997

[18] [18]

Online selective classifica- tion with limited feedback.Advances in Neural Information Processing Systems, 34:14529–14541,

Aditya Gangrade, Anil Kag, Ashok Cutkosky, and Venkatesh Saligrama. Online selective classifica- tion with limited feedback.Advances in Neural Information Processing Systems, 34:14529–14541,

work page

[19] [19]

Selective classification via one-sided prediction

Aditya Gangrade, Anil Kag, and Venkatesh Saligrama. Selective classification via one-sided prediction. InInternational Conference on Artificial Intelligence and Statistics, pages 2179–2187. PMLR, 2021. 2 and 15

work page 2021

[20] [20]

Safe linear bandits over unknown polytopes

Aditya Gangrade, Tianrui Chen, and Venkatesh Saligrama. Safe linear bandits over unknown polytopes. InThe Thirty Seventh Annual Conference on Learning Theory, pages 1755–1795. PMLR, 2024. 2

work page 2024

[21] [21]

Safe machine learning.Statistics, 58(3):473–477, 2024

Paolo Giudici. Safe machine learning.Statistics, 58(3):473–477, 2024. 1

work page 2024

[22] [22]

Adversarial resilience in sequential prediction via abstention.Advances in Neural Information Processing Systems, 36:8027–8047,

Surbhi Goel, Steve Hanneke, Shay Moran, and Abhishek Shetty. Adversarial resilience in sequential prediction via abstention.Advances in Neural Information Processing Systems, 36:8027–8047,

work page

[23] [23]

arXiv preprint arXiv:2205.10330 , year =

Shangding Gu, Long Yang, Yali Du, Guang Chen, Florian Walter, Jun Wang, and Alois Knoll. A review of safe reinforcement learning: Methods, theory and applications.arXiv preprint arXiv:2205.10330, 2022. 1

work page arXiv 2022

[24] [24]

Toward a general theory of online selective sampling: Trading off mistakes and queries

Steve Hanneke and Liu Yang. Toward a general theory of online selective sampling: Trading off mistakes and queries. InInternational Conference on Artificial Intelligence and Statistics, pages 3997–4005. PMLR, 2021. 2

work page 2021

[25] [25]

Cambridge university press, 2012

Roger A Horn and Charles R Johnson.Matrix analysis. Cambridge university press, 2012. 24

work page 2012

[26] [26]

Volumes of n-dimensional spheres and ellipsoids, 2014

Michael Jorgensen. Volumes of n-dimensional spheres and ellipsoids, 2014. 25 and 26

work page 2014

[27] [27]

Conservative contextual linear bandits.Advances in Neural Information Processing Systems, 30,

Abbas Kazerouni, Mohammad Ghavamzadeh, Yasin Abbasi Yadkori, and Benjamin Van Roy. Conservative contextual linear bandits.Advances in Neural Information Processing Systems, 30,

work page

[28] [28]

Cambridge University Press, 2020

Tor Lattimore and Csaba Szepesv´ ari.Bandit algorithms. Cambridge University Press, 2020. 1, 21, and 39

work page 2020

[29] [29]

A unified confidence sequence for generalized linear models, with applications to bandits.Advances in Neural Information Processing Systems, 37:124640–124685, 2025

Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits.Advances in Neural Information Processing Systems, 37:124640–124685, 2025. 5 and 9

work page 2025

[30] [30]

Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010

Shengqiao Li. Concise formulas for the area and volume of a hyperspherical cap.Asian Journal of Mathematics & Statistics, 4(1):66–70, 2010. 25

work page 2010

[31] [31]

Better algorithms for selective sampling

Francesco Orabona, Nicolo Cesa-Bianchi, et al. Better algorithms for selective sampling. In Proceedings of the 28th international conference on machine learning: Bellevue, Washington, USA, june 28. july 2, 2011, pages 433–440. Omnipress, 2011. 2 and 5 12

work page 2011

[32] [32]

Stochastic bandits with linear constraints

Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. InInternational conference on artificial intelligence and statistics, pages 2827–2835. PMLR, 2021. 2

work page 2021

[33] [33]

Machine learning portfolio allocation.The Journal of Finance and Data Science, 8:35–54, 2022

Michael Pinelis and David Ruppert. Machine learning portfolio allocation.The Journal of Finance and Data Science, 8:35–54, 2022. 1

work page 2022

[34] [34]

Selective sampling and imitation learning via online regression.Advances in Neural Information Processing Systems, 36:67213–67268,

Ayush Sekhari, Karthik Sridharan, Wen Sun, and Runzhe Wu. Selective sampling and imitation learning via online regression.Advances in Neural Information Processing Systems, 36:67213–67268,

work page

[35] [35]

Active learning literature survey

Burr Settles. Active learning literature survey. 2009. 2

work page 2009

[36] [36]

Query by committee

H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. InProceedings of the fifth annual workshop on Computational learning theory, pages 287–294, 1992. 2

work page 1992

[37] [37]

Dynamic Ad Allocation: Bandits with Budgets

Aleksandrs Slivkins. Dynamic ad allocation: Bandits with budgets.arXiv preprint arXiv:1306.0155,

work page internal anchor Pith review Pith/arXiv arXiv

[38] [38]

Reinforcement learning.Journal of Cognitive Neuroscience, 11(1):126–134, 1999

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning.Journal of Cognitive Neuroscience, 11(1):126–134, 1999. 1

work page 1999

[39] [39]

Optimal aggregation of classifiers in statistical learning.The Annals of Statistics, 32(1):135–166, 2004

Alexander B Tsybakov. Optimal aggregation of classifiers in statistical learning.The Annals of Statistics, 32(1):135–166, 2004. 15

work page 2004

[40] [40]

Applications of machine learning in drug discovery and development.Nature reviews Drug discovery, 18(6):463–477, 2019

Jessica Vamathevan, Dominic Clark, Paul Czodrowski, Ian Dunham, Edgardo Ferran, George Lee, Bin Li, Anant Madabhushi, Parantu Shah, Michaela Spitzer, et al. Applications of machine learning in drug discovery and development.Nature reviews Drug discovery, 18(6):463–477, 2019. 1

work page 2019

[41] [41]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018. 16, 19, 20, and 41

work page 2018

[42] [42]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019. 18 and 19

work page 2019

[43] [43]

Power con- strained bandits

Jiayu Yao, Emma Brunskill, Weiwei Pan, Susan Murphy, and Finale Doshi-Velez. Power con- strained bandits. InMachine Learning for Healthcare Conference, pages 209–259. PMLR, 2021. 2

work page 2021

[44] [44]

expert” model also releases an uncertainty level Ui about its prediction. The algorithmic challenge is to leverage the uncertainty levels to produce “PAC labels

Zheqing Zhu and Benjamin Van Roy. Scalable neural contextual bandit for recommender sys- tems. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3636–3646, 2023. 1 13 Appendix Contents 1 Introduction 1 2 Preliminaries 2 2.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page 2023

[45] [45]

If ˆYt = 1 thenE(1{ ˆYt ̸=Y t} | ˆYt = 1) = 1−p t

work page

[46] [46]

sandwich

Else if ˆYt = 0 thenE(1{ ˆYt ̸=Y t} | ˆYt = 0) =p t. The optimal policy then is to output the prediction with the smallest error. The expected error then is equal to E(1{ ˆYt ̸=Y t})≜min{1−p t, pt}. We denote P(Zt = 0) = ηt. The optimal policy choice is reduced to the following optimization problem. 15 min {ηt} TX t=1 1−η t s.t. 1 T TX t=1 min{1−p t, pt}η...

work page

[47] [47]

f(z) =z − 1 2 (1−z) d−1 2 , log (f(z)) =− 1 2 logz+ d−1 2 log(1−z), log (f(z))′ =− 1 2z − d−1 2(1−z) <0

for various values ofd. f(z) =z − 1 2 (1−z) d−1 2 , log (f(z)) =− 1 2 logz+ d−1 2 log(1−z), log (f(z))′ =− 1 2z − d−1 2(1−z) <0. Then, for all 0 < z < 1, f ′(z) = log (f(z))′ f(z) < 0, and so F is concave. Figure 5 illustrates the CDF across various values of parameterd >1. To continue in our analysis, we will need to show that the τ for which we are eval...

work page