Online Conformal Prediction with Adversarial Semi-bandit Feedback via Regret Minimization

Junyoung Yang; Kyungmin Kim; Sangdon Park

arxiv: 2604.17984 · v1 · submitted 2026-04-20 · 💻 cs.LG · stat.ML

Online Conformal Prediction with Adversarial Semi-bandit Feedback via Regret Minimization

Junyoung Yang , Kyungmin Kim , Sangdon Park This is my paper

Pith reviewed 2026-05-10 05:04 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords online conformal predictionsemi-bandit feedbackadversarial banditsregret minimizationprediction setsuncertainty quantificationmiscoverage control

0 comments

The pith

Online conformal prediction with partial feedback achieves long-run coverage by linking it to adversarial bandit regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build prediction sets sequentially when the true label is revealed only if it falls inside the current set, a semi-bandit feedback model with an adaptive adversary. It does so by casting the choice of each prediction set as an arm in an adversarial bandit problem and then using a regret-minimizing algorithm to drive the construction. A reader should care because the approach removes the usual full-feedback assumption while still delivering a distribution-free guarantee that the long-run fraction of miscoverages stays below a target level. The key step is proving that the coverage error is bounded by the learner's regret.

Core claim

We formulate online conformal prediction as an adversarial bandit problem by treating each candidate prediction set as an arm. Building on an existing algorithm for adversarial bandits, our method achieves a long-run coverage guarantee by explicitly establishing its connection to the regret of the learner.

What carries the argument

Reduction of conformal set selection to adversarial semi-bandit regret minimization, where each possible prediction set is an arm and feedback arrives only on the chosen arm when the label lies inside it.

If this is right

The miscoverage rate converges to the desired level in the long run without any distributional assumptions on the data.
Prediction sets of controlled average size can be produced online even though full labels are never observed.
The same regret bound applies in both i.i.d. and non-i.i.d. sequences, including adversarial ones.
Any improvement in the underlying adversarial bandit algorithm directly tightens the coverage guarantee.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The regret connection suggests that faster-regret bandit methods could shrink prediction sets without sacrificing coverage.
The semi-bandit model may apply directly to active-learning or costly-label streams where only uncertain points receive labels.
One could test whether the same reduction yields finite-time coverage bounds rather than only asymptotic ones.

Load-bearing premise

The true label is revealed exactly when it lies inside the constructed prediction set, and the adversary is adaptive but the feedback model remains semi-bandit.

What would settle it

Run the algorithm for many steps against an adaptive adversary that forces miscoverage whenever possible under the semi-bandit rule; if the long-run miscoverage rate exceeds the target level by more than a constant factor, the claimed guarantee fails.

Figures

Figures reproduced from arXiv: 2604.17984 by Junyoung Yang, Kyungmin Kim, Sangdon Park.

**Figure 2.** Figure 2: Long-run miscoverage MC(t) and inefficiency Ineff(t) as a function of time t ∈ [T] on the ImageNet dataset under both the i.i.d. and non-i.i.d. settings, averaged over 50 independent runs We conduct experiments on both classification and regression tasks, where both i.i.d. and non-i.i.d. settings are considered for each task. In Experiment 1 (Section 5.1), we consider classification task using ImageNet dat… view at source ↗

**Figure 3.** Figure 3: Long-run miscoverage MC(T) and inefficiency Ineff(T) on the Airfoil Self-Noise dataset under the i.i.d. and covariate shift settings, averaged over 50 independent runs 5.2 EXPERIMENT 2: COVERAGE VS INEFFICIENCY ON AIRFOIL SELF-NOISE For both the i.i.d. and covariate-shift settings, we evaluate the long-run miscoverage MC(T) and inefficiency Ineff(T) on the Airfoil Self-Noise dataset, averaged over 50 indep… view at source ↗

**Figure 4.** Figure 4: Overview of OCP-Unlock+ for online conformal prediction with semi-bandit feedback We now state a high-probability bound for OCP-Unlock+. Theorem 5. For any α ∈ (0, 0.5), T ∈ N, K ≥ 5, and δ ∈ (0, 1), run OCP-Unlock+ (Algorithm 1) with β = q ln K KT , γ = 1.05q K ln K T , and η = 0.95q ln K KT . Then, with probability at least 1 − δ, the regret Reg(T) with respect to the loss defined in Eq. 5 and Eq. 7 sati… view at source ↗

**Figure 5.** Figure 5: Ablation study on ImageNet: Long-run miscoverage [PITH_FULL_IMAGE:figures/full_fig_p040_5.png] view at source ↗

**Figure 6.** Figure 6: Ablation study on Airfoil Self-Noise: Long-run miscoverage [PITH_FULL_IMAGE:figures/full_fig_p041_6.png] view at source ↗

**Figure 7.** Figure 7: Sensitivity of OCP-Unlock+ to α on ImageNet. Box plots summarize the long-run miscoverage MC(T) across 50 independent runs for α ∈ {0.1, 0.2, 0.3, 0.4} under the i.i.d. (left) and non-i.i.d. (right) settings. Each box shows the interquartile range with the median marked inside, while the whiskers indicate the spread of the results. (a) MC(T) (i.i.d.) (b) MC(T) (Covariate Shift) [PITH_FULL_IMAGE:figures/fu… view at source ↗

**Figure 8.** Figure 8: Sensitivity of OCP-Unlock+ to α on Airfoil Self-Noise. Box plots summarize the longrun miscoverage MC(T) across 50 independent runs for α ∈ {0.1, 0.2, 0.3, 0.4} under the i.i.d. (left) and covariate shift (right) settings. Each box shows the interquartile range with the median marked inside, while the whiskers indicate the spread of the results. 42 [PITH_FULL_IMAGE:figures/full_fig_p042_8.png] view at source ↗

read the original abstract

Uncertainty quantification is crucial in safety-critical systems, where decisions must be made under uncertainty. In particular, we consider the problem of online uncertainty quantification, where data points arrive sequentially. Online conformal prediction is a principled online uncertainty quantification method that dynamically constructs a prediction set at each time step. While existing methods for online conformal prediction provide long-run coverage guarantees without any distributional assumptions, they typically assume a full feedback setting in which the true label is always observed. In this paper, we propose a novel learning method for online conformal prediction with partial feedback from an adaptive adversary-a more challenging setup where the true label is revealed only when it lies inside the constructed prediction set. Specifically, we formulate online conformal prediction as an adversarial bandit problem by treating each candidate prediction set as an arm. Building on an existing algorithm for adversarial bandits, our method achieves a long-run coverage guarantee by explicitly establishing its connection to the regret of the learner. Finally, we empirically demonstrate the effectiveness of our method in both independent and identically distributed (i.i.d.) and non-i.i.d. settings, showing that it successfully controls the miscoverage rate while maintaining a reasonable size of the prediction set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

They reduce online conformal prediction under semi-bandit feedback to adversarial regret minimization and pull the coverage guarantee from the regret bound.

read the letter

The main takeaway is that this paper turns the partial-feedback version of online conformal prediction into a standard adversarial bandit problem and derives the long-run coverage directly from a no-regret guarantee. They treat candidate prediction sets as arms, run an existing bandit algorithm, and note that the best fixed arm in hindsight has miscoverage exactly alpha, so sublinear regret gives the desired bound even against an adaptive adversary. That reduction is the clearest new piece relative to prior full-feedback online conformal work. It is a clean way to handle the harder feedback model where the label appears only when it falls inside the set. The paper shows the approach works in both i.i.d. and non-i.i.d. regimes and keeps prediction sets reasonable in size. The stress-test note confirms the 0-1 loss is observable each round and the comparator arm exists, so the math does not appear circular or load-bearing on fitted quantities. On the downside, the empirical section is thin on baseline details and run-to-run variance, which makes it harder to judge practical gains over simpler alternatives. The abstract also leaves the precise regret bound and feedback observability implicit, though the provided stress-test logic suggests they line up. This is useful for people already working on online uncertainty quantification or bandit-style conformal methods in sequential decision settings. It is not reshaping the broader field, but the reduction is worth seeing if you care about partial feedback. I would send it to a serious referee because the central claim is grounded and the setting is relevant.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a method for online conformal prediction under adversarial semi-bandit feedback, where the true label is revealed only if it falls inside the constructed prediction set. It formulates the problem as an adversarial bandit by treating candidate prediction sets as arms, applies an existing no-regret algorithm, and derives a long-run coverage guarantee (miscoverage at most alpha plus sublinear regret) from the connection to the learner's regret against the best fixed arm in hindsight. The 0-1 miscoverage loss is treated as observable under the given feedback model. Empirical results are reported for both i.i.d. and non-i.i.d. settings.

Significance. If the regret connection holds, this provides a clean theoretical extension of online conformal prediction to partial-feedback settings that arise in many practical applications. The explicit reduction to standard adversarial bandit regret bounds (with the (1-alpha)-quantile arm as comparator) is a strength, as is the absence of distributional assumptions. The empirical validation across data regimes adds relevance, though limited baseline and variance details reduce the strength of the practical claims.

minor comments (2)

[Empirical Evaluation] Empirical Evaluation section: the description of baselines is insufficient; it is unclear which prior online conformal prediction methods are used for comparison and how they are modified for the semi-bandit feedback model.
[Empirical Evaluation] Empirical Evaluation section: no standard deviations, confidence intervals, or multiple-run statistics are reported for the miscoverage rates and average set sizes, which weakens assessment of result stability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of the clean reduction to adversarial bandit regret, and recommendation for minor revision. We address the concern raised in the significance assessment below.

read point-by-point responses

Referee: limited baseline and variance details reduce the strength of the practical claims

Authors: We agree that the empirical section would benefit from additional baselines and variability reporting. In the revised manuscript we will add comparisons against simple baselines (e.g., always outputting the full label space and a fixed-size quantile set) and report mean performance together with standard deviations across at least ten independent random seeds for both the i.i.d. and non-i.i.d. regimes. These changes will strengthen the practical claims while leaving the theoretical results unchanged. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper models online conformal prediction under semi-bandit feedback by treating candidate prediction sets as arms in an adversarial bandit problem, with the 0-1 miscoverage loss directly observable from whether the label is revealed. It invokes an existing no-regret algorithm for adversarial bandits and derives the long-run coverage guarantee from the standard regret bound relative to the best fixed arm in hindsight. The comparator arm is defined as the (1-alpha)-quantile arm whose hindsight miscoverage equals alpha, so the average loss bound directly yields the desired guarantee without any self-referential fitting, ansatz smuggling, or load-bearing self-citation. The reduction relies on external bandit theory applied to the explicitly observable loss, making the derivation self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard regret bounds from adversarial bandit literature and the semi-bandit feedback assumption; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

standard math Standard regret bounds hold for the chosen adversarial bandit algorithm under adaptive adversaries.
Invoked when linking low regret to long-run coverage guarantee.
domain assumption The feedback model is exactly semi-bandit: label observed iff inside the chosen prediction set.
Core modeling assumption stated in the problem setup.

pith-pipeline@v0.9.0 · 5507 in / 1246 out tokens · 41749 ms · 2026-05-10T05:04:08.685274+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Conformal Prediction for Non-Exchangeable Panel Data
stat.ML 2026-05 unverdicted novelty 6.0

An online conformal prediction framework for non-exchangeable panel data that forms prediction sets using related units' contemporaneous data with adaptive similarity weights and miscoverage levels to deliver stepwise...

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper

[1]

Haosen Ge, Hamsa Bastani, and Osbert Bastani

Accessed: 2025-09-22. Haosen Ge, Hamsa Bastani, and Osbert Bastani. Stochastic Online Conformal Prediction with Semi-bandit Feedback. InICML,

work page 2025
[2]

Language Models with Conformal Factuality Guaran- tees

11 Published as a conference paper at ICLR 2026 Christopher Mohri and Tatsunori Hashimoto. Language Models with Conformal Factuality Guaran- tees. InICML,

work page 2026
[3]

At each round t∈[T] , the learner chooses an arm πt from a given set Π, where we refer to the problem as K-armed bandit problem when |Π|=K

12 Published as a conference paper at ICLR 2026 A PRELIMINARY A.1 ADVERSARIALBANDITPROBLEM The multi-armed bandit problem is a sequential game between a learner and an environment (Slivkins, 2019; Lattimore & Szepesvári, 2020), where the game proceeds over T∈N rounds. At each round t∈[T] , the learner chooses an arm πt from a given set Π, where we refer t...

work page 2026
[4]

Theorem 2(Auer et al

by controlling the variance of regret. Theorem 2(Auer et al. (2002)).For any δ∈(0,1) and ℓt(π)∈[ℓ min, ℓmax], run EXP3.P (Algo- rithm

work page 2002
[5]

Then, with probability at least 1−δ, the regretReg(T)satisfies Reg(T)≤(ℓ max −ℓ min) r T K lnK ln(δ−1) + 5.15 √ T KlnK !

with β= q lnK KT , γ= 1.05 q KlnK T ,andη= 0.95 q lnK KT . Then, with probability at least 1−δ, the regretReg(T)satisfies Reg(T)≤(ℓ max −ℓ min) r T K lnK ln(δ−1) + 5.15 √ T KlnK ! . Algorithm 2EXP3.P 1:procedureEXP3.P(Π, T, η, γ, β) 2:Initialize cumulative estimated gain: ˜G0(π)←0for allπ∈Π 3:fort∈ {1, . . . , T}do 4:Update strategy:p t(π)←(1−γ) exp(η ˜Gt...

work page 2026
[6]

T Cgap(T)o(T) −1 | {z } CMC(T) . 15 Published as a conference paper at ICLR 2026 CEXP3.P-STYLEALGORITHMS FORONLINECONFORMALPREDICTION WITHADVERSARIALPARTIALFEEDBACK We propose three modifications to the EXP3.P algorithm tailored to the setting of online conformal prediction with adversarial partial feedback: OCP-Bandit (Algorithm 3), OCP-Unlock (Algo- rit...

work page 2026
[7]

for online conformal prediction with adversarial partial feedback. Accordingly, irrespective of the miscoverage feedback mt(πt) from the chosen arm, the unlocking set is defined as a singleton set containing the chosen armπ t: Πt(πt) :={π t}.(13) The biased gain estimator˜gt(π|Πt(πt))is then defined as in Eq. 9: ˜gt (π|Π t(πt)) :=1(m t(πt) = 0)| {z } no u...

work page 2002
[8]

Then, with probability at least1−δ, the regretReg(T)with respect to the loss defined in Eq

with β= q lnK KT , γ= 1.05 q KlnK T ,andη= 0.95 q lnK KT . Then, with probability at least1−δ, the regretReg(T)with respect to the loss defined in Eq. 5 and Eq. 7 satisfies Reg(T)≤(ℓ max −ℓ min) 5.15 √ T KlnK+ r T K lnK ln(δ−1) ! . See Appendix D for proof and additional details. 16 Published as a conference paper at ICLR 2026 Algorithm 3OCP-Bandit 1:proc...

work page 2026
[9]

Then, with probability at least 1−δ , the regretReg(T)with respect to the loss defined in Eq

with β= q lnK KT , γ= 1.05 q KlnK T ,andη= 0.95 q lnK KT . Then, with probability at least 1−δ , the regretReg(T)with respect to the loss defined in Eq. 5 and Eq. 7 satisfies Reg(T)≤(ℓ max −ℓ min) 5.15 √ T KlnK+ r T K lnK ln(δ−1) +o(T) −1T ! . See Appendix F for proof and additional details. 17 Published as a conference paper at ICLR 2026 Algorithm 4OCP-U...

work page 2026
[10]

However, we further strengthen themiscoverage-firstrationale in the design of the biased gain estimator˜gt(π|Πt(πt))(Eq

as OCP-Unlock: Πt(πt) := ( Πifm t(πt) = 0 {π∈Π :π≥π t}ifm t(πt) = 1 . However, we further strengthen themiscoverage-firstrationale in the design of the biased gain estimator˜gt(π|Πt(πt))(Eq. 11), defined as follows: ˜gt (π|Π t(πt)) :=1(m t(πt) = 0)| {z } full unlocking ×(A) +1(m t(πt) = 1)| {z } partial unlocking ×(B), where (A) =1(π∈Π ∗ t ) ( gt(π)P ˜π∈Π...

work page 2026
[11]

Note that OCP-Unlock+ achieves the tightest bound as long as( √ K− √ C) √ TlnK >2o(T) −1T

See Appendix G for proof and additional details. Note that OCP-Unlock+ achieves the tightest bound as long as( √ K− √ C) √ TlnK >2o(T) −1T . In addition, using the same argument based on Lemma 1 and Theorem 5 to derive the high-probability upper bound of MC(T)−α (Theorem 1), we can show that both long-run coverage results of OCP-BanditandOCP-Unlockare at ...

work page 2026
[12]

Note that Eq

satisfies the following high-probability inequality for all π∈Π . Note that Eq. 14 is equivalent to Eq. 9, which is re-expressed to clarify the difference between the biased gain estimators of OCP-Unlock and OCP-Unlock+, as detailed in Appendix C. Lemma 2.Let β≤1 and δ∈(0,1) . Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 g...

work page 2026
[13]

Plugging Eq

≤ −(1−γ−(1 +β)ηK) max˜π∈Π ˜GT(˜π) +lnK η (∵Property of log-sum-exponential) ≤ −(1−γ−(1 +β)ηK) max˜π∈Π TX t=1 gt(˜π) +ln(Kδ−1) β + lnK η , (22) 21 Published as a conference paper at ICLR 2026 where the last inequality holds due to the Lemma 2 and the initial assumption that γ≤ 1 2 and (1 +β)Kη≤γ . Plugging Eq. 22 into Eq. 19, the following holds with proba...

work page 2026
[14]

Therefore, ℓt(π1)≤ℓ t(π2),∀π 1 ∈Π ∗ t ,∀π 2 ∈(Π ∗ t )c.(26) •ℓ max = 1−α(1−α)− cα 1+α(1−2α) o(T) −1 •ℓ min =α+α(1−α)− 1 + 1 1−α (cα)o(T) −1 23 Published as a conference paper at ICLR 2026 E.3 PROPERTIES OF THENORMALIZEDGAIN The followings are properties of the normalized gaingt(π) (Eq

work page 2026
[15]

1.π∈Π t(πt)⊂(Π ∗ t )c:ℓ t(π)≥ℓ t(πt)⇔g t(π)≤g t(πt)(∵Eq.25) 2.π∈Π t(πt)c: ˜gt(π)−g t(πt) = cα o(T) −1 1+α(1−2α)π − o(T) −1 1+α(1−2α)πt (1−2α)(1−α) + (1 + 1 1−α − 1 1+α(1−2α))(cα)o(T) −1 =o(T) −1 (∵Taylor expansion) Therefore, gt(π)≤g t(πt)∀π∈Π(π t), ˜gt(π)≤g t(πt) +o(T) −1 ∀π∈Π(π t)c.(28) 24 Published as a conference paper at ICLR 2026 F PROOF OFTHEOREM4 ...

work page 2026
[16]

Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β

Lemma 3.Let β≤1 and δ∈(0,1) . Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β . Proof.Step 1: Useful Decomposition. Let E t be the expectation with respect to the distribution pt, from which πt is sampled. For each π∈Π, the biased gain estimator can be decomposed as the following: ˜gt (...

work page 2026
[17]

Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π

βgt(π)−β×(B1) =βg t(π)−β gt(π)P ˜π∈Πt(πt) pt(˜π)≤1 Given π∈Π , Claim 1 and Claim 2 always hold irrespective of the choice of πt by the algorithm. Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π. Step 3: ShowEexp hPT t=1 (∆t(π)−β×(C2)) i ≤1. Therefore, since (1)exp(x)≤1 +x+x 2 forx≤1and (2)∆ t(π)≤1, forβ≤1, we have E t [exp (∆t(π)−β×(C2))]≤E t h...

work page 2026
[18]

Plugging Eq

≤ −(1−γ−(1 +β)ηK) max˜π∈Π ˜GT(˜π) +lnK η (∵Property of log-sum-exponential) ≤ −(1−γ−(1 +β)ηK) max˜π∈Π TX t=1 gt(˜π) +ln(Kδ−1) β + lnK η , (37) 29 Published as a conference paper at ICLR 2026 where the last inequality holds due to the Lemma 3 and the initial assumption that γ≤ 1 2 and (1 +β)Kη≤γ . Plugging Eq. 37 into Eq. 34, the following holds with proba...

work page 2026
[19]

Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β

Lemma 4.Let β≤1 and δ∈(0,1) . Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β . Proof.Step 1: Useful Decomposition. Let E t be the expectation with respect to the distribution pt, from which πt is sampled. For each π∈Π, the biased gain estimator can be decomposed as the following: ˜gt (...

work page 2026
[20]

Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π

–π∈Π t(πt) βgt(π)−β×(B1) =βg t(π)−βg t(π) = 0 –π∈Π t(πt)c βgt(π)−β×(B1) =βg t(π)−β˜gt(π)≤1 (∵β≤1andg t(π),˜gt(π)∈[0,1]) Given π∈Π , Claim 1 and Claim 2 always hold irrespective of the choice of πt by the algorithm. Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π. Step 3: ShowEexp hPT t=1 (∆t(π)−β×(C2)) i ≤1. Therefore, since (1)exp(x)≤1 +x+x 2 ...

work page 2026
[21]

≤η 2 1 + 2β 1−γ X ˜π∈Π ˜gt(π|Πt(πt)), (46) where the last inequality holds due to the following. •m(π t) = 0 E ˜π∼ωt η2˜gt(˜π|Πt(πt))2 =η2 ( P ˜π∈Π∗ t wt(˜π) gt(˜π)+βP π′ ∈Π∗ t pt(π′) +β !2 +P ˜π∈(Π∗ t )c wt(˜π) gt(˜π)+ βP π′ ≤˜πpt(π′) 2 ) ≤η2 nP ˜π∈Π∗ t wt(˜π) 1+2β pt(˜π) ˜gt(˜π|Πt(πt))+P ˜π∈(Π∗ t )c wt(˜π) 1+2β pt(˜π) ˜gt(˜π|Πt(πt)) o (∵gt(π)∈[0,1]) ≤η2...

work page 2026
[22]

Plugging Eq

≤ −(1−r−(1 + 2β)ηK)max˜π∈Π ˜Gt(˜π) +lnK η (∵Property of log-sum-exponential) ≤ −(1−r−(1 + 2β)ηK)max˜π∈Π TX t=1 gt(˜π) +ln(Kδ−1) β + lnK η , (47) where the last inequality holds due to the Lemma 4 and the initial assumption that γ≤ 1 2 and (1 + 2β)Kη≤γ . Plugging Eq. 47 into Eq. 44, the following holds with probability 1− δ K for all π∈Π: Rπ(T)≤ ℓmax −ℓ mi...

work page 2026
[23]

37 Published as a conference paper at ICLR 2026 H.3 CHOICE OFHYPERPARAMETERc For all experiments, we set c= 40 in the loss function ℓt(π;α, c) =d t(π;α)+a t(π;α, c) , as defined in Eq. 5, Eq. 6, and Eq

work page 2026
[24]

For the discretization level, we use K= 200 for ImageNet and K= 20 for Airfoil Self-Noise

We report the long-run miscoverageMC(T) and inefficiencyIneff(T) , averaged over 50 independent trials, across variousT and α settings for both classification and regression tasks. For the discretization level, we use K= 200 for ImageNet and K= 20 for Airfoil Self-Noise. The results show that our method consistently achieves decent performance for a fixed...

work page 2026
[25]

(top) and non-i.i.d

(a)MC(t)(i.i.d.) (b)Ineff(t)(i.i.d.) (c)MC(t)(Non-i.i.d.) (d)Ineff(t)(Non-i.i.d.) Figure 5: Ablation study on ImageNet: Long-run miscoverage MC(t) and inefficiency Ineff(t) as a function of time t∈[T] , under the i.i.d. (top) and non-i.i.d. (bottom) settings. The solid curves denote the average over 50 independent runs, and the shaded regions indicate the...

work page 2026
[26]

Box plots summarize the long-run miscoverage MC(T) across 50 independent runs for α∈ {0.1,0.2,0.3,0.4} under the i.i.d

(a)MC(T)(i.i.d.) (b)MC(T)(Non-i.i.d.) Figure 7: Sensitivity of OCP-Unlock+ to α on ImageNet. Box plots summarize the long-run miscoverage MC(T) across 50 independent runs for α∈ {0.1,0.2,0.3,0.4} under the i.i.d. (left) and non-i.i.d. (right) settings. Each box shows the interquartile range with the median marked inside, while the whiskers indicate the sp...

work page 2026

[1] [1]

Haosen Ge, Hamsa Bastani, and Osbert Bastani

Accessed: 2025-09-22. Haosen Ge, Hamsa Bastani, and Osbert Bastani. Stochastic Online Conformal Prediction with Semi-bandit Feedback. InICML,

work page 2025

[2] [2]

Language Models with Conformal Factuality Guaran- tees

11 Published as a conference paper at ICLR 2026 Christopher Mohri and Tatsunori Hashimoto. Language Models with Conformal Factuality Guaran- tees. InICML,

work page 2026

[3] [3]

At each round t∈[T] , the learner chooses an arm πt from a given set Π, where we refer to the problem as K-armed bandit problem when |Π|=K

12 Published as a conference paper at ICLR 2026 A PRELIMINARY A.1 ADVERSARIALBANDITPROBLEM The multi-armed bandit problem is a sequential game between a learner and an environment (Slivkins, 2019; Lattimore & Szepesvári, 2020), where the game proceeds over T∈N rounds. At each round t∈[T] , the learner chooses an arm πt from a given set Π, where we refer t...

work page 2026

[4] [4]

Theorem 2(Auer et al

by controlling the variance of regret. Theorem 2(Auer et al. (2002)).For any δ∈(0,1) and ℓt(π)∈[ℓ min, ℓmax], run EXP3.P (Algo- rithm

work page 2002

[5] [5]

Then, with probability at least 1−δ, the regretReg(T)satisfies Reg(T)≤(ℓ max −ℓ min) r T K lnK ln(δ−1) + 5.15 √ T KlnK !

with β= q lnK KT , γ= 1.05 q KlnK T ,andη= 0.95 q lnK KT . Then, with probability at least 1−δ, the regretReg(T)satisfies Reg(T)≤(ℓ max −ℓ min) r T K lnK ln(δ−1) + 5.15 √ T KlnK ! . Algorithm 2EXP3.P 1:procedureEXP3.P(Π, T, η, γ, β) 2:Initialize cumulative estimated gain: ˜G0(π)←0for allπ∈Π 3:fort∈ {1, . . . , T}do 4:Update strategy:p t(π)←(1−γ) exp(η ˜Gt...

work page 2026

[6] [6]

T Cgap(T)o(T) −1 | {z } CMC(T) . 15 Published as a conference paper at ICLR 2026 CEXP3.P-STYLEALGORITHMS FORONLINECONFORMALPREDICTION WITHADVERSARIALPARTIALFEEDBACK We propose three modifications to the EXP3.P algorithm tailored to the setting of online conformal prediction with adversarial partial feedback: OCP-Bandit (Algorithm 3), OCP-Unlock (Algo- rit...

work page 2026

[7] [7]

for online conformal prediction with adversarial partial feedback. Accordingly, irrespective of the miscoverage feedback mt(πt) from the chosen arm, the unlocking set is defined as a singleton set containing the chosen armπ t: Πt(πt) :={π t}.(13) The biased gain estimator˜gt(π|Πt(πt))is then defined as in Eq. 9: ˜gt (π|Π t(πt)) :=1(m t(πt) = 0)| {z } no u...

work page 2002

[8] [8]

Then, with probability at least1−δ, the regretReg(T)with respect to the loss defined in Eq

with β= q lnK KT , γ= 1.05 q KlnK T ,andη= 0.95 q lnK KT . Then, with probability at least1−δ, the regretReg(T)with respect to the loss defined in Eq. 5 and Eq. 7 satisfies Reg(T)≤(ℓ max −ℓ min) 5.15 √ T KlnK+ r T K lnK ln(δ−1) ! . See Appendix D for proof and additional details. 16 Published as a conference paper at ICLR 2026 Algorithm 3OCP-Bandit 1:proc...

work page 2026

[9] [9]

Then, with probability at least 1−δ , the regretReg(T)with respect to the loss defined in Eq

with β= q lnK KT , γ= 1.05 q KlnK T ,andη= 0.95 q lnK KT . Then, with probability at least 1−δ , the regretReg(T)with respect to the loss defined in Eq. 5 and Eq. 7 satisfies Reg(T)≤(ℓ max −ℓ min) 5.15 √ T KlnK+ r T K lnK ln(δ−1) +o(T) −1T ! . See Appendix F for proof and additional details. 17 Published as a conference paper at ICLR 2026 Algorithm 4OCP-U...

work page 2026

[10] [10]

However, we further strengthen themiscoverage-firstrationale in the design of the biased gain estimator˜gt(π|Πt(πt))(Eq

as OCP-Unlock: Πt(πt) := ( Πifm t(πt) = 0 {π∈Π :π≥π t}ifm t(πt) = 1 . However, we further strengthen themiscoverage-firstrationale in the design of the biased gain estimator˜gt(π|Πt(πt))(Eq. 11), defined as follows: ˜gt (π|Π t(πt)) :=1(m t(πt) = 0)| {z } full unlocking ×(A) +1(m t(πt) = 1)| {z } partial unlocking ×(B), where (A) =1(π∈Π ∗ t ) ( gt(π)P ˜π∈Π...

work page 2026

[11] [11]

Note that OCP-Unlock+ achieves the tightest bound as long as( √ K− √ C) √ TlnK >2o(T) −1T

See Appendix G for proof and additional details. Note that OCP-Unlock+ achieves the tightest bound as long as( √ K− √ C) √ TlnK >2o(T) −1T . In addition, using the same argument based on Lemma 1 and Theorem 5 to derive the high-probability upper bound of MC(T)−α (Theorem 1), we can show that both long-run coverage results of OCP-BanditandOCP-Unlockare at ...

work page 2026

[12] [12]

Note that Eq

satisfies the following high-probability inequality for all π∈Π . Note that Eq. 14 is equivalent to Eq. 9, which is re-expressed to clarify the difference between the biased gain estimators of OCP-Unlock and OCP-Unlock+, as detailed in Appendix C. Lemma 2.Let β≤1 and δ∈(0,1) . Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 g...

work page 2026

[13] [13]

Plugging Eq

≤ −(1−γ−(1 +β)ηK) max˜π∈Π ˜GT(˜π) +lnK η (∵Property of log-sum-exponential) ≤ −(1−γ−(1 +β)ηK) max˜π∈Π TX t=1 gt(˜π) +ln(Kδ−1) β + lnK η , (22) 21 Published as a conference paper at ICLR 2026 where the last inequality holds due to the Lemma 2 and the initial assumption that γ≤ 1 2 and (1 +β)Kη≤γ . Plugging Eq. 22 into Eq. 19, the following holds with proba...

work page 2026

[14] [14]

Therefore, ℓt(π1)≤ℓ t(π2),∀π 1 ∈Π ∗ t ,∀π 2 ∈(Π ∗ t )c.(26) •ℓ max = 1−α(1−α)− cα 1+α(1−2α) o(T) −1 •ℓ min =α+α(1−α)− 1 + 1 1−α (cα)o(T) −1 23 Published as a conference paper at ICLR 2026 E.3 PROPERTIES OF THENORMALIZEDGAIN The followings are properties of the normalized gaingt(π) (Eq

work page 2026

[15] [15]

1.π∈Π t(πt)⊂(Π ∗ t )c:ℓ t(π)≥ℓ t(πt)⇔g t(π)≤g t(πt)(∵Eq.25) 2.π∈Π t(πt)c: ˜gt(π)−g t(πt) = cα o(T) −1 1+α(1−2α)π − o(T) −1 1+α(1−2α)πt (1−2α)(1−α) + (1 + 1 1−α − 1 1+α(1−2α))(cα)o(T) −1 =o(T) −1 (∵Taylor expansion) Therefore, gt(π)≤g t(πt)∀π∈Π(π t), ˜gt(π)≤g t(πt) +o(T) −1 ∀π∈Π(π t)c.(28) 24 Published as a conference paper at ICLR 2026 F PROOF OFTHEOREM4 ...

work page 2026

[16] [16]

Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β

Lemma 3.Let β≤1 and δ∈(0,1) . Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β . Proof.Step 1: Useful Decomposition. Let E t be the expectation with respect to the distribution pt, from which πt is sampled. For each π∈Π, the biased gain estimator can be decomposed as the following: ˜gt (...

work page 2026

[17] [17]

Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π

βgt(π)−β×(B1) =βg t(π)−β gt(π)P ˜π∈Πt(πt) pt(˜π)≤1 Given π∈Π , Claim 1 and Claim 2 always hold irrespective of the choice of πt by the algorithm. Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π. Step 3: ShowEexp hPT t=1 (∆t(π)−β×(C2)) i ≤1. Therefore, since (1)exp(x)≤1 +x+x 2 forx≤1and (2)∆ t(π)≤1, forβ≤1, we have E t [exp (∆t(π)−β×(C2))]≤E t h...

work page 2026

[18] [18]

Plugging Eq

≤ −(1−γ−(1 +β)ηK) max˜π∈Π ˜GT(˜π) +lnK η (∵Property of log-sum-exponential) ≤ −(1−γ−(1 +β)ηK) max˜π∈Π TX t=1 gt(˜π) +ln(Kδ−1) β + lnK η , (37) 29 Published as a conference paper at ICLR 2026 where the last inequality holds due to the Lemma 3 and the initial assumption that γ≤ 1 2 and (1 +β)Kη≤γ . Plugging Eq. 37 into Eq. 34, the following holds with proba...

work page 2026

[19] [19]

Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β

Lemma 4.Let β≤1 and δ∈(0,1) . Then, for each π∈Π , the following holds with probability at least1−δ: TX t=1 gt(π)≤ TX t=1 ˜gt(π|Πt(πt)) + ln(δ−1) β . Proof.Step 1: Useful Decomposition. Let E t be the expectation with respect to the distribution pt, from which πt is sampled. For each π∈Π, the biased gain estimator can be decomposed as the following: ˜gt (...

work page 2026

[20] [20]

Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π

–π∈Π t(πt) βgt(π)−β×(B1) =βg t(π)−βg t(π) = 0 –π∈Π t(πt)c βgt(π)−β×(B1) =βg t(π)−β˜gt(π)≤1 (∵β≤1andg t(π),˜gt(π)∈[0,1]) Given π∈Π , Claim 1 and Claim 2 always hold irrespective of the choice of πt by the algorithm. Then, by letting∆ t(π) :=βg t(π)−β×(C1),∆ t(π)≤1for allπ∈Π. Step 3: ShowEexp hPT t=1 (∆t(π)−β×(C2)) i ≤1. Therefore, since (1)exp(x)≤1 +x+x 2 ...

work page 2026

[21] [21]

≤η 2 1 + 2β 1−γ X ˜π∈Π ˜gt(π|Πt(πt)), (46) where the last inequality holds due to the following. •m(π t) = 0 E ˜π∼ωt η2˜gt(˜π|Πt(πt))2 =η2 ( P ˜π∈Π∗ t wt(˜π) gt(˜π)+βP π′ ∈Π∗ t pt(π′) +β !2 +P ˜π∈(Π∗ t )c wt(˜π) gt(˜π)+ βP π′ ≤˜πpt(π′) 2 ) ≤η2 nP ˜π∈Π∗ t wt(˜π) 1+2β pt(˜π) ˜gt(˜π|Πt(πt))+P ˜π∈(Π∗ t )c wt(˜π) 1+2β pt(˜π) ˜gt(˜π|Πt(πt)) o (∵gt(π)∈[0,1]) ≤η2...

work page 2026

[22] [22]

Plugging Eq

≤ −(1−r−(1 + 2β)ηK)max˜π∈Π ˜Gt(˜π) +lnK η (∵Property of log-sum-exponential) ≤ −(1−r−(1 + 2β)ηK)max˜π∈Π TX t=1 gt(˜π) +ln(Kδ−1) β + lnK η , (47) where the last inequality holds due to the Lemma 4 and the initial assumption that γ≤ 1 2 and (1 + 2β)Kη≤γ . Plugging Eq. 47 into Eq. 44, the following holds with probability 1− δ K for all π∈Π: Rπ(T)≤ ℓmax −ℓ mi...

work page 2026

[23] [23]

37 Published as a conference paper at ICLR 2026 H.3 CHOICE OFHYPERPARAMETERc For all experiments, we set c= 40 in the loss function ℓt(π;α, c) =d t(π;α)+a t(π;α, c) , as defined in Eq. 5, Eq. 6, and Eq

work page 2026

[24] [24]

For the discretization level, we use K= 200 for ImageNet and K= 20 for Airfoil Self-Noise

We report the long-run miscoverageMC(T) and inefficiencyIneff(T) , averaged over 50 independent trials, across variousT and α settings for both classification and regression tasks. For the discretization level, we use K= 200 for ImageNet and K= 20 for Airfoil Self-Noise. The results show that our method consistently achieves decent performance for a fixed...

work page 2026

[25] [25]

(top) and non-i.i.d

(a)MC(t)(i.i.d.) (b)Ineff(t)(i.i.d.) (c)MC(t)(Non-i.i.d.) (d)Ineff(t)(Non-i.i.d.) Figure 5: Ablation study on ImageNet: Long-run miscoverage MC(t) and inefficiency Ineff(t) as a function of time t∈[T] , under the i.i.d. (top) and non-i.i.d. (bottom) settings. The solid curves denote the average over 50 independent runs, and the shaded regions indicate the...

work page 2026

[26] [26]

Box plots summarize the long-run miscoverage MC(T) across 50 independent runs for α∈ {0.1,0.2,0.3,0.4} under the i.i.d

(a)MC(T)(i.i.d.) (b)MC(T)(Non-i.i.d.) Figure 7: Sensitivity of OCP-Unlock+ to α on ImageNet. Box plots summarize the long-run miscoverage MC(T) across 50 independent runs for α∈ {0.1,0.2,0.3,0.4} under the i.i.d. (left) and non-i.i.d. (right) settings. Each box shows the interquartile range with the median marked inside, while the whiskers indicate the sp...

work page 2026