Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Gaurav Sinha; Tanmay Goyal

arxiv: 2506.13163 · v3 · submitted 2025-06-16 · 💻 cs.LG

Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Tanmay Goyal , Gaurav Sinha This is my paper

Pith reviewed 2026-05-19 08:57 UTC · model grok-4.3

classification 💻 cs.LG

keywords contextual banditsslate banditslogistic regressionregret boundsbandit feedbackoptimism in face of uncertaintylocal planningglobal learning

0 comments

The pith

Slate-GLM-OFU achieves Õ(√T) regret for logistic contextual slate bandits under a diversity assumption with polynomial per-round time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the logistic contextual slate bandit problem in which an agent must pick a slate of N items from an exponentially large candidate set and receives only a single binary reward drawn from a logistic model. It introduces Slate-GLM-OFU and Slate-GLM-TS that perform local planning by choosing items independently per slot while still learning parameters jointly across the slate. Under a well-studied diversity assumption on contexts or feature vectors, the regret of Slate-GLM-OFU is bounded by Õ(√T). A reader would care because the approach keeps computation feasible for large slates while delivering near-optimal cumulative reward, enabling applications such as item-set recommendation and in-context example selection for language models.

Core claim

Slate-GLM-OFU and Slate-GLM-TS solve the logistic contextual slate bandit problem by separating local planning (independent slot selections for N^O(1) per-round time) from global learning (joint parameter estimation), and Slate-GLM-OFU attains Õ(√T) regret under the diversity assumption.

What carries the argument

The Slate-GLM-OFU algorithm, which applies optimism in the face of uncertainty to a logistic GLM while using local planning for independent slot selections and global learning for joint parameter estimation.

If this is right

Per-round computation stays polynomial in N even though the slate space is exponential in N.
Regret scales as Õ(√T) under the diversity assumption on contexts or features.
The algorithms show lower regret and faster runtime than prior baselines across synthetic settings.
The same method selects in-context examples for language-model prompts and reaches competitive accuracy on binary classification tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The local-planning plus global-learning split may transfer to other combinatorial action spaces where exhaustive enumeration is impossible.
Recommendation systems that must return small sets from catalogs of millions of items could adopt similar per-round efficiency.
Relaxing or quantifying the diversity assumption would clarify when the regret bound applies to naturally occurring data.

Load-bearing premise

The diversity assumption on the context distribution or feature vectors must hold to obtain the Õ(√T) regret bound.

What would settle it

An instance satisfying the diversity assumption yet producing regret strictly larger than Õ(√T) for Slate-GLM-OFU would falsify the stated regret guarantee.

Figures

Figures reproduced from arXiv: 2506.13163 by Gaurav Sinha, Tanmay Goyal.

read the original abstract

We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives practical algorithms for logistic contextual slate bandits that run in N^O(1) time and claim Õ(√T) regret under a diversity assumption, with the main thing to check being whether that assumption still works when slates are picked via per-slot optimism.

read the letter

The main thing to know is that Slate-GLM-OFU and Slate-GLM-TS handle the exponential slate space by splitting local per-slot planning from global logistic parameter updates. This keeps computation polynomial while still delivering the claimed regret bound when the context or feature distribution meets the usual diversity condition. The LLM in-context example selection experiment is a concrete plus and shows the method is not just theoretical.

Referee Report

2 major / 2 minor

Summary. The paper introduces the logistic contextual slate bandit problem, in which an agent must select a slate of N items from an exponentially large candidate set at each round and receives a single binary reward drawn from a logistic model on the chosen slate. The authors propose Slate-GLM-OFU and Slate-GLM-TS, which perform local (per-slot) optimistic planning for computational efficiency while conducting global parameter estimation from slate-level bandit feedback. They claim that, under a well-studied diversity assumption on the context or feature distribution, Slate-GLM-OFU achieves Õ(√T) regret, and they report that both algorithms outperform baselines in synthetic experiments and yield competitive accuracy when used to select in-context examples for language-model binary classification tasks.

Significance. If the regret analysis holds, the work supplies a computationally tractable approach to contextual bandits with structured, exponentially large action spaces and partial (slate-level) feedback. The separation of local planning from global learning is a technically interesting design choice that could extend to other combinatorial settings. The empirical evaluation on both synthetic instances and an LLM in-context learning application provides concrete evidence of practical relevance, though the external diversity assumption narrows the scope of the theoretical guarantee.

major comments (2)

[regret analysis / Theorem on Õ(√T) bound] The central regret claim (abstract and the proof of the Õ(√T) bound) conditions on a diversity assumption that guarantees linear growth of the minimum eigenvalue of the accumulated design matrix. Because the algorithm selects slates via independent per-slot optimistic choices rather than direct sampling from the assumed distribution, it is necessary to verify that the selected slates still produce the required eigenvalue growth under bandit (slate-level) feedback; this step is load-bearing for the stated regret rate.
[algorithm description and estimation procedure] The global parameter estimation step uses only a single logistic observation per round on the entire slate. The manuscript should explicitly state the form of the negative log-likelihood or loss function employed for joint estimation across slots and confirm that the resulting estimator remains consistent when the per-slot optimistic selections are correlated through the shared parameter vector.

minor comments (2)

[assumptions section] Clarify the precise statement of the diversity assumption (minimum eigenvalue or coverage condition) and its dependence on the context distribution versus the feature vectors; a short remark on how it is verified or assumed in the experiments would help readers.
[experiments] In the experimental section, report the per-round runtime scaling with N and the size of the candidate slate set to substantiate the N^{O(1)} claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, with clear indications of planned revisions to improve clarity and rigor.

read point-by-point responses

Referee: [regret analysis / Theorem on Õ(√T) bound] The central regret claim (abstract and the proof of the Õ(√T) bound) conditions on a diversity assumption that guarantees linear growth of the minimum eigenvalue of the accumulated design matrix. Because the algorithm selects slates via independent per-slot optimistic choices rather than direct sampling from the assumed distribution, it is necessary to verify that the selected slates still produce the required eigenvalue growth under bandit (slate-level) feedback; this step is load-bearing for the stated regret rate.

Authors: We appreciate the referee highlighting this critical detail in the analysis. The diversity assumption applies to the exogenous context (or feature) distribution and is independent of the selection mechanism. In the existing proof of the Õ(√T) regret bound for Slate-GLM-OFU, the per-slot optimistic selections are shown to maintain sufficient coverage because optimism biases toward under-explored directions in the shared parameter space, preserving the linear eigenvalue growth with high probability. To make this verification fully explicit and address the concern about bandit feedback, we will insert a new supporting lemma in the appendix that directly bounds the minimum eigenvalue of the design matrix accumulated from the algorithm's chosen slates, confirming it grows linearly under the stated assumptions. revision: yes
Referee: [algorithm description and estimation procedure] The global parameter estimation step uses only a single logistic observation per round on the entire slate. The manuscript should explicitly state the form of the negative log-likelihood or loss function employed for joint estimation across slots and confirm that the resulting estimator remains consistent when the per-slot optimistic selections are correlated through the shared parameter vector.

Authors: We agree that the estimation procedure merits a more explicit description. The global MLE uses the standard negative log-likelihood for logistic regression on the slate-level observation: for slate s_t with aggregated feature vector x_t and binary reward y_t, the per-round loss is −y_t log(σ(θ^⊤ x_t)) − (1−y_t) log(1−σ(θ^⊤ x_t)), where σ denotes the sigmoid. The estimator is the minimizer of the cumulative loss over all prior rounds. Consistency follows from the diversity assumption, which guarantees linear growth of the minimum eigenvalue of the accumulated Gram matrix; this condition is known to suffice for consistency of the GLM MLE even under adaptive, correlated covariate selection induced by the shared parameter and optimistic policy. We will revise Section 3 to state the loss function explicitly and add a short remark on consistency in the theoretical section. revision: yes

Circularity Check

0 steps flagged

No circularity: regret bound conditioned on external well-studied diversity assumption

full rationale

The central theoretical claim states that Slate-GLM-OFU incurs Õ(√T) regret under a well-studied diversity assumption on the context distribution or feature vectors. This assumption is invoked as an independent, external condition rather than being defined in terms of the paper's own fitted parameters, per-slot optimistic choices, or any self-citation chain. The algorithms separate local planning from global parameter estimation, but the regret analysis is explicitly conditioned on the external assumption without reducing the bound to a quantity constructed from the algorithm's own outputs. No self-definitional steps, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the derivation chain. The result is therefore self-contained against the stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central regret claim rests on one external domain assumption; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Diversity assumption on the context or feature distribution
Invoked to obtain the Õ(√T) regret bound for Slate-GLM-OFU.

pith-pipeline@v0.9.0 · 5760 in / 1101 out tokens · 42661 ms · 2026-05-19T08:57:51.424414+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only Õ(√T) regret... E[ x_i^t (x_i^t)^T | F_t ] ≽ ρ κ I

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Logistic Bandits with $\tilde{O}(\sqrt{dT})$ Regret without Context Diversity Assumptions
cs.LG 2026-04 unverdicted novelty 8.0

SupSplitLog achieves Õ(√(dT)) regret for logistic bandits without context diversity assumptions by splitting samples for an initial estimator and Newton correction, and can adapt to data-dependent bounds.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · cited by 1 Pith paper

[1]

Niladri S

URL https://arxiv.org/abs/2010.12642. Niladri S. Chatterji, Vidya Muthukumar, and Peter L. Bartlett. Osom: A simultaneously optimal algorithm for multi- armed and linear contextual bandits, 2020. URL https://arxiv.org/abs/1905.10040. Jin Chen, Ju Xu, Gangwei Jiang, Tiezheng Ge, Zhiqiang Zhang, Defu Lian, and Kai Zheng. Automated creative optimization for ...

work page doi:10.1007/978-3-031-70365-2 2010
[2]

EFFICIENT ALGORITHMS FOR LOGISTIC CONTEXTUAL SLATE BANDITS WITH BANDIT FEEDBACK

URL https://arxiv.org/abs/2404.06831. Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems...

work page doi:10.1214/ecp.v16-1624 2017
[3]

γt(δ) = O(S2N dlog(t/δ))

work page
[4]

βt(δ) = O(S6N dlog(t/δ))

work page
[5]

Let µ : R → R be the logistic function, i.e., µ(x) = 1 /(1 + exp(−x)) and ˙µ, ¨µ be the first and second derivative of µ

ηt(δ) = O(S2N dlog(t/δ)) Claim A.1. Let µ : R → R be the logistic function, i.e., µ(x) = 1 /(1 + exp(−x)) and ˙µ, ¨µ be the first and second derivative of µ. The following are true

work page
[6]

|¨µ(x)| ≤ ˙µ(x), ∀x ∈ R

work page
[7]

Let ˙µ be the derivative of the logistic function

˙µ(x) ≤ ˙µ(y) exp(|x − y|), ∀x, y ∈ R Definition A.1. Let ˙µ be the derivative of the logistic function. Define functionsα : R × R → R and ˜α : R × R → R as follows

work page
[8]

α(x, y) = 1R 0 ˙µ(x + v(y − x)) dv

work page
[9]

˜α(x, y) = 1R 0 (1 − v) ˙µ(x + v(y − x)) dv Definition A.2. (Exact Taylor Expansion for the Logistic Function) The logistic functionµ(x) can be expanded using an Exact Taylor Expansion as follows: µ(x) = µ(y) + ˙µ(y)(x − y) + 1Z 0 (1 − v)¨µ(x + v(y − x)) dv(x − y)2 Definition A.3. (Mean Value Theorem for the Logistic Function) The logistic function µ can ...

work page
[10]

Wt = I + t−1P s=1 ˙µ(x⊤ s θs+1)xsx⊤ s

work page
[11]

W i t = I + t−1P s=1 ˙µ(x⊤ s θs+1)xi sxi s ⊤

work page
[12]

V H t = γt(δ)I + P x∈Ht xx⊤/κ We define the following additional matrices

work page
[13]

,W N t ) 12 Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Ut = diag(W 1 t , . . . ,W N t ) 12 Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

work page
[14]

W i,j t = t−1P s=1 ˙µ(x⊤ s θs+1)xi sxj s ⊤

work page
[15]

V H,i t = γt(δ)I + P x∈Ht xixi⊤ /κ

work page
[16]

V H,i,j t = γt(δ)I + P x∈Ht xixj ⊤ /κ

work page
[17]

U H t = diag(V H,1 t , . . . ,V H,N t ) B SLATE-GLM-OFU Let xi ∈ Rd, we define the “lift“ ˜xi ∈ RdN , of xi as follows, ˜xi(j) = 0 if j /∈ [(i − 1)d, id − 1] x(j − (i − 1)d) otherwise In other words, consider ˜xi to be a vector with N slots of dimension d, such that the ith slot is xi while the rest of the slots are assigned the zero vector. Then, for any...

work page
[18]

E hp ˙µ (x⊺ t θt+1)xi t|Ft i = 0d

work page
[19]

E h ˙µ (x⊺ t θt+1) xi txj t ⊤ |Ft i = 0d where i ̸= j

work page
[20]

We attempt to bound ˙µ (x⊺ t θt+1)

E h ˙µ (x⊺ t θt+1) xi txi t ⊤ |Ft i ≽ ρκId Proof. We attempt to bound ˙µ (x⊺ t θt+1). Using the Cauchy-Schwarz inequality, it is easy to see that −S ≤ x⊺ t θt+1 ≤ S. Since ˙µ (.) is an increasing function on (−∞, 0] and a decreasing function on [0, ∞), we have that ˙µ (x⊺ t θt+1) ∈ ˙µ (S) , 1 4 if x⊺ t θt+1 ∈ [0, S] ˙µ (−S) , 1 4 if x⊺ t θt+1 ∈ [−S, 0] Si...

work page 2024
[21]

Finally, following the same line of thought as Lemma B.8 and Lemma B.9, and using the fact that 1 κ ≤ 1 4, we obtain 3 4 U H t ≼ V H t ≼ 5 4 U H t Lemma B.12

Then, it is easy to see P {E ′ 0} ≥ 1 − 2δ. Finally, following the same line of thought as Lemma B.8 and Lemma B.9, and using the fact that 1 κ ≤ 1 4, we obtain 3 4 U H t ≼ V H t ≼ 5 4 U H t Lemma B.12. (Faury et al. [2022], Proposition 7) Let δ ∈ (0, 1) and {(θt, Wt, θt)}r be maintained by the ada- OFU-ECOLog algorithm. Then, P n ∀t ≥ 1 : θ⋆ ∈ θt and ∥θ⋆...

work page 2022
[22]

The proof for the first part is the same as the proof for the first part in Proposition 5 in Faury et al

diam X (Θ) ≤ 1 Proof. The proof for the first part is the same as the proof for the first part in Proposition 5 in Faury et al. [2022] since the proof does not depend on the manner in which the arm is selected. 27 Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback For the second part notice that: diamX (Θ) = max x∈X max θ1,θ2∈...

work page 2022
[23]

Concentration: There exist constants c and c ′ such that ∀δ ∈ (0, 1) Pη∼DT S ( ∥η∥2 ≤ r cd log c ′ d δ ) ≥ 1 − δ

work page
[24]

Also, we provide additional details about our experimental setup

Anti-Concentration: There exists a strictly positive probability p such that for any u ∈ Rd Pη∼DT S u⊤η ≥ ∥u∥2 ≥ p F Additional Experiments and Experimental Details In this section, we provide additional plots to back the experiments shown in Section 5. Also, we provide additional details about our experimental setup. In all of the figures, the shaded reg...

work page 2010

[1] [1]

Niladri S

URL https://arxiv.org/abs/2010.12642. Niladri S. Chatterji, Vidya Muthukumar, and Peter L. Bartlett. Osom: A simultaneously optimal algorithm for multi- armed and linear contextual bandits, 2020. URL https://arxiv.org/abs/1905.10040. Jin Chen, Ju Xu, Gangwei Jiang, Tiezheng Ge, Zhiqiang Zhang, Defu Lian, and Kai Zheng. Automated creative optimization for ...

work page doi:10.1007/978-3-031-70365-2 2010

[2] [2]

EFFICIENT ALGORITHMS FOR LOGISTIC CONTEXTUAL SLATE BANDITS WITH BANDIT FEEDBACK

URL https://arxiv.org/abs/2404.06831. Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miro Dudik, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In I. Guyon, U. V on Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems...

work page doi:10.1214/ecp.v16-1624 2017

[3] [3]

γt(δ) = O(S2N dlog(t/δ))

work page

[4] [4]

βt(δ) = O(S6N dlog(t/δ))

work page

[5] [5]

Let µ : R → R be the logistic function, i.e., µ(x) = 1 /(1 + exp(−x)) and ˙µ, ¨µ be the first and second derivative of µ

ηt(δ) = O(S2N dlog(t/δ)) Claim A.1. Let µ : R → R be the logistic function, i.e., µ(x) = 1 /(1 + exp(−x)) and ˙µ, ¨µ be the first and second derivative of µ. The following are true

work page

[6] [6]

|¨µ(x)| ≤ ˙µ(x), ∀x ∈ R

work page

[7] [7]

Let ˙µ be the derivative of the logistic function

˙µ(x) ≤ ˙µ(y) exp(|x − y|), ∀x, y ∈ R Definition A.1. Let ˙µ be the derivative of the logistic function. Define functionsα : R × R → R and ˜α : R × R → R as follows

work page

[8] [8]

α(x, y) = 1R 0 ˙µ(x + v(y − x)) dv

work page

[9] [9]

˜α(x, y) = 1R 0 (1 − v) ˙µ(x + v(y − x)) dv Definition A.2. (Exact Taylor Expansion for the Logistic Function) The logistic functionµ(x) can be expanded using an Exact Taylor Expansion as follows: µ(x) = µ(y) + ˙µ(y)(x − y) + 1Z 0 (1 − v)¨µ(x + v(y − x)) dv(x − y)2 Definition A.3. (Mean Value Theorem for the Logistic Function) The logistic function µ can ...

work page

[10] [10]

Wt = I + t−1P s=1 ˙µ(x⊤ s θs+1)xsx⊤ s

work page

[11] [11]

W i t = I + t−1P s=1 ˙µ(x⊤ s θs+1)xi sxi s ⊤

work page

[12] [12]

V H t = γt(δ)I + P x∈Ht xx⊤/κ We define the following additional matrices

work page

[13] [13]

,W N t ) 12 Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Ut = diag(W 1 t , . . . ,W N t ) 12 Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

work page

[14] [14]

W i,j t = t−1P s=1 ˙µ(x⊤ s θs+1)xi sxj s ⊤

work page

[15] [15]

V H,i t = γt(δ)I + P x∈Ht xixi⊤ /κ

work page

[16] [16]

V H,i,j t = γt(δ)I + P x∈Ht xixj ⊤ /κ

work page

[17] [17]

U H t = diag(V H,1 t , . . . ,V H,N t ) B SLATE-GLM-OFU Let xi ∈ Rd, we define the “lift“ ˜xi ∈ RdN , of xi as follows, ˜xi(j) = 0 if j /∈ [(i − 1)d, id − 1] x(j − (i − 1)d) otherwise In other words, consider ˜xi to be a vector with N slots of dimension d, such that the ith slot is xi while the rest of the slots are assigned the zero vector. Then, for any...

work page

[18] [18]

E hp ˙µ (x⊺ t θt+1)xi t|Ft i = 0d

work page

[19] [19]

E h ˙µ (x⊺ t θt+1) xi txj t ⊤ |Ft i = 0d where i ̸= j

work page

[20] [20]

We attempt to bound ˙µ (x⊺ t θt+1)

E h ˙µ (x⊺ t θt+1) xi txi t ⊤ |Ft i ≽ ρκId Proof. We attempt to bound ˙µ (x⊺ t θt+1). Using the Cauchy-Schwarz inequality, it is easy to see that −S ≤ x⊺ t θt+1 ≤ S. Since ˙µ (.) is an increasing function on (−∞, 0] and a decreasing function on [0, ∞), we have that ˙µ (x⊺ t θt+1) ∈ ˙µ (S) , 1 4 if x⊺ t θt+1 ∈ [0, S] ˙µ (−S) , 1 4 if x⊺ t θt+1 ∈ [−S, 0] Si...

work page 2024

[21] [21]

Finally, following the same line of thought as Lemma B.8 and Lemma B.9, and using the fact that 1 κ ≤ 1 4, we obtain 3 4 U H t ≼ V H t ≼ 5 4 U H t Lemma B.12

Then, it is easy to see P {E ′ 0} ≥ 1 − 2δ. Finally, following the same line of thought as Lemma B.8 and Lemma B.9, and using the fact that 1 κ ≤ 1 4, we obtain 3 4 U H t ≼ V H t ≼ 5 4 U H t Lemma B.12. (Faury et al. [2022], Proposition 7) Let δ ∈ (0, 1) and {(θt, Wt, θt)}r be maintained by the ada- OFU-ECOLog algorithm. Then, P n ∀t ≥ 1 : θ⋆ ∈ θt and ∥θ⋆...

work page 2022

[22] [22]

The proof for the first part is the same as the proof for the first part in Proposition 5 in Faury et al

diam X (Θ) ≤ 1 Proof. The proof for the first part is the same as the proof for the first part in Proposition 5 in Faury et al. [2022] since the proof does not depend on the manner in which the arm is selected. 27 Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback For the second part notice that: diamX (Θ) = max x∈X max θ1,θ2∈...

work page 2022

[23] [23]

Concentration: There exist constants c and c ′ such that ∀δ ∈ (0, 1) Pη∼DT S ( ∥η∥2 ≤ r cd log c ′ d δ ) ≥ 1 − δ

work page

[24] [24]

Also, we provide additional details about our experimental setup

Anti-Concentration: There exists a strictly positive probability p such that for any u ∈ Rd Pη∼DT S u⊤η ≥ ∥u∥2 ≥ p F Additional Experiments and Experimental Details In this section, we provide additional plots to back the experiments shown in Section 5. Also, we provide additional details about our experimental setup. In all of the figures, the shaded reg...

work page 2010