Logistic Bandits with tilde{O}(sqrt{dT}) Regret without Context Diversity Assumptions
Pith reviewed 2026-05-08 12:27 UTC · model grok-4.3
The pith
SupSplitLog achieves Õ(√(dT)) regret in K-armed logistic bandits by splitting samples into an initial estimator and a one-step Newton correction, without any context diversity assumptions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SupSplitLog is the first algorithm for the K-armed logistic bandit problem that achieves Õ(√(dT)) regret without any context diversity assumption, by splitting the collected samples into two disjoint subsets when constructing estimators; one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. SupSplitLog strictly improves on the existing algorithms in terms of the dependence on dimension d in the regret upper bound and can be adapted to deduce a regret bound that grows with
What carries the argument
Sample splitting into two disjoint subsets, one for an initial-point estimator and the other for a Newton-type one-step correction, with a rule that balances their accuracy requirements.
If this is right
- The regret upper bound improves in its dependence on dimension d compared with earlier logistic-bandit algorithms.
- The method adapts directly to a data-dependent complexity measure that avoids explicit dependence on d when contexts lie in a low-dimensional subspace.
- Numerical experiments demonstrate lower regret than existing methods across a range of context distributions.
- The same splitting technique yields regret guarantees that remain valid even when minimum-eigenvalue assumptions on the context covariance are violated.
Where Pith is reading between the lines
- Similar sample-splitting ideas could be applied to other generalized linear bandit problems where initial estimators are hard to obtain under weak assumptions.
- The data-dependent bound opens the possibility of fully adaptive algorithms that automatically exploit low-dimensional structure without knowing it in advance.
- In practice the approach may reduce the amount of exploration needed in recommendation settings where user features are highly correlated.
Load-bearing premise
A splitting rule exists that simultaneously satisfies the accuracy requirements of the initial-point estimator and the one-step Newton correction without introducing new assumptions on the context process.
What would settle it
A run of the algorithm on a sequence of identical or low-rank context vectors that shows the observed regret remains O(√(dT)) while prior methods either incur linear regret or require an explicit diversity parameter.
Figures
read the original abstract
We study the $K$-armed logistic bandit problem, where at each round, the agent observes $K$ feature vectors associated with $K$ actions. Existing approaches that achieve a rate-optimal $\tilde{\mathcal{O}}(\sqrt{dT})$ regret bound rely heavily on context diversity assumptions, such as strict positivity of the minimum eigenvalue of a context covariance matrix. These assumptions, however, impose strong restrictions on the context process, as they rule out the situation where the context vectors are concentrated in a low-dimensional subspace. In this paper, we propose SupSplitLog, which, to the best of our knowledge, is the first algorithm for logistic bandits that achieves $\tilde{\mathcal{O}}(\sqrt{dT})$ regret without any context diversity assumption. The key idea is to split the collected samples into two disjoint subsets when constructing estimators; one is used to compute an initial-point estimator, while the other is used to apply a Newton-type one-step correction procedure. The splitting rule is carefully designed to balance the accuracy requirements of the initial-point estimator and the one-step correction procedure. Moreover, SupSplitLog strictly improves on the existing algorithms in terms of the dependence on dimension $d$ in the regret upper bound. Furthermore, SupSplitLog can be adapted simply to deduce a regret bound that grows with a data-dependent complexity measure, avoiding a direct dependence on $d$, which is favorable when the context vectors are concentrated in a low-dimensional subspace. We also provide experimental results that demonstrate numerically the superiority of our algorithm, validating the theoretical results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SupSplitLog for the K-armed logistic bandit problem. At each round the agent sees K feature vectors; the algorithm splits collected samples into two disjoint subsets S1 and S2. S1 is used to form an initial MLE, while S2 is used for a one-step Newton correction. The splitting rule is constructed to balance the accuracy requirements of the two estimators. The central claim is that this yields the first Õ(√(dT)) regret bound without any context-diversity (minimum-eigenvalue) assumption on the context process. A data-dependent regret bound that can be smaller than √d when contexts lie in a low-dimensional subspace is also derived, together with supporting experiments.
Significance. If the analysis holds, the result removes a strong and often unrealistic assumption required by all prior Õ(√(dT)) logistic-bandit algorithms. The explicit splitting construction, the strictly better d-dependence, and the data-dependent extension are genuine technical contributions. No machine-checked proofs or open-source code are mentioned, but the algorithm is fully specified and the regret statements are stated as derived guarantees rather than fitted quantities.
major comments (1)
- [§4.2, Theorem 4.1] §4.2 and Theorem 4.1: the proof that the adaptive splitting rule simultaneously guarantees (i) an initial estimator on S1 whose error is small enough for linear convergence of the Newton step and (ii) sufficient |S2| for concentration of the corrected estimator, is not fully explicit in the low-diversity regime. When the Gram matrix on S1 can have arbitrarily small minimum eigenvalue, the current argument only shows that the threshold rule keeps |S1| “large with high probability”; it does not derive a uniform lower bound on |S1| that works for every possible context sequence satisfying the problem statement. This step is load-bearing for the Õ(√(dT)) claim.
minor comments (2)
- [Eq. (5)] Eq. (5): the logistic loss is written with a sign that is inconsistent with the gradient expression two lines later; a quick sign check would remove ambiguity.
- [Figure 3] Figure 3 caption: the legend labels “SupSplitLog (ours)” and “SupSplitLog (data-dep)” are easily confused; consider renaming the second curve.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments on our manuscript. We address the single major comment below and will incorporate the requested clarifications into the revised version.
read point-by-point responses
-
Referee: [§4.2, Theorem 4.1] §4.2 and Theorem 4.1: the proof that the adaptive splitting rule simultaneously guarantees (i) an initial estimator on S1 whose error is small enough for linear convergence of the Newton step and (ii) sufficient |S2| for concentration of the corrected estimator, is not fully explicit in the low-diversity regime. When the Gram matrix on S1 can have arbitrarily small minimum eigenvalue, the current argument only shows that the threshold rule keeps |S1| “large with high probability”; it does not derive a uniform lower bound on |S1| that works for every possible context sequence satisfying the problem statement. This step is load-bearing for the Õ(√(dT)) claim.
Authors: We thank the referee for highlighting the need for greater explicitness in the low-diversity case. The adaptive splitting rule in SupSplitLog allocates samples to S1 until the minimum eigenvalue of the empirical Gram matrix computed on S1 exceeds a data-dependent threshold chosen to guarantee that the initial MLE error is at most O(1/√|S1|), which is the condition required for the subsequent one-step Newton update on S2 to converge linearly. Because the stopping decision for S1 depends solely on the observed context vectors (via the Gram matrix), the resulting |S1| is deterministically at least the smallest integer satisfying the eigenvalue threshold for that particular context sequence. This yields a uniform lower bound on |S1| that holds for every admissible context sequence without invoking any minimum-eigenvalue assumption on the context process. High-probability concentration statements are then applied conditionally on this realized |S1|. We will revise §4.2 and the proof of Theorem 4.1 to state this deterministic lower bound explicitly and to trace how it produces the claimed Õ(√(dT)) regret uniformly over all context sequences. revision: yes
Circularity Check
No circularity; derivation self-contained via explicit algorithm construction and regret analysis
full rationale
The paper defines SupSplitLog via an explicit sample-splitting procedure whose accuracy requirements are balanced by a deterministic rule, then derives the Õ(√(dT)) regret bound from concentration and one-step Newton analysis under only the logistic link and bounded features. No equation reduces a claimed prediction to a fitted parameter or prior self-result by construction; the splitting thresholds are stated as design choices whose simultaneous satisfaction is proved directly rather than assumed. Self-citations, if present, are not load-bearing for the central claim. The derivation therefore stands on its own stated assumptions and does not collapse to tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Arya Akhavan, Amitis Shidani, Alex Ayoub, and David Janz. Bernstein-type dimension-free concentration for self-normalised martingales.arXiv preprint arXiv:2507.20982,
work page internal anchor Pith review arXiv
-
[2]
Neural logistic bandits.arXiv preprint arXiv:2505.02069,
Seoungbin Bae and Dabeen Lee. Neural logistic bandits.arXiv preprint arXiv:2505.02069,
-
[3]
Queue length regret bounds for contextual queueing bandits.arXiv preprint arXiv:2601.19300,
Seoungbin Bae, Garyeong Kang, and Dabeen Lee. Queue length regret bounds for contextual queueing bandits.arXiv preprint arXiv:2601.19300,
work page internal anchor Pith review arXiv
-
[4]
Qiwei Di, Tao Jin, Yue Wu, Heyang Zhao, Farzad Farnoud, and Quanquan Gu. Variance-aware regret bounds for stochastic contextual dueling bandits.arXiv preprint arXiv:2310.00968,
-
[5]
URLhttps://openreview.net/forum?id=94y5XfiJ7N
ISSN 2835-8856. URLhttps://openreview.net/forum?id=94y5XfiJ7N. Tanmay Goyal and Gaurav Sinha. Efficient algorithms for logistic contextual slate bandits with bandit feedback.arXiv preprint arXiv:2506.13163,
work page internal anchor Pith review arXiv
-
[6]
URL https: //proceedings.mlr.press/v5/kanade09a.html
Yue Kang, Mingshuo Liu, Bongsoo Yi, Jing Lyu, Zhi Zhang, Doudou Zhou, and Yao Li. Single index bandits: Generalized linear contextual bandits with unknown reward functions.arXiv preprint arXiv:2506.12751,
-
[7]
Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun
URL https: //openreview.net/forum?id=FXQ09DpwXt. Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. Improved regret bounds of (multinomial) logistic bandits via regret-to-confidence-set conversion. InInternational Conference on Artificial Intelligence and Statistics, pages 4474–4482. PMLR, 2024a. Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified conf...
2071
-
[8]
Alberto Maria Metelli, Simone Drago, and Marco Mussi. Generalized kernelized bandits: A novel self-normalized bernstein-like dimension-free inequality and regret bounds.arXiv preprint arXiv:2508.01681,
-
[9]
Finite-time analysis of kernelised contextual bandits.arXiv preprint arXiv:1309.6869,
Michal Valko, Nathaniel Korda, Rémi Munos, Ilias Flaounas, and Nelo Cristianini. Finite-time analysis of kernelised contextual bandits.arXiv preprint arXiv:1309.6869,
-
[10]
Yu-Jie Zhang, Sheng-An Xu, Peng Zhao, and Masashi Sugiyama. Generalized linear bandits: Almost optimal regret with one-pass update.arXiv preprint arXiv:2507.11847,
-
[11]
15 A Additional Related Work Logistic bandits naturally model sequential decision-making problems with binary feedback, and have therefore been studied in a range of application domains. Recent examples include queueing and service systems (Kim and Oh, 2024; Fu et al., 2024; Bae et al., 2026), as well as preference-based and pairwise-feedback settings (Ve...
2024
-
[12]
and its refinements (Valko et al., 2013; Chu et al., 2011; Li et al., 2019,
2013
-
[13]
For theK-armed logistic bandit setting, Li et al
provide the basic independent-sample construction on which later finite-armed logistic bandit algorithms are built. For theK-armed logistic bandit setting, Li et al. (2017) came up with an algorithm with a regret bound of order˜O( √ dT). Later, Jun et al. (2021) refined the algorithm and improved the dependence onκon the regret bound. More recently, Kim e...
2017
-
[14]
C Experimental details ThissectionprovidesimplementationdetailsforthesyntheticexperimentinFigure1
Substituting termA 1, termA 2, and termA3 back completes the proof. C Experimental details ThissectionprovidesimplementationdetailsforthesyntheticexperimentinFigure1. Allexperiments were conducted on a server equipped with an AMD EPYC 9354 32-Core Processor, 251 GiB of RAM, and one NVIDIA RTX A6000 GPU. 20 Environment.We use a synthetic logistic bandit en...
2000
-
[15]
All experiments are repeated over10 independent runs, and the plots report the mean performance with±1σerror bars. Baseline algorithms.We compare the refined version of our method in Algorithm 2 with three baselines: SupCB-GLM (Li et al., 2017), SupLogistic (Jun et al., 2021), and DDRTS-GLM (Kim et al., 2023). During implementation, Algorithm 2 maintains ...
2017
-
[16]
Therefore, we have⟨∇L(¯θ)⟩θ−¯θ≥0for allθ∈Θ
Proof.L (θ)is a strongly convex, differentiable function and¯θminimizesL(θ)over a convex setΘ. Therefore, we have⟨∇L(¯θ)⟩θ−¯θ≥0for allθ∈Θ. Substituting ∇L(¯θ) = ∑ i∈P ( µ(xT i,ai ¯θ)−ri ) xi,ai +λ¯θ,θ←θ∗∈Θ, we have − ⟨∑ i∈P ( µ(xT i,ai ¯θ)−ri ) xi,ai ⟩ ¯θ−θ∗≥λ ⣨ ¯θ ⟩ ¯θ−θ∗ =⇒ − ⟨∑ i∈P [( µ(xT i,ai ¯θ)−µ(xT i,aiθ∗) ) − ( ri−µ(xT i,aiθ∗) )] xi,ai ⟩ ¯θ−θ∗≥λ ...
2011
-
[17]
Denote ˆa= maxa∈A(s′)m(s′) t,a
On the good eventE, ⏐⏐⏐m(s′) t,a∗−x⊤ t,a∗θ∗ ⏐⏐⏐≤w(s′) t,a∗≤2−s′ , where the first inequality follows from Theorem 13, and the second inequality holds since the algorithm reach levels′+ 1implies that the condition (a) in level s′is not satisfied. Denote ˆa= maxa∈A(s′)m(s′) t,a . Then, m(s′) t,a∗≥x⊤ t,a∗θ∗−2−s′ ≥x⊤ t,ˆaθ∗−2−s′ ≥m(s′) t,ˆa−2·2−s′ =⇒m(s′) t,a...
2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.