Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback
Pith reviewed 2026-05-08 11:29 UTC · model grok-4.3
The pith
A likelihood-ratio confidence sequence unifies absolute and relative feedback for best arm identification in generalized linear bandits.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a likelihood-ratio-based confidence sequence unifies heterogeneous generalized linear observations from absolute rewards and relative comparisons, producing an explicit ellipsoidal confidence set under the self-concordance assumption on the link function. This sequence supports a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking the minimax-optimal design over the joint action space of arms and pairs, establishing δ-correctness together with high-probability upper bounds on the stopping time, and extends directly to a cost-aware setting with heterogeneous acquisition costs.
What carries the argument
The likelihood-ratio-based confidence sequence that unifies absolute and relative generalized linear feedback into an explicit ellipsoidal confidence set under self-concordance.
If this is right
- The algorithm stops with the correct best arm with probability at least 1-δ.
- High-probability upper bounds hold on the total number of queries required.
- The framework extends to a cost-aware variant that optimizes under different costs for reward and comparison queries.
- Empirical tests show improved sample efficiency over non-hybrid baselines.
Where Pith is reading between the lines
- The same unification approach could be tested on other mixed-observation models if the ellipsoidal form is relaxed.
- Practitioners might use the cost-aware version to decide dynamically between cheap comparisons and costly direct measurements based on their relative information value.
Load-bearing premise
The link function must satisfy self-concordance so that the likelihood-ratio sequence produces an explicit ellipsoidal confidence set.
What would settle it
Generate data from a generalized linear bandit whose link function violates self-concordance and check whether the proposed confidence sets still contain the true parameter with the stated probability.
Figures
read the original abstract
We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) feedback from an arm pair, both governed by generalized linear models. We introduce a likelihood-ratio--based confidence sequence that unifies heterogeneous generalized linear observations and yields an explicit ellipsoidal confidence set under a self-concordance assumption. Building on this confidence set, we propose a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking a minimax-optimal design over a joint action space of arms and pairs. We establish $\delta$-correctness and provide high-probability upper bounds on the stopping time. We further extend the framework to a cost-aware setting that accounts for heterogeneous acquisition costs across feedback modalities. Empirical experiments demonstrate that the proposed algorithms significantly improve sample efficiency over baseline methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model, where the learner can query either absolute reward feedback from a single arm or relative dueling feedback from an arm pair, both modeled via generalized linear models. It introduces a likelihood-ratio-based confidence sequence that unifies these heterogeneous observations and, under an explicit self-concordance assumption on the link function, produces a closed-form ellipsoidal confidence set. Building on this, the authors propose a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking a minimax-optimal design over the joint space of arms and pairs. They prove δ-correctness together with high-probability upper bounds on the stopping time, extend the framework to a cost-aware variant with heterogeneous acquisition costs, and report empirical improvements in sample efficiency over baselines.
Significance. If the derivations hold, the work provides a practically relevant unification of absolute and relative feedback in generalized linear best-arm identification, with the likelihood-ratio construction and resulting explicit ellipsoidal set constituting a clear technical strength. The high-probability stopping-time bounds and the cost-aware extension are valuable for deployment, and the empirical results supply concrete evidence of improved efficiency. The explicit handling of the joint action space in the Track-and-Stop design is a further positive feature.
major comments (2)
- [§3] §3 (likelihood-ratio confidence sequence): the passage from the general likelihood-ratio martingale to the explicit ellipsoidal set is load-bearing for all subsequent guarantees; the self-concordance assumption is stated, but the manuscript should make explicit whether the resulting ellipsoid is tight or contains a hidden multiplicative looseness that would propagate into the stopping-time bound.
- [§4.2] §4.2 (hybrid Track-and-Stop analysis): the minimax design is defined over the joint arm/pair action space and is claimed to be tracked independently of the data; a concrete verification that the change-of-measure argument continues to yield the stated high-probability stopping-time bound after this extension would strengthen the central claim.
minor comments (3)
- The notation distinguishing single-arm actions from pair actions in the joint design could be made more uniform across the algorithm pseudocode and the theoretical statements.
- A short discussion of how restrictive the self-concordance assumption is for standard link functions (logistic, probit) would help readers assess applicability.
- The empirical section would benefit from reporting the empirical coverage of the ellipsoidal sets in addition to regret or sample-complexity curves.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.
read point-by-point responses
-
Referee: [§3] §3 (likelihood-ratio confidence sequence): the passage from the general likelihood-ratio martingale to the explicit ellipsoidal set is load-bearing for all subsequent guarantees; the self-concordance assumption is stated, but the manuscript should make explicit whether the resulting ellipsoid is tight or contains a hidden multiplicative looseness that would propagate into the stopping-time bound.
Authors: We agree that explicit clarification is warranted. Under the self-concordance assumption the ellipsoidal set is obtained directly from the likelihood-ratio martingale without additional multiplicative looseness; the self-concordance parameter enters only through the radius, which is already accounted for in the stopping-time analysis. In the revision we will add a short remark in §3 making this tightness explicit and confirming that no further inflation occurs in the high-probability bounds. revision: yes
-
Referee: [§4.2] §4.2 (hybrid Track-and-Stop analysis): the minimax design is defined over the joint arm/pair action space and is claimed to be tracked independently of the data; a concrete verification that the change-of-measure argument continues to yield the stated high-probability stopping-time bound after this extension would strengthen the central claim.
Authors: We thank the referee for highlighting this point. Because the minimax design is computed once over the fixed joint action space and remains data-independent, the standard change-of-measure argument extends directly. In the revised §4.2 we will insert a concise, step-by-step verification of the change-of-measure that confirms the high-probability stopping-time bound holds unchanged for the hybrid setting. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's central construction begins with a likelihood-ratio confidence sequence derived directly from the heterogeneous GLM observations (absolute and dueling feedback), which under the explicitly stated self-concordance assumption on the link function produces an ellipsoidal set. The hybrid Track-and-Stop algorithm then allocates queries by tracking a minimax-optimal design over the joint arm/pair action space, with δ-correctness and stopping-time bounds obtained via standard change-of-measure arguments adapted to the hybrid setting. No step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz imported without independent justification; the self-concordance condition is an external modeling assumption rather than a fitted or self-referential quantity, and the minimax design is defined independently of the realized data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The link function of the generalized linear model is self-concordant.
Reference graph
Works this paper leans on
-
[1]
Best arm identification in multi-armed bandits
Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010,
work page 2010
-
[2]
Confidence sequences for generalized linear models via regret analysis
Eugenio Clerico, Hamish Flynn, Gergely Neu, et al. Confidence sequences for generalized linear models via regret analysis. arXiv preprint arXiv:2504.16555,
-
[3]
Confidence estimation via sequential likelihood mixing
Johannes Kirschner, Andreas Krause, Michele Meziu, and Mojmir Mutny. Confidence estimation via sequential likelihood mixing. arXiv preprint arXiv:2502.14689,
-
[4]
A unified confidence sequence for generalized linear models, with applications to bandits
Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits. arXiv preprint arXiv:2407.13977,
-
[5]
URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ 9adeb82fffb5444e81fa0ce8ad8afe7a-Paper.pdf. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information...
work page 2012
-
[6]
Near optimal pure exploration in logistic bandits
Eduardo Ochoa Rivera and Ambuj Tewari. Near optimal pure exploration in logistic bandits. arXiv preprint arXiv:2410.20640,
-
[7]
Fusing reward and dueling feedback in stochastic bandits
Xuchuang Wang, Qirun Zeng, Jinhang Zuo, Xutong Liu, Mohammad Hajiesmaili, John Lui, and Adam Wierman. Fusing reward and dueling feedback in stochastic bandits. arXiv preprint arXiv:2504.15812,
-
[8]
Conversational contextual bandit: Algorithm and application
Xiaoying Zhang, Hong Xie, Hang Li, and John CS Lui. Conversational contextual bandit: Algorithm and application. In Proceedings of the web conference 2020, pages 662–672,
work page 2020
-
[9]
11 Appendix Contents 1 Introduction 2 1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Problem Formulation 4 3 Confidence Sequences for Hybrid GLB 5 3.1 Maximum Likelihood Estimation . . . . . . . ...
work page 2024
-
[10]
Therefore, Lt(θ) − Lt( ˆθt) ≥ ∥θ − ˆθt∥2 ˜Gc t(ˆθt,θ) + ∥θ − ˆθt∥2 ˜Gd t (ˆθt,θ). Lemma A.2. For t ≥ 1, consider a hybrid cumulative loss Lt(θ) = Lc t(θ) + Ld t (θ), with Hc t (θ) ≜ ∇2Lc t(θ) and Hd t (θ) ≜ ∇2Ld t (θ). Let ˆθt ∈ arg minθ∈Θ Lt(θ). Then for all θ ∈ Θ, Lt(θ) − Lt( ˆθt) ≥ ∥θ − ˆθt∥2 At . 13 Proof of Lemma A.2. Combining the above inequalities...
work page 2020
-
[11]
Then ϕ⋆(·) is continuous at θ⋆, and W ⋆(θ⋆) is convex, compact, and nonempty
Let ϕ⋆(θ⋆) ≜ min w∈∆A ϕ(w, θ⋆), W ⋆(θ⋆) ≜ argmin w∈∆A ϕ(w, θ⋆). Then ϕ⋆(·) is continuous at θ⋆, and W ⋆(θ⋆) is convex, compact, and nonempty. Lemma A.6 (Deterministic tracking error bound; adapted from Lemma 6 of Jedra and Proutiere [2020]). Let (ws)s≥1 be any sequence of distributions on A. Let T ⊂ N be the set of tracking rounds and define nt ≜ |T ∩ { 1...
work page 2020
-
[12]
Since Bt ≥ cmint, the logarithmic term in βt(δ) can be bounded by a logarithm of Bt up to constants depending only on the cost vector. Using βt(δ) ≤ c0 log(1/δ) + c1d log(1 + t) and the same fixed-point argument as in the proof of Theorem 4.5, it is enough that Bt > CT ⋆ cost(θ⋆) log 1 δ + d log T ⋆ cost(θ⋆) + d log log e δ + O(1), for a constant C > 0 de...
work page 2016
-
[13]
also studies best-arm identification with both reward and comparison feedback, but under a more specialized logistic structure. Their analysis treats the feedback through a 18 logistic-bandit model and develops curvature-aware confidence bounds to control the difficulty caused by vanishing Fisher information. Our formulation is broader in two respects. Fi...
work page 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.