Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback

Fang Kong; Jiayi Shen; Jinhang Zuo; Qirun Zeng; Xuchuang Wang; Xutong Liu

arxiv: 2605.05745 · v1 · submitted 2026-05-07 · 💻 cs.AI

Best Arm Identification in Generalized Linear Bandits via Hybrid Feedback

Qirun Zeng , Xuchuang Wang , Jiayi Shen , Xutong Liu , Fang Kong , Jinhang Zuo This is my paper

Pith reviewed 2026-05-08 11:29 UTC · model grok-4.3

classification 💻 cs.AI

keywords best arm identificationgeneralized linear banditshybrid feedbacktrack-and-stopconfidence sequencesself-concordancedueling bandits

0 comments

The pith

A likelihood-ratio confidence sequence unifies absolute and relative feedback for best arm identification in generalized linear bandits.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops methods for identifying the best arm when feedback can come either as a direct reward from one arm or as a comparison between two arms, both following generalized linear models. It constructs a single likelihood-ratio confidence sequence that handles both observation types together and produces an explicit ellipsoidal set for the unknown parameter vector once the link function meets a self-concordance condition. This set supports a hybrid Track-and-Stop algorithm that adaptively chooses the next query type by tracking a minimax-optimal design over the combined space of single arms and pairs. The algorithm is shown to stop with the correct answer with probability at least 1-δ and to have explicit high-probability bounds on the total number of queries. A cost-aware version further accounts for different costs of the two feedback modalities, yielding better sample use than single-feedback baselines.

Core claim

The central claim is that a likelihood-ratio-based confidence sequence unifies heterogeneous generalized linear observations from absolute rewards and relative comparisons, producing an explicit ellipsoidal confidence set under the self-concordance assumption on the link function. This sequence supports a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking the minimax-optimal design over the joint action space of arms and pairs, establishing δ-correctness together with high-probability upper bounds on the stopping time, and extends directly to a cost-aware setting with heterogeneous acquisition costs.

What carries the argument

The likelihood-ratio-based confidence sequence that unifies absolute and relative generalized linear feedback into an explicit ellipsoidal confidence set under self-concordance.

If this is right

The algorithm stops with the correct best arm with probability at least 1-δ.
High-probability upper bounds hold on the total number of queries required.
The framework extends to a cost-aware variant that optimizes under different costs for reward and comparison queries.
Empirical tests show improved sample efficiency over non-hybrid baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unification approach could be tested on other mixed-observation models if the ellipsoidal form is relaxed.
Practitioners might use the cost-aware version to decide dynamically between cheap comparisons and costly direct measurements based on their relative information value.

Load-bearing premise

The link function must satisfy self-concordance so that the likelihood-ratio sequence produces an explicit ellipsoidal confidence set.

What would settle it

Generate data from a generalized linear bandit whose link function violates self-concordance and check whether the proposed confidence sets still contain the true parameter with the stated probability.

Figures

Figures reproduced from arXiv: 2605.05745 by Fang Kong, Jiayi Shen, Jinhang Zuo, Qirun Zeng, Xuchuang Wang, Xutong Liu.

**Figure 1.** Figure 1: Experimental results under different settings view at source ↗

**Figure 2.** Figure 2: Additional experiments. Basis-plus-rotated-arm construction. In view at source ↗

read the original abstract

We study fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model: at each round, the learner may query either (i) absolute reward feedback from a single arm or (ii) relative (dueling) feedback from an arm pair, both governed by generalized linear models. We introduce a likelihood-ratio--based confidence sequence that unifies heterogeneous generalized linear observations and yields an explicit ellipsoidal confidence set under a self-concordance assumption. Building on this confidence set, we propose a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking a minimax-optimal design over a joint action space of arms and pairs. We establish $\delta$-correctness and provide high-probability upper bounds on the stopping time. We further extend the framework to a cost-aware setting that accounts for heterogeneous acquisition costs across feedback modalities. Empirical experiments demonstrate that the proposed algorithms significantly improve sample efficiency over baseline methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper cleanly unifies reward and dueling feedback in GLM best-arm ID via a likelihood-ratio confidence sequence and joint Track-and-Stop, but the closed-form ellipsoid needs self-concordance.

read the letter

The main takeaway is that you can now run fixed-confidence best-arm identification in a generalized linear bandit while freely mixing single-arm reward pulls with arm-pair duels, all under one model. They build a likelihood-ratio confidence sequence that covers both observation types and, when the link function is self-concordant, yields an explicit ellipsoidal set. From that set they run a hybrid Track-and-Stop algorithm that tracks a minimax design over the combined space of arms and pairs, plus a cost-aware version that charges different prices for the two query types. The δ-correctness and high-probability stopping-time bounds follow from the usual change-of-measure arguments once the set is in hand. The self-concordance assumption is stated up front, so there is no hidden circularity in the derivation. This combination of hybrid feedback, unified sequence, and joint-action design does not appear in the cited prior work on pure-reward or pure-dueling GLM bandits, so the technical step is real. The cost-aware extension is a straightforward but useful addition for applications where one modality is cheaper. The experiments are reported to beat baselines on sample efficiency, which is the expected outcome when you add the cheaper query type. The soft spot is the self-concordance requirement itself. It buys the nice ellipsoid, but it excludes some link functions that might otherwise be natural; without it the set loses its convenient shape and the analysis would need a different argument. The stopping-time bounds are also the usual high-probability upper bounds rather than tight constants, though the paper does not overclaim on that point. This work is aimed at people already comfortable with GLM bandits and Track-and-Stop who want to add dueling options without starting over. A reader working on sequential testing with mixed signals will get usable algorithms and bounds they can adapt. It is coherent enough on its own terms to deserve a serious referee.

Referee Report

2 major / 3 minor

Summary. The manuscript studies fixed-confidence best arm identification in generalized linear bandits under a hybrid feedback model, where the learner can query either absolute reward feedback from a single arm or relative dueling feedback from an arm pair, both modeled via generalized linear models. It introduces a likelihood-ratio-based confidence sequence that unifies these heterogeneous observations and, under an explicit self-concordance assumption on the link function, produces a closed-form ellipsoidal confidence set. Building on this, the authors propose a hybrid Track-and-Stop algorithm that adaptively allocates queries by tracking a minimax-optimal design over the joint space of arms and pairs. They prove δ-correctness together with high-probability upper bounds on the stopping time, extend the framework to a cost-aware variant with heterogeneous acquisition costs, and report empirical improvements in sample efficiency over baselines.

Significance. If the derivations hold, the work provides a practically relevant unification of absolute and relative feedback in generalized linear best-arm identification, with the likelihood-ratio construction and resulting explicit ellipsoidal set constituting a clear technical strength. The high-probability stopping-time bounds and the cost-aware extension are valuable for deployment, and the empirical results supply concrete evidence of improved efficiency. The explicit handling of the joint action space in the Track-and-Stop design is a further positive feature.

major comments (2)

[§3] §3 (likelihood-ratio confidence sequence): the passage from the general likelihood-ratio martingale to the explicit ellipsoidal set is load-bearing for all subsequent guarantees; the self-concordance assumption is stated, but the manuscript should make explicit whether the resulting ellipsoid is tight or contains a hidden multiplicative looseness that would propagate into the stopping-time bound.
[§4.2] §4.2 (hybrid Track-and-Stop analysis): the minimax design is defined over the joint arm/pair action space and is claimed to be tracked independently of the data; a concrete verification that the change-of-measure argument continues to yield the stated high-probability stopping-time bound after this extension would strengthen the central claim.

minor comments (3)

The notation distinguishing single-arm actions from pair actions in the joint design could be made more uniform across the algorithm pseudocode and the theoretical statements.
A short discussion of how restrictive the self-concordance assumption is for standard link functions (logistic, probit) would help readers assess applicability.
The empirical section would benefit from reporting the empirical coverage of the ellipsoidal sets in addition to regret or sample-complexity curves.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses

Referee: [§3] §3 (likelihood-ratio confidence sequence): the passage from the general likelihood-ratio martingale to the explicit ellipsoidal set is load-bearing for all subsequent guarantees; the self-concordance assumption is stated, but the manuscript should make explicit whether the resulting ellipsoid is tight or contains a hidden multiplicative looseness that would propagate into the stopping-time bound.

Authors: We agree that explicit clarification is warranted. Under the self-concordance assumption the ellipsoidal set is obtained directly from the likelihood-ratio martingale without additional multiplicative looseness; the self-concordance parameter enters only through the radius, which is already accounted for in the stopping-time analysis. In the revision we will add a short remark in §3 making this tightness explicit and confirming that no further inflation occurs in the high-probability bounds. revision: yes
Referee: [§4.2] §4.2 (hybrid Track-and-Stop analysis): the minimax design is defined over the joint arm/pair action space and is claimed to be tracked independently of the data; a concrete verification that the change-of-measure argument continues to yield the stated high-probability stopping-time bound after this extension would strengthen the central claim.

Authors: We thank the referee for highlighting this point. Because the minimax design is computed once over the fixed joint action space and remains data-independent, the standard change-of-measure argument extends directly. In the revised §4.2 we will insert a concise, step-by-step verification of the change-of-measure that confirms the high-probability stopping-time bound holds unchanged for the hybrid setting. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central construction begins with a likelihood-ratio confidence sequence derived directly from the heterogeneous GLM observations (absolute and dueling feedback), which under the explicitly stated self-concordance assumption on the link function produces an ellipsoidal set. The hybrid Track-and-Stop algorithm then allocates queries by tracking a minimax-optimal design over the joint arm/pair action space, with δ-correctness and stopping-time bounds obtained via standard change-of-measure arguments adapted to the hybrid setting. No step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz imported without independent justification; the self-concordance condition is an external modeling assumption rather than a fitted or self-referential quantity, and the minimax design is defined independently of the realized data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the self-concordance assumption for the GLM link function and on standard concentration properties of likelihood ratios for generalized linear models. No free parameters or new invented entities are introduced in the abstract.

axioms (1)

domain assumption The link function of the generalized linear model is self-concordant.
Invoked to obtain an explicit ellipsoidal confidence set from the likelihood-ratio sequence.

pith-pipeline@v0.9.0 · 5466 in / 1326 out tokens · 72560 ms · 2026-05-08T11:29:27.929226+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Best arm identification in multi-armed bandits

Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010,

work page 2010
[2]

Confidence sequences for generalized linear models via regret analysis

Eugenio Clerico, Hamish Flynn, Gergely Neu, et al. Confidence sequences for generalized linear models via regret analysis. arXiv preprint arXiv:2504.16555,

work page arXiv
[3]

Confidence estimation via sequential likelihood mixing

Johannes Kirschner, Andreas Krause, Michele Meziu, and Mojmir Mutny. Confidence estimation via sequential likelihood mixing. arXiv preprint arXiv:2502.14689,

work page arXiv
[4]

A unified confidence sequence for generalized linear models, with applications to bandits

Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits. arXiv preprint arXiv:2407.13977,

work page arXiv
[5]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ 9adeb82fffb5444e81fa0ce8ad8afe7a-Paper.pdf. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information...

work page 2012
[6]

Near optimal pure exploration in logistic bandits

Eduardo Ochoa Rivera and Ambuj Tewari. Near optimal pure exploration in logistic bandits. arXiv preprint arXiv:2410.20640,

work page arXiv
[7]

Fusing reward and dueling feedback in stochastic bandits

Xuchuang Wang, Qirun Zeng, Jinhang Zuo, Xutong Liu, Mohammad Hajiesmaili, John Lui, and Adam Wierman. Fusing reward and dueling feedback in stochastic bandits. arXiv preprint arXiv:2504.15812,

work page arXiv
[8]

Conversational contextual bandit: Algorithm and application

Xiaoying Zhang, Hong Xie, Hang Li, and John CS Lui. Conversational contextual bandit: Algorithm and application. In Proceedings of the web conference 2020, pages 662–672,

work page 2020
[9]

3 1.2 Related Work

11 Appendix Contents 1 Introduction 2 1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Problem Formulation 4 3 Confidence Sequences for Hybrid GLB 5 3.1 Maximum Likelihood Estimation . . . . . . . ...

work page 2024
[10]

Lemma A.2

Therefore, Lt(θ) − Lt( ˆθt) ≥ ∥θ − ˆθt∥2 ˜Gc t(ˆθt,θ) + ∥θ − ˆθt∥2 ˜Gd t (ˆθt,θ). Lemma A.2. For t ≥ 1, consider a hybrid cumulative loss Lt(θ) = Lc t(θ) + Ld t (θ), with Hc t (θ) ≜ ∇2Lc t(θ) and Hd t (θ) ≜ ∇2Ld t (θ). Let ˆθt ∈ arg minθ∈Θ Lt(θ). Then for all θ ∈ Θ, Lt(θ) − Lt( ˆθt) ≥ ∥θ − ˆθt∥2 At . 13 Proof of Lemma A.2. Combining the above inequalities...

work page 2020
[11]

Then ϕ⋆(·) is continuous at θ⋆, and W ⋆(θ⋆) is convex, compact, and nonempty

Let ϕ⋆(θ⋆) ≜ min w∈∆A ϕ(w, θ⋆), W ⋆(θ⋆) ≜ argmin w∈∆A ϕ(w, θ⋆). Then ϕ⋆(·) is continuous at θ⋆, and W ⋆(θ⋆) is convex, compact, and nonempty. Lemma A.6 (Deterministic tracking error bound; adapted from Lemma 6 of Jedra and Proutiere [2020]). Let (ws)s≥1 be any sequence of distributions on A. Let T ⊂ N be the set of tracking rounds and define nt ≜ |T ∩ { 1...

work page 2020
[12]

Since Bt ≥ cmint, the logarithmic term in βt(δ) can be bounded by a logarithm of Bt up to constants depending only on the cost vector. Using βt(δ) ≤ c0 log(1/δ) + c1d log(1 + t) and the same fixed-point argument as in the proof of Theorem 4.5, it is enough that Bt > CT ⋆ cost(θ⋆) log 1 δ + d log T ⋆ cost(θ⋆) + d log log e δ + O(1), for a constant C > 0 de...

work page 2016
[13]

also studies best-arm identification with both reward and comparison feedback, but under a more specialized logistic structure. Their analysis treats the feedback through a 18 logistic-bandit model and develops curvature-aware confidence bounds to control the difficulty caused by vanishing Fisher information. Our formulation is broader in two respects. Fi...

work page 2021

[1] [1]

Best arm identification in multi-armed bandits

Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010,

work page 2010

[2] [2]

Confidence sequences for generalized linear models via regret analysis

Eugenio Clerico, Hamish Flynn, Gergely Neu, et al. Confidence sequences for generalized linear models via regret analysis. arXiv preprint arXiv:2504.16555,

work page arXiv

[3] [3]

Confidence estimation via sequential likelihood mixing

Johannes Kirschner, Andreas Krause, Michele Meziu, and Mojmir Mutny. Confidence estimation via sequential likelihood mixing. arXiv preprint arXiv:2502.14689,

work page arXiv

[4] [4]

A unified confidence sequence for generalized linear models, with applications to bandits

Junghyun Lee, Se-Young Yun, and Kwang-Sung Jun. A unified confidence sequence for generalized linear models, with applications to bandits. arXiv preprint arXiv:2407.13977,

work page arXiv

[5] [5]

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

URL https://proceedings.neurips.cc/paper_files/paper/2012/file/ 9adeb82fffb5444e81fa0ce8ad8afe7a-Paper.pdf. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information...

work page 2012

[6] [6]

Near optimal pure exploration in logistic bandits

Eduardo Ochoa Rivera and Ambuj Tewari. Near optimal pure exploration in logistic bandits. arXiv preprint arXiv:2410.20640,

work page arXiv

[7] [7]

Fusing reward and dueling feedback in stochastic bandits

Xuchuang Wang, Qirun Zeng, Jinhang Zuo, Xutong Liu, Mohammad Hajiesmaili, John Lui, and Adam Wierman. Fusing reward and dueling feedback in stochastic bandits. arXiv preprint arXiv:2504.15812,

work page arXiv

[8] [8]

Conversational contextual bandit: Algorithm and application

Xiaoying Zhang, Hong Xie, Hang Li, and John CS Lui. Conversational contextual bandit: Algorithm and application. In Proceedings of the web conference 2020, pages 662–672,

work page 2020

[9] [9]

3 1.2 Related Work

11 Appendix Contents 1 Introduction 2 1.1 Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Problem Formulation 4 3 Confidence Sequences for Hybrid GLB 5 3.1 Maximum Likelihood Estimation . . . . . . . ...

work page 2024

[10] [10]

Lemma A.2

Therefore, Lt(θ) − Lt( ˆθt) ≥ ∥θ − ˆθt∥2 ˜Gc t(ˆθt,θ) + ∥θ − ˆθt∥2 ˜Gd t (ˆθt,θ). Lemma A.2. For t ≥ 1, consider a hybrid cumulative loss Lt(θ) = Lc t(θ) + Ld t (θ), with Hc t (θ) ≜ ∇2Lc t(θ) and Hd t (θ) ≜ ∇2Ld t (θ). Let ˆθt ∈ arg minθ∈Θ Lt(θ). Then for all θ ∈ Θ, Lt(θ) − Lt( ˆθt) ≥ ∥θ − ˆθt∥2 At . 13 Proof of Lemma A.2. Combining the above inequalities...

work page 2020

[11] [11]

Then ϕ⋆(·) is continuous at θ⋆, and W ⋆(θ⋆) is convex, compact, and nonempty

Let ϕ⋆(θ⋆) ≜ min w∈∆A ϕ(w, θ⋆), W ⋆(θ⋆) ≜ argmin w∈∆A ϕ(w, θ⋆). Then ϕ⋆(·) is continuous at θ⋆, and W ⋆(θ⋆) is convex, compact, and nonempty. Lemma A.6 (Deterministic tracking error bound; adapted from Lemma 6 of Jedra and Proutiere [2020]). Let (ws)s≥1 be any sequence of distributions on A. Let T ⊂ N be the set of tracking rounds and define nt ≜ |T ∩ { 1...

work page 2020

[12] [12]

Since Bt ≥ cmint, the logarithmic term in βt(δ) can be bounded by a logarithm of Bt up to constants depending only on the cost vector. Using βt(δ) ≤ c0 log(1/δ) + c1d log(1 + t) and the same fixed-point argument as in the proof of Theorem 4.5, it is enough that Bt > CT ⋆ cost(θ⋆) log 1 δ + d log T ⋆ cost(θ⋆) + d log log e δ + O(1), for a constant C > 0 de...

work page 2016

[13] [13]

also studies best-arm identification with both reward and comparison feedback, but under a more specialized logistic structure. Their analysis treats the feedback through a 18 logistic-bandit model and develops curvature-aware confidence bounds to control the difficulty caused by vanishing Fisher information. Our formulation is broader in two respects. Fi...

work page 2021