Sequential Experimental Design for Transductive Linear Bandits

Kevin Jamieson; Lalit Jain; Lillian Ratliff; Tanner Fiez

arxiv: 1906.08399 · v1 · pith:MO4H5QOOnew · submitted 2019-06-20 · 📊 stat.ML · cs.LG

Sequential Experimental Design for Transductive Linear Bandits

Tanner Fiez , Lalit Jain , Kevin Jamieson , Lillian Ratliff This is my paper

Pith reviewed 2026-05-25 19:44 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords transductive linear banditslinear banditscombinatorial banditssequential experimental designinstance-dependent boundsinformation-theoretic lower boundsbandit algorithms

0 comments

The pith

An algorithm for the transductive linear bandit problem nearly matches the information-theoretic lower bound on queries needed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a transductive linear bandit setting in which the learner must identify the maximizing vector from a target set Z by making sequential noisy inner-product queries from a possibly different measurement set X. It derives instance-dependent lower bounds on the number of queries required to succeed with fixed probability and gives an algorithm whose query count matches those bounds up to logarithmic factors. The construction specializes to ordinary linear bandits when X equals Z and supplies the first non-asymptotic procedure for that case that comes close to the lower bound. A reader would care because the setting captures real constraints in which allowable measurements differ from the items whose quality must be ranked.

Core claim

We introduce the transductive linear bandit problem and supply instance-dependent lower bounds together with an algorithm that matches the bounds up to logarithmic factors; when X coincides with Z the same procedure is the first non-asymptotic linear-bandit algorithm whose sample complexity nearly attains the information-theoretic lower bound.

What carries the argument

A sequential algorithm that selects queries from the known finite set X, using precomputed instance-dependent quantities derived from the geometry of X and Z, to certify the argmax element of Z with probability 1-delta.

If this is right

When X equals Z the procedure gives a nearly optimal solution for ordinary linear bandits.
The same bounds and algorithm cover combinatorial bandits when X consists of standard basis vectors and Z consists of binary vectors.
The query count adapts to the particular geometry of X and Z, yielding fewer measurements on easier instances.
The approach directly supports experimental design tasks in which the affordable measurements differ from the items whose ranking is required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The finite-set assumption suggests that discretizations of continuous measurement and target spaces could inherit similar guarantees.
In applications such as drug screening the separation of X and Z could translate into concrete reductions in laboratory tests while still ranking in-vivo candidates.
The instance-dependent analysis points toward possible extensions that update bounds online when X or Z are revealed incrementally.

Load-bearing premise

The sets X and Z are finite and known in advance, allowing the algorithm to plan queries and compute instance-dependent bounds from them.

What would settle it

On a concrete finite instance with known theta-star, run the algorithm and check whether the number of queries needed to reach success probability 1-delta exceeds the paper's lower bound by more than a logarithmic factor.

Figures

Figures reproduced from arXiv: 1906.08399 by Kevin Jamieson, Lalit Jain, Lillian Ratliff, Tanner Fiez.

**Figure 1.** Figure 1: Simulation Results simulation are depicted in Fig. 1d. The RAGE algorithm significantly outperforms the static allocation. 6 Conclusion In this paper we have proposed the problem of best-arm identification for transductive linear bandits, provided an algorithm, and matching upper and lower bounds. As a remark it is straightforward to exit our algorithm early with an ε-good arm. It still remains to develop … view at source ↗

read the original abstract

In this paper we introduce the transductive linear bandit problem: given a set of measurement vectors $\mathcal{X}\subset \mathbb{R}^d$, a set of items $\mathcal{Z}\subset \mathbb{R}^d$, a fixed confidence $\delta$, and an unknown vector $\theta^{\ast}\in \mathbb{R}^d$, the goal is to infer $\text{argmax}_{z\in \mathcal{Z}} z^\top\theta^\ast$ with probability $1-\delta$ by making as few sequentially chosen noisy measurements of the form $x^\top\theta^{\ast}$ as possible. When $\mathcal{X}=\mathcal{Z}$, this setting generalizes linear bandits, and when $\mathcal{X}$ is the standard basis vectors and $\mathcal{Z}\subset \{0,1\}^d$, combinatorial bandits. Such a transductive setting naturally arises when the set of measurement vectors is limited due to factors such as availability or cost. As an example, in drug discovery the compounds and dosages $\mathcal{X}$ a practitioner may be willing to evaluate in the lab in vitro due to cost or safety reasons may differ vastly from those compounds and dosages $\mathcal{Z}$ that can be safely administered to patients in vivo. Alternatively, in recommender systems for books, the set of books $\mathcal{X}$ a user is queried about may be restricted to well known best-sellers even though the goal might be to recommend more esoteric titles $\mathcal{Z}$. In this paper, we provide instance-dependent lower bounds for the transductive setting, an algorithm that matches these up to logarithmic factors, and an evaluation. In particular, we provide the first non-asymptotic algorithm for linear bandits that nearly achieves the information theoretic lower bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a transductive linear bandit setting with instance-dependent lower bounds and a matching algorithm up to logs, and claims the first non-asymptotic near-optimal result for standard linear bandits when X=Z.

read the letter

The core contribution is the transductive formulation that separates the measurement set X from the target set Z, along with instance-dependent lower bounds and an algorithm that comes within logarithmic factors of those bounds. When X equals Z the result specializes to linear bandits, and the authors position it as the first non-asymptotic algorithm that nearly matches the information-theoretic lower bound in that case. The motivation with drug discovery and recommender systems is straightforward and shows where the separation between X and Z matters in practice. The abstract is explicit that X and Z are finite and known ahead of time, which lets the algorithm plan allocations and stopping rules; that precondition is not hidden and is required for the planning step that achieves the claimed match. Without the full proofs it is impossible to check the exact constants or whether the log factors hide dimension-dependent blowups, but the stated claims are modest and the circularity burden looks low. The work is aimed at researchers in bandit algorithms and sequential experimental design. It is a clean generalization with a concrete algorithmic result, so it deserves a serious referee even if the analysis needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The paper introduces the transductive linear bandit problem: given finite known sets X and Z of vectors in R^d, a fixed delta, and unknown theta*, identify argmax_{z in Z} z^T theta* with probability 1-delta using as few sequential noisy queries x^T theta* (x in X) as possible. It derives instance-dependent lower bounds on the number of queries, presents an algorithm that matches these bounds up to logarithmic factors, and specializes the result to linear bandits when X=Z. The central claim is that this yields the first non-asymptotic algorithm for linear bandits that nearly achieves the information-theoretic lower bound.

Significance. If the lower bounds and matching analysis hold, the result strengthens the theory of optimal sequential design for linear models by handling the transductive case where measurement and target sets differ. The non-asymptotic, instance-dependent guarantee and explicit matching (up to logs) are strengths, as is the explicit scope to finite known X and Z that enables the planning step. Applications to settings like drug discovery are noted but not central to the contribution.

major comments (2)

The abstract and introduction state that X and Z are finite and known in advance; this precondition is load-bearing for the algorithm's ability to precompute optimal allocations and stopping thresholds that achieve the near-matching to the lower bound. No extension removing this assumption is claimed, so the result is correctly scoped but the 'first non-asymptotic nearly matching' claim for linear bandits holds only inside this finite-known regime.
The lower-bound derivation and algorithm analysis must be checked for whether the log-factor gap is unavoidable or an artifact of the proof technique; if the paper's main theorem (likely in the analysis section) shows an explicit constant or improved log term under additional assumptions, that should be highlighted as it affects how close 'nearly achieves' is.

minor comments (2)

Notation for the transductive setting (X vs Z) should be introduced with a clear diagram or example early in the paper to aid readers coming from standard linear bandits.
The evaluation section would benefit from explicit comparison tables showing query complexity versus existing linear-bandit algorithms on the same instances.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive comments. We respond to each major comment below.

read point-by-point responses

Referee: The abstract and introduction state that X and Z are finite and known in advance; this precondition is load-bearing for the algorithm's ability to precompute optimal allocations and stopping thresholds that achieve the near-matching to the lower bound. No extension removing this assumption is claimed, so the result is correctly scoped but the 'first non-asymptotic nearly matching' claim for linear bandits holds only inside this finite-known regime.

Authors: We agree that the finiteness and a priori knowledge of both X and Z are essential to the algorithm, as they enable the offline computation of the optimal allocation and the stopping thresholds used to achieve the near-matching guarantee. The manuscript already scopes all results—including the specialization to linear bandits when X=Z—to this finite-known setting, with no claims made about extensions that remove the assumption. The 'first non-asymptotic nearly matching' statement is therefore already correctly limited to the regime in which the paper operates. revision: no
Referee: The lower-bound derivation and algorithm analysis must be checked for whether the log-factor gap is unavoidable or an artifact of the proof technique; if the paper's main theorem (likely in the analysis section) shows an explicit constant or improved log term under additional assumptions, that should be highlighted as it affects how close 'nearly achieves' is.

Authors: The logarithmic factors originate from the non-asymptotic concentration inequalities and union bounds employed in the stopping-time analysis over the finite sets. The main theorem states the bound with these explicit log terms and does not provide improved dependence under any additional assumptions. While a tighter characterization of whether the gap is information-theoretically necessary would be valuable, the current analysis already yields the first non-asymptotic guarantee that matches the lower bound up to logarithmic factors. revision: no

standing simulated objections not resolved

Determining whether the logarithmic gap is information-theoretically unavoidable or an artifact of the current proof technique would require a refined lower-bound analysis beyond the scope of this work.

Circularity Check

0 steps flagged

No significant circularity; lower bounds and matching algorithm are independently derived

full rationale

The paper states finite known X and Z as modeling assumptions in the problem definition, derives instance-dependent information-theoretic lower bounds, and constructs an algorithm whose sample complexity matches those bounds up to log factors. No quoted step reduces a claimed prediction or optimality result to a fitted parameter, self-definition, or load-bearing self-citation chain. The non-asymptotic guarantee for linear bandits is presented as a construction that achieves the external lower bound rather than re-deriving it from its own outputs. This is the normal case of a self-contained derivation against information-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities used in the bounds or algorithm.

pith-pipeline@v0.9.0 · 5845 in / 1128 out tokens · 42094 ms · 2026-05-25T19:44:13.154784+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Logging Policy Design for Off-Policy Evaluation
stat.ML 2026-05 unverdicted novelty 7.0

Derives optimal logging policies for off-policy evaluation by balancing reward concentration against action coverage in known, unknown, and partially known regimes of target policy and rewards.
Logging Policy Design for Off-Policy Evaluation
stat.ML 2026-05 unverdicted novelty 5.0

Derives optimal logging policies for minimizing off-policy evaluation error under known, unknown, and partially known target policies and reward distributions.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InAdvances in Neural Information Processing Systems, pages 2312–2320, 2011

work page 2011
[2]

Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach

Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach.arXiv preprint arXiv:1711.05174, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

Best arm identiﬁcation in multi-armed bandits

Jean-Yves Audibert and Sébastien Bubeck. Best arm identiﬁcation in multi-armed bandits. In Conference on Learning Theory, pages 41–53, 2010

work page 2010
[4]

Using conﬁdence bounds for exploitation-exploration trade-oﬀs.Journal of Machine Learning Research, 3:397–422, 2002

Peter Auer. Using conﬁdence bounds for exploitation-exploration trade-oﬀs.Journal of Machine Learning Research, 3:397–422, 2002

work page 2002
[5]

Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

work page 2003
[6]

Pure exploration in multi-armed bandits problems

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. InInternational Conference on Algorithmic Learning Theory, pages 23–37, 2009

work page 2009
[7]

Disagreement-Based Combinatorial Pure Exploration: Sample Complexity Bounds and an Efficient Algorithm

Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Eﬃcient algorithms and an analysis with localization. arXiv preprint arXiv:1711.08018, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

Pure exploration of multi-armed bandit under matroid constraints

Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. InConference on Learning Theory, pages 647–669, 2016

work page 2016
[9]

Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration

Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal samplingalgorithmsforcombinatorialpureexploration. arXiv preprint arXiv:1706.01081, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[10]

On the Optimal Sample Complexity for Best Arm Identification

Lijie Chen and Jian Li. On the optimal sample complexity for best arm identiﬁcation. arXiv preprint arXiv:1511.03774, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[11]

Combinatorial pure exploration of multi-armed bandits

Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. InAdvances in Neural Information Processing Systems, pages 379–387, 2014

work page 2014
[12]

Stochastic linear optimization under bandit feedback

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008

work page 2008
[13]

Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.Journal of machine learning research, 7(Jun):1079–1105, 2006

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.Journal of machine learning research, 7(Jun):1079–1105, 2006

work page 2006
[14]

Best arm identiﬁ- cation: A uniﬁed approach to ﬁxed budget and ﬁxed conﬁdence

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identiﬁ- cation: A uniﬁed approach to ﬁxed budget and ﬁxed conﬁdence. InAdvances in Neural Information Processing Systems, pages 3212–3220, 2012. 11

work page 2012
[15]

On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning

Matthew Hoﬀman, Bobak Shahriari, and Nando Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In International Conference on Artiﬁcial Intelligence and Statistics, pages 365–374, 2014

work page 2014
[16]

Revisiting frank-wolfe: Projection-free sparse convex optimization

Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML (1), pages 427–435, 2013

work page 2013
[17]

lil’ucb: An optimal exploration algorithm for multi-armed bandits

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. InConference on Learning Theory, pages 423–439, 2014

work page 2014
[18]

Almost optimal exploration in multi- armed bandits

Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi- armed bandits. InInternational Conference on Machine Learning, pages 1238–1246, 2013

work page 2013
[19]

Veriﬁcation based solution for structured mab problems

Zohar S Karnin. Veriﬁcation based solution for structured mab problems. InAdvances in Neural Information Processing Systems, pages 145–153, 2016

work page 2016
[20]

On the complexity of best-arm identiﬁcation in multi-armed bandit models.Journal of Machine Learning Research, 17:1–42, 2016

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identiﬁcation in multi-armed bandit models.Journal of Machine Learning Research, 17:1–42, 2016

work page 2016
[21]

Best Arm Identification in Generalized Linear Bandits

Abbas Kazerouni and Lawrence M Wein. Best arm identiﬁcation in generalized linear bandits. arXiv preprint arXiv:1905.08224, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[22]

The equivalence of two extremum problems.Canadian Journal of Mathematics, 12:363–366, 1960

Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems.Canadian Journal of Mathematics, 12:363–366, 1960

work page 1960
[23]

The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of ﬁnite-armed linear bandits.arXiv preprint arXiv:1610.04491, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[24]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010

work page 2010
[25]

SIAM, 2006

Friedrich Pukelsheim.Optimal design of experiments. SIAM, 2006

work page 2006
[26]

Princeton university press, 2015

Ralph Tyrell Rockafellar.Convex analysis. Princeton university press, 2015

work page 2015
[27]

Sequential resource allocation in linear stochastic bandits

Marta Soare. Sequential resource allocation in linear stochastic bandits. PhD thesis, Université Lille 1-Sciences et Technologies, 2015

work page 2015
[28]

Best-arm identiﬁcation in linear bandits

Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identiﬁcation in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014

work page 2014
[29]

Best arm identiﬁcation in linear bandits with linear dimension dependency

Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identiﬁcation in linear bandits with linear dimension dependency. InInternational Conference on Machine Learning, pages 4884–4893, 2018

work page 2018
[30]

A. B. Tsybakov. Optimal aggregation of classiﬁers in statistical learning.Annals of Statistics, pages 135–166, 2004. 12

work page 2004
[31]

A fully adaptive algorithm for pure exploration in linear bandits

Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. InInternational Conference on Artiﬁcial Intelligence and Statistics, pages 843–851, 2018

work page 2018
[32]

Active learning via transductive experimental design

Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd international conference on Machine learning, pages 1081–1088. ACM, 2006. 13 A Proof of Theorem 2 Proof. Let the good event for thetth round of Algorithm 1 be Et := { Nt≤ max{⌈8(2t+1)2ρ(Y(St))(1+ϵ) log(|Z| 2 δt )⌉, r(ϵ) }} ∩{ z∗∈ ˆZt+1}∩{ ˆ...

work page 2006
[33]

, Np are positive integers constrained such that∑ i≤p Ni≥ N

Given the number of samplesN to obtain and the cardinality of the support ofλ, samples to allocate to arms in the support ofλ are computed usingNi =⌈(N− 1 2 p)λi⌉, where N1, N2, . . . , Np are positive integers constrained such that∑ i≤p Ni≥ N. 16

work page
[34]

Following the previous phase of the rounding procedure, loop until the discrepancy (∑ i≤p Ni)− N = 0, from either decreasing a sample countNj which obtains (Nj− 1)/λj = mini≤p(Ni− 1)/λi to Nj + 1, or increasing a sample countNj which obtains Nj/λj = maxi≤p Ni/λi to Nj− 1. The eﬃcient design apportionment theorem in Section 12.5 of [25] provides the founda...

work page
[35]

Because (1− cos(α))2≈ α4/4 and these inequalities must hold for allj = 1,

this hypothesis test requiresNj + Nj+d/2≥ c log(1/δ) (1−cos(α))2 for some universal constant c > 0. Because (1− cos(α))2≈ α4/4 and these inequalities must hold for allj = 1, . . . , d/2 simultaneously for the single static allocation, we obtain the result. 18 E Proof of Lemma 1 Proof. ρ(Y) = min λ∈ |X| max y∈Y ∥y∥2 (∑ x∈Xλxxx⊤)−1 = 1 γ2 Y min λ∈ |X| max y...

work page

[1] [1]

Improved algorithms for linear stochastic bandits

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. InAdvances in Neural Information Processing Systems, pages 2312–2320, 2011

work page 2011

[2] [2]

Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach

Zeyuan Allen-Zhu, Yuanzhi Li, Aarti Singh, and Yining Wang. Near-optimal discrete optimization for experimental design: A regret minimization approach.arXiv preprint arXiv:1711.05174, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

Best arm identiﬁcation in multi-armed bandits

Jean-Yves Audibert and Sébastien Bubeck. Best arm identiﬁcation in multi-armed bandits. In Conference on Learning Theory, pages 41–53, 2010

work page 2010

[4] [4]

Using conﬁdence bounds for exploitation-exploration trade-oﬀs.Journal of Machine Learning Research, 3:397–422, 2002

Peter Auer. Using conﬁdence bounds for exploitation-exploration trade-oﬀs.Journal of Machine Learning Research, 3:397–422, 2002

work page 2002

[5] [5]

Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization.Operations Research Letters, 31(3):167–175, 2003

work page 2003

[6] [6]

Pure exploration in multi-armed bandits problems

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. InInternational Conference on Algorithmic Learning Theory, pages 23–37, 2009

work page 2009

[7] [7]

Disagreement-Based Combinatorial Pure Exploration: Sample Complexity Bounds and an Efficient Algorithm

Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Eﬃcient algorithms and an analysis with localization. arXiv preprint arXiv:1711.08018, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Pure exploration of multi-armed bandit under matroid constraints

Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. InConference on Learning Theory, pages 647–669, 2016

work page 2016

[9] [9]

Nearly Optimal Sampling Algorithms for Combinatorial Pure Exploration

Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal samplingalgorithmsforcombinatorialpureexploration. arXiv preprint arXiv:1706.01081, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[10] [10]

On the Optimal Sample Complexity for Best Arm Identification

Lijie Chen and Jian Li. On the optimal sample complexity for best arm identiﬁcation. arXiv preprint arXiv:1511.03774, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[11] [11]

Combinatorial pure exploration of multi-armed bandits

Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. InAdvances in Neural Information Processing Systems, pages 379–387, 2014

work page 2014

[12] [12]

Stochastic linear optimization under bandit feedback

Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. 2008

work page 2008

[13] [13]

Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.Journal of machine learning research, 7(Jun):1079–1105, 2006

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems.Journal of machine learning research, 7(Jun):1079–1105, 2006

work page 2006

[14] [14]

Best arm identiﬁ- cation: A uniﬁed approach to ﬁxed budget and ﬁxed conﬁdence

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identiﬁ- cation: A uniﬁed approach to ﬁxed budget and ﬁxed conﬁdence. InAdvances in Neural Information Processing Systems, pages 3212–3220, 2012. 11

work page 2012

[15] [15]

On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning

Matthew Hoﬀman, Bobak Shahriari, and Nando Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In International Conference on Artiﬁcial Intelligence and Statistics, pages 365–374, 2014

work page 2014

[16] [16]

Revisiting frank-wolfe: Projection-free sparse convex optimization

Martin Jaggi. Revisiting frank-wolfe: Projection-free sparse convex optimization. In ICML (1), pages 427–435, 2013

work page 2013

[17] [17]

lil’ucb: An optimal exploration algorithm for multi-armed bandits

Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. InConference on Learning Theory, pages 423–439, 2014

work page 2014

[18] [18]

Almost optimal exploration in multi- armed bandits

Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi- armed bandits. InInternational Conference on Machine Learning, pages 1238–1246, 2013

work page 2013

[19] [19]

Veriﬁcation based solution for structured mab problems

Zohar S Karnin. Veriﬁcation based solution for structured mab problems. InAdvances in Neural Information Processing Systems, pages 145–153, 2016

work page 2016

[20] [20]

On the complexity of best-arm identiﬁcation in multi-armed bandit models.Journal of Machine Learning Research, 17:1–42, 2016

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identiﬁcation in multi-armed bandit models.Journal of Machine Learning Research, 17:1–42, 2016

work page 2016

[21] [21]

Best Arm Identification in Generalized Linear Bandits

Abbas Kazerouni and Lawrence M Wein. Best arm identiﬁcation in generalized linear bandits. arXiv preprint arXiv:1905.08224, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[22] [22]

The equivalence of two extremum problems.Canadian Journal of Mathematics, 12:363–366, 1960

Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems.Canadian Journal of Mathematics, 12:363–366, 1960

work page 1960

[23] [23]

The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of ﬁnite-armed linear bandits.arXiv preprint arXiv:1610.04491, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[24] [24]

A contextual-bandit approach to personalized news article recommendation

Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. InProceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010

work page 2010

[25] [25]

SIAM, 2006

Friedrich Pukelsheim.Optimal design of experiments. SIAM, 2006

work page 2006

[26] [26]

Princeton university press, 2015

Ralph Tyrell Rockafellar.Convex analysis. Princeton university press, 2015

work page 2015

[27] [27]

Sequential resource allocation in linear stochastic bandits

Marta Soare. Sequential resource allocation in linear stochastic bandits. PhD thesis, Université Lille 1-Sciences et Technologies, 2015

work page 2015

[28] [28]

Best-arm identiﬁcation in linear bandits

Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identiﬁcation in linear bandits. In Advances in Neural Information Processing Systems, pages 828–836, 2014

work page 2014

[29] [29]

Best arm identiﬁcation in linear bandits with linear dimension dependency

Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identiﬁcation in linear bandits with linear dimension dependency. InInternational Conference on Machine Learning, pages 4884–4893, 2018

work page 2018

[30] [30]

A. B. Tsybakov. Optimal aggregation of classiﬁers in statistical learning.Annals of Statistics, pages 135–166, 2004. 12

work page 2004

[31] [31]

A fully adaptive algorithm for pure exploration in linear bandits

Liyuan Xu, Junya Honda, and Masashi Sugiyama. A fully adaptive algorithm for pure exploration in linear bandits. InInternational Conference on Artiﬁcial Intelligence and Statistics, pages 843–851, 2018

work page 2018

[32] [32]

Active learning via transductive experimental design

Kai Yu, Jinbo Bi, and Volker Tresp. Active learning via transductive experimental design. In Proceedings of the 23rd international conference on Machine learning, pages 1081–1088. ACM, 2006. 13 A Proof of Theorem 2 Proof. Let the good event for thetth round of Algorithm 1 be Et := { Nt≤ max{⌈8(2t+1)2ρ(Y(St))(1+ϵ) log(|Z| 2 δt )⌉, r(ϵ) }} ∩{ z∗∈ ˆZt+1}∩{ ˆ...

work page 2006

[33] [33]

, Np are positive integers constrained such that∑ i≤p Ni≥ N

Given the number of samplesN to obtain and the cardinality of the support ofλ, samples to allocate to arms in the support ofλ are computed usingNi =⌈(N− 1 2 p)λi⌉, where N1, N2, . . . , Np are positive integers constrained such that∑ i≤p Ni≥ N. 16

work page

[34] [34]

Following the previous phase of the rounding procedure, loop until the discrepancy (∑ i≤p Ni)− N = 0, from either decreasing a sample countNj which obtains (Nj− 1)/λj = mini≤p(Ni− 1)/λi to Nj + 1, or increasing a sample countNj which obtains Nj/λj = maxi≤p Ni/λi to Nj− 1. The eﬃcient design apportionment theorem in Section 12.5 of [25] provides the founda...

work page

[35] [35]

Because (1− cos(α))2≈ α4/4 and these inequalities must hold for allj = 1,

this hypothesis test requiresNj + Nj+d/2≥ c log(1/δ) (1−cos(α))2 for some universal constant c > 0. Because (1− cos(α))2≈ α4/4 and these inequalities must hold for allj = 1, . . . , d/2 simultaneously for the single static allocation, we obtain the result. 18 E Proof of Lemma 1 Proof. ρ(Y) = min λ∈ |X| max y∈Y ∥y∥2 (∑ x∈Xλxxx⊤)−1 = 1 γ2 Y min λ∈ |X| max y...

work page