Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context

Alireza Rezaeimoghadam; Amir Mohammad Abouei; Mohammad Shahverdikondori; Negar Kiyavash

arxiv: 2502.03061 · v2 · submitted 2025-02-05 · 💻 cs.LG

Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context

Mohammad Shahverdikondori , Amir Mohammad Abouei , Alireza Rezaeimoghadam , Negar Kiyavash This is my paper

Pith reviewed 2026-05-23 04:18 UTC · model grok-4.3

classification 💻 cs.LG

keywords best arm identificationmulti-armed banditspost-action contextsample complexitypure explorationseparator modelnon-separator model

0 comments

The pith

Best arm identification with post-action context admits asymptotically optimal algorithms in both separator and non-separator models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces best arm identification in stochastic multi-armed bandits where the learner receives post-action context in addition to the reward. It distinguishes two models: separator, in which reward depends only on context, and non-separator, in which reward depends on both action and context. For each model it derives instance-dependent lower bounds on the number of samples required to identify the best arm with fixed confidence, then constructs algorithms that match those bounds asymptotically. It further proves that any procedure ignoring the post-action context must use strictly more samples on some instances.

Core claim

In the fixed-confidence pure-exploration setting, instance-dependent lower bounds on sample complexity hold for both the separator and non-separator post-action context models; the G-tracking algorithm achieves the separator bound by tracking context probabilities via the geometry of the context space, while an extension of Track-and-Stop achieves the non-separator bound.

What carries the argument

The separator versus non-separator distinction for post-action context, together with the G-tracking rule that directly tracks contexts rather than actions.

If this is right

Methods that discard post-action context are provably suboptimal in sample complexity for both models.
G-tracking attains the separator lower bound by exploiting the geometry of the context space.
Track-and-Stop extends directly to the non-separator model while preserving asymptotic optimality.
The derived lower bounds are tight, so no algorithm can improve the leading term of the sample complexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The geometry-tracking idea may extend to other pure-exploration tasks whose observation space has exploitable structure beyond scalar rewards.
If context spaces are continuous rather than discrete, discretization or kernel methods could be needed to retain the same asymptotic guarantees.
In applications where context is cheap to obtain but actions are costly, the sample-complexity savings translate directly into fewer expensive actions.

Load-bearing premise

After every action the learner always receives the post-action context, and the reward is generated exactly according to either the separator or the non-separator dependence on that context.

What would settle it

An instance of the separator or non-separator model in which any algorithm that ignores the post-action context matches the sample complexity of G-tracking or extended Track-and-Stop up to lower-order terms.

Figures

Figures reproduced from arXiv: 2502.03061 by Alireza Rezaeimoghadam, Amir Mohammad Abouei, Mohammad Shahverdikondori, Negar Kiyavash.

**Figure 2.** Figure 2: Illustration of G-tracking rule for an instance with [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: The results of different algorithms for an instance [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Illustration of the relative positions of points in the proof of Lemma [PITH_FULL_IMAGE:figures/full_fig_p029_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of the L 2 distance of the frequencies of pulled arms and the optimal frequency over time between two algorithms. 35 [PITH_FULL_IMAGE:figures/full_fig_p035_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of the stopping times of different algorithms on randomly generated instances. [PITH_FULL_IMAGE:figures/full_fig_p036_6.png] view at source ↗

**Figure 7.** Figure 7: Comparison of the L 2 distance of the frequencies of observed contexts and the optimal frequency over time among different algorithms. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_7.png] view at source ↗

read the original abstract

We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) separator, where the reward depends solely on the context, and (ii) non-separator, where the reward depends on both the action and the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. Moreover, in both settings, we theoretically and empirically show that algorithms that ignore the post-action context are sub-optimal. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adds post-action context to best-arm identification, splits it into separator and non-separator cases, and supplies matching lower bounds plus algorithms that hit them asymptotically.

read the letter

The main thing to know is that the authors define BAI with post-action context, distinguish separator (reward depends only on context) from non-separator (reward depends on both), derive instance-dependent lower bounds for each, and give algorithms that achieve those bounds asymptotically: G-tracking for the separator case and an extension of Track-and-Stop for the non-separator case. They also show that algorithms ignoring the context are suboptimal both in theory and in the experiments they run.

Referee Report

0 major / 2 minor

Summary. The paper introduces best arm identification (BAI) with post-action context in stochastic multi-armed bandits under fixed confidence. It distinguishes separator contexts (reward depends only on context) from non-separator contexts (reward depends on both action and context). For both settings, instance-dependent lower bounds on sample complexity are derived, and algorithms are proposed that asymptotically match these bounds: G-tracking (which tracks contexts geometrically) for the separator case, and an extension of Track-and-Stop for the non-separator case. The work also shows that algorithms ignoring post-action context are sub-optimal, both theoretically and empirically.

Significance. If the lower-bound derivations and asymptotic optimality proofs hold, the paper meaningfully extends pure exploration theory to richer feedback models beyond rewards alone. The separator/non-separator distinction and the geometry-aware G-tracking rule are technically interesting contributions. Credit is due for providing matching upper and lower bounds rather than only algorithmic heuristics, and for the empirical demonstration that context-ignoring baselines are provably sub-optimal.

minor comments (2)

In the problem formulation, the precise measurability assumptions on the context space and the support of the context distribution should be stated explicitly to make the lower-bound constructions fully rigorous.
The empirical section would benefit from reporting the number of independent runs and confidence intervals on the sample-complexity plots to allow direct comparison with the theoretical rates.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not raise any specific major comments, so we have no individual points to rebut. We will incorporate any minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a new BAI problem variant with post-action context under explicit separator and non-separator models, derives instance-dependent lower bounds from those models, and shows asymptotic matching via G-tracking (new) and Track-and-Stop extension (standard). No step reduces a claimed prediction or optimality result to a fitted quantity from the same data, a self-citation chain, or a definitional tautology; the modeling assumptions are stated upfront and used consistently in the bounds and analyses without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5764 in / 1130 out tokens · 72812 ms · 2026-05-23T04:18:21.130866+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Active Context Selection Improves Simple Regret in Contextual Bandits
cs.LG 2026-05 accept novelty 7.0

Active sampling with allocation q_j proportional to p_j to the 2/3 achieves tight regret sqrt(n/T) times norm of p to the 2/3 for known context distribution p, with improvement up to Theta(k to the 1/4) over passive sampling.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

Improved algorithms for linear stochastic bandits

Abbasi-Yadkori, Y., P \'a l, D., and Szepesv \'a ri, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

work page 2011
[3]

A rewriting system for convex optimization problems

Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5 0 (1): 0 42--60, 2018

work page 2018
[4]

and Proutiere, A

Al Marjani, A. and Proutiere, A. Adaptive sampling for best policy identification in markov decision processes. In International Conference on Machine Learning, pp.\ 7459--7468. PMLR, 2021

work page 2021
[5]

Navigating to the best policy in markov decision processes

Al Marjani, A., Garivier, A., and Proutiere, A. Navigating to the best policy in markov decision processes. Advances in Neural Information Processing Systems, 34: 0 25852--25864, 2021

work page 2021
[6]

arXiv preprint arXiv:2311.05638 , year=

Al-Marjani, A., Tirinzoni, A., and Kaufmann, E. Towards instance-optimality in online pac reinforcement learning. arXiv preprint arXiv:2311.05638, 2023

work page arXiv 2023
[7]

and Bubeck, S

Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010, pp.\ 13--p, 2010

work page 2010
[8]

Using confidence bounds for exploitation-exploration trade-offs

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3 0 (Nov): 0 397--422, 2002

work page 2002
[9]

A non-asymptotic approach to best-arm identification for gaussian bandits

Barrier, A., Garivier, A., and Koc \'a k, T. A non-asymptotic approach to best-arm identification for gaussian bandits. In International Conference on Artificial Intelligence and Statistics, pp.\ 10078--10109. PMLR, 2022

work page 2022
[10]

Topological spaces: including a triatment of mltivalued functions, vector spaces and convexity

Berge, C. Topological spaces: including a triatment of mltivalued functions, vector spaces and convexity. Oliver and Boyd, 1963

work page 1963
[11]

and Zeevi, A

Besbes, O. and Zeevi, A. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations research, 57 0 (6): 0 1407--1420, 2009

work page 2009
[12]

Adaptively exploiting d-separators with causal bandits

Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35: 0 20381--20392, 2022

work page 2022
[13]

H., Jain, L., and Jamieson, K

Camilleri, R., Wagenmaker, A., Morgenstern, J. H., Jain, L., and Jamieson, K. G. Active learning with safety constraints. Advances in Neural Information Processing Systems, 35: 0 33201--33214, 2022

work page 2022
[14]

Pure exploration in bandits with linear constraints

Carlsson, E., Basu, D., Johansson, F., and Dubhashi, D. Pure exploration in bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pp.\ 334--342. PMLR, 2024

work page 2024
[15]

and Locatelli, A

Carpentier, A. and Locatelli, A. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp.\ 590--604. PMLR, 2016

work page 2016
[16]

Nearly optimal sampling algorithms for combinatorial pure exploration

Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp.\ 482--534. PMLR, 2017

work page 2017
[17]

and Koolen, W

Degenne, R. and Koolen, W. M. Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[18]

Gamification of pure exploration for linear bandits

Degenne, R., M \'e nard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp.\ 2432--2442. PMLR, 2020

work page 2020
[19]

and Boyd, S

Diamond, S. and Boyd, S. CVXPY : A P ython-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17 0 (83): 0 1--5, 2016

work page 2016
[20]

M., and Restelli, M

Eldowa, K., Cesa-Bianchi, N., Metelli, A. M., and Restelli, M. Information capacity regret bounds for bandits with mediator feedback. arXiv preprint arXiv:2402.10282, 2024

work page arXiv 2024
[21]

Faizal, F. Z. and Nair, J. Constrained pure exploration multi-armed bandits with a fixed budget. arXiv preprint arXiv:2211.14768, 2022

work page arXiv 2022
[22]

G., and Ratliff, L

Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. Advances in neural information processing systems, 32, 2019

work page 2019
[23]

Best arm identification: A unified approach to fixed budget and fixed confidence

Gabillon, V., Ghavamzadeh, M., and Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems, 25, 2012

work page 2012
[24]

Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations

Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20 0 (5): 0 1466--1478, 2012

work page 2012
[25]

and Capp \'e , O

Garivier, A. and Capp \'e , O. The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual conference on learning theory, pp.\ 359--376. JMLR Workshop and Conference Proceedings, 2011

work page 2011
[26]

and Kaufmann, E

Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.\ 998--1027. PMLR, 2016

work page 2016
[27]

Achieving counterfactual fairness for causal bandit

Huang, W., Zhang, L., and Wu, X. Achieving counterfactual fairness for causal bandit. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pp.\ 6952--6959, 2022

work page 2022
[28]

Revisiting Frank-Wolfe : Projection-free sparse convex optimization

Jaggi, M. Revisiting Frank-Wolfe : Projection-free sparse convex optimization. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp.\ 427--435, Atlanta, Georgia, USA, 17--19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/jaggi13.html

work page 2013
[29]

and Nowak, R

Jamieson, K. and Nowak, R. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th annual conference on information sciences and systems (CISS), pp.\ 1--6. IEEE, 2014

work page 2014
[30]

Confounded budgeted causal bandits

Jamshidi, F., Etesami, J., and Kiyavash, N. Confounded budgeted causal bandits. In Causal Learning and Reasoning, pp.\ 423--461. PMLR, 2024

work page 2024
[31]

and Proutiere, A

Jedra, Y. and Proutiere, A. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33: 0 10007--10017, 2020

work page 2020
[32]

and Krishnasamy, S

Juneja, S. and Krishnasamy, S. Sample complexity of partition identification using multi-armed bandits. In Conference on Learning Theory, pp.\ 1824--1852. PMLR, 2019

work page 2019
[33]

Pac subset selection in stochastic multi-armed bandits

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp.\ 655--662, 2012

work page 2012
[34]

Almost optimal exploration in multi-armed bandits

Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In International conference on machine learning, pp.\ 1238--1246. PMLR, 2013

work page 2013
[35]

and Ariu, K

Kato, M. and Ariu, K. The role of contextual information in best arm identification. arXiv preprint arXiv:2106.14077, 2021

work page arXiv 2021
[36]

Contributions to the Optimal Solution of Several Bandit Problems

Kaufmann, E. Contributions to the Optimal Solution of Several Bandit Problems. PhD thesis, Universit \'e de Lille, 2020

work page 2020
[37]

and Kalyanakrishnan, S

Kaufmann, E. and Kalyanakrishnan, S. Information complexity in bandit subset selection. In Conference on Learning Theory, pp.\ 228--251. PMLR, 2013

work page 2013
[38]

and Koolen, W

Kaufmann, E. and Koolen, W. M. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22 0 (246): 0 1--44, 2021

work page 2021
[39]

On the complexity of best-arm identification in multi-armed bandit models

Kaufmann, E., Capp \'e , O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17 0 (1): 0 1--42, 2016

work page 2016
[40]

M., and Garivier, A

Kaufmann, E., Koolen, W. M., and Garivier, A. Sequential test for the lowest mean: From thompson to murphy sampling. Advances in Neural Information Processing Systems, 31, 2018

work page 2018
[41]

and Leighton, T

Kleinberg, R. and Leighton, T. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pp.\ 594--605. IEEE, 2003

work page 2003
[42]

and Zhang, T

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20 0 (1): 0 96--1, 2007

work page 2007
[43]

Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016

work page 2016
[44]

and Szepesv \'a ri, C

Lattimore, T. and Szepesv \'a ri, C. Bandit algorithms. Cambridge University Press, 2020

work page 2020
[45]

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp.\ 661--670, 2010

work page 2010
[46]

G., Jain, L., et al

Li, Z., Ratliff, L., Jamieson, K. G., Jain, L., et al. Instance-optimal pac algorithms for contextual bandits. Advances in Neural Information Processing Systems, 35: 0 37590--37603, 2022

work page 2022
[47]

Liu, Z., Attias, I., and Roy, D. M. Causal bandits: The pareto optimal frontier of adaptivity, a reduction to linear bandits, and limitations around unknown marginals. arXiv preprint arXiv:2407.00950, 2024

work page arXiv 2024
[48]

Regret analysis of bandit problems with causal background knowledge

Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp.\ 141--150. PMLR, 2020

work page 2020
[49]

Causal contextual bandits with adaptive context

Madhavan, R., Maiti, A., Sinha, G., and Barman, S. Causal contextual bandits with adaptive context. arXiv preprint arXiv:2405.18626, 2024

work page arXiv 2024
[50]

and Tsitsiklis, J

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5 0 (Jun): 0 623--648, 2004

work page 2004
[51]

Finding all -good arms in stochastic bandits

Mason, B., Jain, L., Tripathy, A., and Nowak, R. Finding all -good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33: 0 20707--20718, 2020

work page 2020
[52]

Gradient Ascent for Active Exploration in Bandit Problems

M \'e nard, P. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905
[53]

D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M

M \'e nard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp.\ 7599--7608. PMLR, 2021

work page 2021
[54]

Mohammadi Zaki, A. M. and Gopalan, A. Improved pure exploration in linear bandits with no-regret learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp.\ 3709--3715, 2022

work page 2022
[55]

Sample complexity reduction via policy difference estimation in tabular reinforcement learning

Narang, A., Wagenmaker, A., Ratliff, L., and Jamieson, K. Sample complexity reduction via policy difference estimation in tabular reinforcement learning. arXiv preprint arXiv:2406.06856, 2024

work page arXiv 2024
[56]

M., and Restelli, M

Poiani, R., Metelli, A. M., and Restelli, M. Pure exploration under mediators' feedback. arXiv preprint arXiv:2308.15552, 2023

work page arXiv 2023
[57]

S., Karthik, P., Karamchandani, N., and Nair, J

Reddy, K. S., Karthik, P., Karamchandani, N., and Nair, J. Best arm identification in bandits with limited precision sampling. In 2023 IEEE International Symposium on Information Theory (ISIT), pp.\ 1466--1471. IEEE, 2023

work page 2023
[58]

Best-arm identification in linear bandits

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014

work page 2014
[59]

Stephens, C. J. Pure exploration in multi-armed bandits. 2023

work page 2023
[60]

A bad arm existence checking problem: How to utilize asymmetric problem structure? Machine learning, 109 0 (2): 0 327--372, 2020

Tabata, K., Nakamura, A., Honda, J., and Komatsuzaki, T. A bad arm existence checking problem: How to utilize asymmetric problem structure? Machine learning, 109 0 (2): 0 327--372, 2020

work page 2020
[61]

Pure exploration for constrained best mixed arm identification with a fixed budget

Tang, D., Jain, R., Nayyar, A., and Nuzzo, P. Pure exploration for constrained best mixed arm identification with a fixed budget. arXiv preprint arXiv:2405.15090, 2024

work page arXiv 2024
[62]

and Murphy, S

Tewari, A. and Murphy, S. A. From ads to interventions: Contextual bandits in mobile health. Mobile health: sensors, analytic methods, and applications, pp.\ 495--517, 2017

work page 2017
[63]

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25: 0 285--294, 1933

work page 1933
[64]

Causal bandits for linear structural equation models

Varici, B., Shanmugam, K., Sattigeri, P., and Tajer, A. Causal bandits for linear structural equation models. Journal of Machine Learning Research, 24 0 (297): 0 1--59, 2023

work page 2023
[65]

J., Simchowitz, M., and Jamieson, K

Wagenmaker, A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent pac reinforcement learning. In Conference on Learning Theory, pp.\ 358--418. PMLR, 2022

work page 2022
[66]

Fairness of exposure in stochastic bandits

Wang, L., Bai, Y., Sun, W., and Joachims, T. Fairness of exposure in stochastic bandits. In International Conference on Machine Learning, pp.\ 10686--10696. PMLR, 2021 a

work page 2021
[67]

Fast pure exploration via frank-wolfe

Wang, P.-A., Tzeng, R.-C., and Proutiere, A. Fast pure exploration via frank-wolfe. Advances in Neural Information Processing Systems, 34: 0 5810--5821, 2021 b

work page 2021
[68]

and Chen, W

Xiong, N. and Chen, W. Combinatorial pure exploration of causal bandits. arXiv preprint arXiv:2206.07883, 2022

work page arXiv 2022
[69]

A fully adaptive algorithm for pure exploration in linear bandits

Xu, L., Honda, J., and Sugiyama, M. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pp.\ 843--851. PMLR, 2018

work page 2018
[70]

Robust causal bandits for linear models

Yan, Z., Mukherjee, A., Var c , B., and Tajer, A. Robust causal bandits for linear models. IEEE Journal on Selected Areas in Information Theory, 2024

work page 2024

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

Improved algorithms for linear stochastic bandits

Abbasi-Yadkori, Y., P \'a l, D., and Szepesv \'a ri, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011

work page 2011

[3] [3]

A rewriting system for convex optimization problems

Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5 0 (1): 0 42--60, 2018

work page 2018

[4] [4]

and Proutiere, A

Al Marjani, A. and Proutiere, A. Adaptive sampling for best policy identification in markov decision processes. In International Conference on Machine Learning, pp.\ 7459--7468. PMLR, 2021

work page 2021

[5] [5]

Navigating to the best policy in markov decision processes

Al Marjani, A., Garivier, A., and Proutiere, A. Navigating to the best policy in markov decision processes. Advances in Neural Information Processing Systems, 34: 0 25852--25864, 2021

work page 2021

[6] [6]

arXiv preprint arXiv:2311.05638 , year=

Al-Marjani, A., Tirinzoni, A., and Kaufmann, E. Towards instance-optimality in online pac reinforcement learning. arXiv preprint arXiv:2311.05638, 2023

work page arXiv 2023

[7] [7]

and Bubeck, S

Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010, pp.\ 13--p, 2010

work page 2010

[8] [8]

Using confidence bounds for exploitation-exploration trade-offs

Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3 0 (Nov): 0 397--422, 2002

work page 2002

[9] [9]

A non-asymptotic approach to best-arm identification for gaussian bandits

Barrier, A., Garivier, A., and Koc \'a k, T. A non-asymptotic approach to best-arm identification for gaussian bandits. In International Conference on Artificial Intelligence and Statistics, pp.\ 10078--10109. PMLR, 2022

work page 2022

[10] [10]

Topological spaces: including a triatment of mltivalued functions, vector spaces and convexity

Berge, C. Topological spaces: including a triatment of mltivalued functions, vector spaces and convexity. Oliver and Boyd, 1963

work page 1963

[11] [11]

and Zeevi, A

Besbes, O. and Zeevi, A. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations research, 57 0 (6): 0 1407--1420, 2009

work page 2009

[12] [12]

Adaptively exploiting d-separators with causal bandits

Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35: 0 20381--20392, 2022

work page 2022

[13] [13]

H., Jain, L., and Jamieson, K

Camilleri, R., Wagenmaker, A., Morgenstern, J. H., Jain, L., and Jamieson, K. G. Active learning with safety constraints. Advances in Neural Information Processing Systems, 35: 0 33201--33214, 2022

work page 2022

[14] [14]

Pure exploration in bandits with linear constraints

Carlsson, E., Basu, D., Johansson, F., and Dubhashi, D. Pure exploration in bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pp.\ 334--342. PMLR, 2024

work page 2024

[15] [15]

and Locatelli, A

Carpentier, A. and Locatelli, A. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp.\ 590--604. PMLR, 2016

work page 2016

[16] [16]

Nearly optimal sampling algorithms for combinatorial pure exploration

Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp.\ 482--534. PMLR, 2017

work page 2017

[17] [17]

and Koolen, W

Degenne, R. and Koolen, W. M. Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems, 32, 2019

work page 2019

[18] [18]

Gamification of pure exploration for linear bandits

Degenne, R., M \'e nard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp.\ 2432--2442. PMLR, 2020

work page 2020

[19] [19]

and Boyd, S

Diamond, S. and Boyd, S. CVXPY : A P ython-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17 0 (83): 0 1--5, 2016

work page 2016

[20] [20]

M., and Restelli, M

Eldowa, K., Cesa-Bianchi, N., Metelli, A. M., and Restelli, M. Information capacity regret bounds for bandits with mediator feedback. arXiv preprint arXiv:2402.10282, 2024

work page arXiv 2024

[21] [21]

Faizal, F. Z. and Nair, J. Constrained pure exploration multi-armed bandits with a fixed budget. arXiv preprint arXiv:2211.14768, 2022

work page arXiv 2022

[22] [22]

G., and Ratliff, L

Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. Advances in neural information processing systems, 32, 2019

work page 2019

[23] [23]

Best arm identification: A unified approach to fixed budget and fixed confidence

Gabillon, V., Ghavamzadeh, M., and Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems, 25, 2012

work page 2012

[24] [24]

Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations

Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20 0 (5): 0 1466--1478, 2012

work page 2012

[25] [25]

and Capp \'e , O

Garivier, A. and Capp \'e , O. The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual conference on learning theory, pp.\ 359--376. JMLR Workshop and Conference Proceedings, 2011

work page 2011

[26] [26]

and Kaufmann, E

Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.\ 998--1027. PMLR, 2016

work page 2016

[27] [27]

Achieving counterfactual fairness for causal bandit

Huang, W., Zhang, L., and Wu, X. Achieving counterfactual fairness for causal bandit. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pp.\ 6952--6959, 2022

work page 2022

[28] [28]

Revisiting Frank-Wolfe : Projection-free sparse convex optimization

Jaggi, M. Revisiting Frank-Wolfe : Projection-free sparse convex optimization. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp.\ 427--435, Atlanta, Georgia, USA, 17--19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/jaggi13.html

work page 2013

[29] [29]

and Nowak, R

Jamieson, K. and Nowak, R. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th annual conference on information sciences and systems (CISS), pp.\ 1--6. IEEE, 2014

work page 2014

[30] [30]

Confounded budgeted causal bandits

Jamshidi, F., Etesami, J., and Kiyavash, N. Confounded budgeted causal bandits. In Causal Learning and Reasoning, pp.\ 423--461. PMLR, 2024

work page 2024

[31] [31]

and Proutiere, A

Jedra, Y. and Proutiere, A. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33: 0 10007--10017, 2020

work page 2020

[32] [32]

and Krishnasamy, S

Juneja, S. and Krishnasamy, S. Sample complexity of partition identification using multi-armed bandits. In Conference on Learning Theory, pp.\ 1824--1852. PMLR, 2019

work page 2019

[33] [33]

Pac subset selection in stochastic multi-armed bandits

Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp.\ 655--662, 2012

work page 2012

[34] [34]

Almost optimal exploration in multi-armed bandits

Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In International conference on machine learning, pp.\ 1238--1246. PMLR, 2013

work page 2013

[35] [35]

and Ariu, K

Kato, M. and Ariu, K. The role of contextual information in best arm identification. arXiv preprint arXiv:2106.14077, 2021

work page arXiv 2021

[36] [36]

Contributions to the Optimal Solution of Several Bandit Problems

Kaufmann, E. Contributions to the Optimal Solution of Several Bandit Problems. PhD thesis, Universit \'e de Lille, 2020

work page 2020

[37] [37]

and Kalyanakrishnan, S

Kaufmann, E. and Kalyanakrishnan, S. Information complexity in bandit subset selection. In Conference on Learning Theory, pp.\ 228--251. PMLR, 2013

work page 2013

[38] [38]

and Koolen, W

Kaufmann, E. and Koolen, W. M. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22 0 (246): 0 1--44, 2021

work page 2021

[39] [39]

On the complexity of best-arm identification in multi-armed bandit models

Kaufmann, E., Capp \'e , O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17 0 (1): 0 1--42, 2016

work page 2016

[40] [40]

M., and Garivier, A

Kaufmann, E., Koolen, W. M., and Garivier, A. Sequential test for the lowest mean: From thompson to murphy sampling. Advances in Neural Information Processing Systems, 31, 2018

work page 2018

[41] [41]

and Leighton, T

Kleinberg, R. and Leighton, T. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pp.\ 594--605. IEEE, 2003

work page 2003

[42] [42]

and Zhang, T

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20 0 (1): 0 96--1, 2007

work page 2007

[43] [43]

Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016

work page 2016

[44] [44]

and Szepesv \'a ri, C

Lattimore, T. and Szepesv \'a ri, C. Bandit algorithms. Cambridge University Press, 2020

work page 2020

[45] [45]

Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp.\ 661--670, 2010

work page 2010

[46] [46]

G., Jain, L., et al

Li, Z., Ratliff, L., Jamieson, K. G., Jain, L., et al. Instance-optimal pac algorithms for contextual bandits. Advances in Neural Information Processing Systems, 35: 0 37590--37603, 2022

work page 2022

[47] [47]

Liu, Z., Attias, I., and Roy, D. M. Causal bandits: The pareto optimal frontier of adaptivity, a reduction to linear bandits, and limitations around unknown marginals. arXiv preprint arXiv:2407.00950, 2024

work page arXiv 2024

[48] [48]

Regret analysis of bandit problems with causal background knowledge

Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp.\ 141--150. PMLR, 2020

work page 2020

[49] [49]

Causal contextual bandits with adaptive context

Madhavan, R., Maiti, A., Sinha, G., and Barman, S. Causal contextual bandits with adaptive context. arXiv preprint arXiv:2405.18626, 2024

work page arXiv 2024

[50] [50]

and Tsitsiklis, J

Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5 0 (Jun): 0 623--648, 2004

work page 2004

[51] [51]

Finding all -good arms in stochastic bandits

Mason, B., Jain, L., Tripathy, A., and Nowak, R. Finding all -good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33: 0 20707--20718, 2020

work page 2020

[52] [52]

Gradient Ascent for Active Exploration in Bandit Problems

M \'e nard, P. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[53] [53]

D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M

M \'e nard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp.\ 7599--7608. PMLR, 2021

work page 2021

[54] [54]

Mohammadi Zaki, A. M. and Gopalan, A. Improved pure exploration in linear bandits with no-regret learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp.\ 3709--3715, 2022

work page 2022

[55] [55]

Sample complexity reduction via policy difference estimation in tabular reinforcement learning

Narang, A., Wagenmaker, A., Ratliff, L., and Jamieson, K. Sample complexity reduction via policy difference estimation in tabular reinforcement learning. arXiv preprint arXiv:2406.06856, 2024

work page arXiv 2024

[56] [56]

M., and Restelli, M

Poiani, R., Metelli, A. M., and Restelli, M. Pure exploration under mediators' feedback. arXiv preprint arXiv:2308.15552, 2023

work page arXiv 2023

[57] [57]

S., Karthik, P., Karamchandani, N., and Nair, J

Reddy, K. S., Karthik, P., Karamchandani, N., and Nair, J. Best arm identification in bandits with limited precision sampling. In 2023 IEEE International Symposium on Information Theory (ISIT), pp.\ 1466--1471. IEEE, 2023

work page 2023

[58] [58]

Best-arm identification in linear bandits

Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014

work page 2014

[59] [59]

Stephens, C. J. Pure exploration in multi-armed bandits. 2023

work page 2023

[60] [60]

A bad arm existence checking problem: How to utilize asymmetric problem structure? Machine learning, 109 0 (2): 0 327--372, 2020

Tabata, K., Nakamura, A., Honda, J., and Komatsuzaki, T. A bad arm existence checking problem: How to utilize asymmetric problem structure? Machine learning, 109 0 (2): 0 327--372, 2020

work page 2020

[61] [61]

Pure exploration for constrained best mixed arm identification with a fixed budget

Tang, D., Jain, R., Nayyar, A., and Nuzzo, P. Pure exploration for constrained best mixed arm identification with a fixed budget. arXiv preprint arXiv:2405.15090, 2024

work page arXiv 2024

[62] [62]

and Murphy, S

Tewari, A. and Murphy, S. A. From ads to interventions: Contextual bandits in mobile health. Mobile health: sensors, analytic methods, and applications, pp.\ 495--517, 2017

work page 2017

[63] [63]

Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25: 0 285--294, 1933

work page 1933

[64] [64]

Causal bandits for linear structural equation models

Varici, B., Shanmugam, K., Sattigeri, P., and Tajer, A. Causal bandits for linear structural equation models. Journal of Machine Learning Research, 24 0 (297): 0 1--59, 2023

work page 2023

[65] [65]

J., Simchowitz, M., and Jamieson, K

Wagenmaker, A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent pac reinforcement learning. In Conference on Learning Theory, pp.\ 358--418. PMLR, 2022

work page 2022

[66] [66]

Fairness of exposure in stochastic bandits

Wang, L., Bai, Y., Sun, W., and Joachims, T. Fairness of exposure in stochastic bandits. In International Conference on Machine Learning, pp.\ 10686--10696. PMLR, 2021 a

work page 2021

[67] [67]

Fast pure exploration via frank-wolfe

Wang, P.-A., Tzeng, R.-C., and Proutiere, A. Fast pure exploration via frank-wolfe. Advances in Neural Information Processing Systems, 34: 0 5810--5821, 2021 b

work page 2021

[68] [68]

and Chen, W

Xiong, N. and Chen, W. Combinatorial pure exploration of causal bandits. arXiv preprint arXiv:2206.07883, 2022

work page arXiv 2022

[69] [69]

A fully adaptive algorithm for pure exploration in linear bandits

Xu, L., Honda, J., and Sugiyama, M. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pp.\ 843--851. PMLR, 2018

work page 2018

[70] [70]

Robust causal bandits for linear models

Yan, Z., Mukherjee, A., Var c , B., and Tajer, A. Robust causal bandits for linear models. IEEE Journal on Selected Areas in Information Theory, 2024

work page 2024