
arxiv: 2502.03061 · v2 · submitted 2025-02-05 · 💻 cs.LG

Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context

Pith reviewed 2026-05-23 04:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords best arm identification · multi-armed bandits · post-action context · sample complexity · pure exploration · separator model · non-separator model

The pith

Best arm identification with post-action context admits asymptotically optimal algorithms in both separator and non-separator models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces best arm identification in stochastic multi-armed bandits where the learner receives post-action context in addition to the reward. It distinguishes two models: separator, in which reward depends only on context, and non-separator, in which reward depends on both action and context. For each model it derives instance-dependent lower bounds on the number of samples required to identify the best arm with fixed confidence, then constructs algorithms that match those bounds asymptotically. It further proves that any procedure ignoring the post-action context must use strictly more samples on some instances.
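The two feedback models can be made concrete with a small simulator. The following Python sketch is illustrative only: the function names, the finite context set, the unit-variance Gaussian rewards, and the `context_probs` table holding P(context | action) are assumptions made for this example, not details taken from the paper.

```python
import random


def pull_separator(action, context_probs, context_means, rng):
    """Separator model: the action only influences which context appears;
    the reward is then drawn from a distribution indexed by the context alone."""
    contexts = list(range(len(context_means)))
    z = rng.choices(contexts, weights=context_probs[action])[0]
    reward = rng.gauss(context_means[z], 1.0)  # unit variance is an assumption
    return z, reward


def pull_non_separator(action, context_probs, pair_means, rng):
    """Non-separator model: the reward depends on both the action and the context."""
    contexts = list(range(len(pair_means[action])))
    z = rng.choices(contexts, weights=context_probs[action])[0]
    reward = rng.gauss(pair_means[action][z], 1.0)  # unit variance is an assumption
    return z, reward


def arm_mean_separator(action, context_probs, context_means):
    """Expected reward of an arm: sum over z of P(z | action) * mu_z."""
    return sum(p * m for p, m in zip(context_probs[action], context_means))


def arm_mean_non_separator(action, context_probs, pair_means):
    """Expected reward of an arm: sum over z of P(z | action) * mu_{action, z}."""
    return sum(p * m for p, m in zip(context_probs[action], pair_means[action]))
```

In the separator sketch two arms differ only through their context distributions, whereas in the non-separator sketch the same context can carry a different mean reward under each action.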

Core claim

In the fixed-confidence pure-exploration setting, instance-dependent lower bounds on sample complexity hold for both the separator and non-separator post-action context models; the G-tracking algorithm achieves the separator bound by tracking context probabilities via the geometry of the context space, while an extension of Track-and-Stop achieves the non-separator bound.
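For orientation, the classical fixed-confidence lower bound for standard best arm identification (reward feedback only), due to Garivier and Kaufmann (2016), has the shape below. It is given as background only: the paper's bounds for the separator and non-separator models are not reproduced on this page and may differ in their alternative sets and divergence terms. The symbols $w$, $\Sigma_K$, $\mathrm{Alt}(\mu)$, and $d$ follow the usual convention of that literature and do not appear elsewhere on this page.

```latex
\mathbb{E}_{\mu}[\tau_\delta] \;\ge\; T^*(\mu)\,\mathrm{kl}(\delta, 1-\delta),
\qquad
T^*(\mu)^{-1} \;=\; \sup_{w \in \Sigma_K}\; \inf_{\lambda \in \mathrm{Alt}(\mu)}\; \sum_{a=1}^{K} w_a\, d(\mu_a, \lambda_a).
```

Here $\tau_\delta$ is the stopping time of a $\delta$-correct algorithm, $\Sigma_K$ is the probability simplex over the $K$ arms, $\mathrm{Alt}(\mu)$ is the set of instances whose best arm differs from that of $\mu$, and $d$ is the KL divergence between reward distributions. In this standard setting an algorithm is called asymptotically optimal when $\limsup_{\delta \to 0} \mathbb{E}_{\mu}[\tau_\delta] / \log(1/\delta) \le T^*(\mu)$.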

What carries the argument

The separator versus non-separator distinction for post-action context, together with the G-tracking rule that directly tracks contexts rather than actions.

If this is right

  • Methods that discard post-action context are provably suboptimal in sample complexity on some instances of both models.
  • G-tracking attains the separator lower bound by exploiting the geometry of the context space.
  • Track-and-Stop extends directly to the non-separator model while preserving asymptotic optimality.
  • The derived lower bounds are asymptotically tight, so no fixed-confidence algorithm can improve the leading term of the sample complexity.
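One way to see why discarding context can cost samples in the separator model is to compare two estimators of the arm means. The Python sketch below is an intuition aid, not the paper's G-tracking rule or its proof. It assumes a finite context set and a known table `context_probs` of P(context | action); both assumptions, and all function names, are introduced here for illustration.

```python
def naive_arm_means(history, n_arms):
    """Per-arm sample means computed from rewards only, ignoring contexts."""
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    for action, _context, reward in history:
        sums[action] += reward
        counts[action] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def context_aware_arm_means(history, context_probs, n_contexts):
    """Plug-in estimates sum_z P(z | a) * mu_hat_z for the separator model.

    Rewards are pooled by context across all arms, so a pull of one arm
    also sharpens the estimate of every arm that can reach that context.
    """
    sums = [0.0] * n_contexts
    counts = [0] * n_contexts
    for _action, context, reward in history:
        sums[context] += reward
        counts[context] += 1
    ctx_means = [s / c if c else None for s, c in zip(sums, counts)]

    estimates = []
    for probs in context_probs:
        needed = [z for z, p in enumerate(probs) if p > 0]
        if any(ctx_means[z] is None for z in needed):
            estimates.append(None)  # some reachable context has no data yet
        else:
            estimates.append(sum(probs[z] * ctx_means[z] for z in needed))
    return estimates
```

Each `history` entry is an `(action, context, reward)` triple. The naive estimator uses only the pulls of the arm in question, while the context-aware estimator reuses every observation of a context regardless of which arm produced it.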

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometry-tracking idea may extend to other pure-exploration tasks whose observation space has exploitable structure beyond scalar rewards.
  • If context spaces are continuous rather than discrete, discretization or kernel methods could be needed to retain the same asymptotic guarantees.
  • In applications where context is cheap to obtain but actions are costly, the sample-complexity savings translate directly into fewer expensive actions.

Load-bearing premise

After every action the learner always receives the post-action context, and the reward is generated exactly according to either the separator or the non-separator dependence on that context.

What would settle it

An instance of the separator or non-separator model in which any algorithm that ignores the post-action context matches the sample complexity of G-tracking or extended Track-and-Stop up to lower-order terms.

Figures

Figures reproduced from arXiv: 2502.03061 by Alireza Rezaeimoghadam, Amir Mohammad Abouei, Mohammad Shahverdikondori, Negar Kiyavash.

Figure 1: Two possible structures for the post-action context.
Figure 2: Illustration of G-tracking rule for an instance with
Figure 3: The results of different algorithms for an instance
Figure 4: Illustration of the relative positions of points in the proof of Lemma
Figure 5: Comparison of the L2 distance of the frequencies of pulled arms and the optimal frequency over time between two algorithms.
Figure 6: Comparison of the stopping times of different algorithms on randomly generated instances.
Figure 7: Comparison of the L2 distance of the frequencies of observed contexts and the optimal frequency over time among different algorithms.
Original abstract

We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) separator, where the reward depends solely on the context, and (ii) non-separator, where the reward depends on both the action and the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. Moreover, in both settings, we theoretically and empirically show that algorithms that ignore the post-action context are sub-optimal. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces best arm identification (BAI) with post-action context in stochastic multi-armed bandits under fixed confidence. It distinguishes separator contexts (reward depends only on context) from non-separator contexts (reward depends on both action and context). For both settings, instance-dependent lower bounds on sample complexity are derived, and algorithms are proposed that asymptotically match these bounds: G-tracking (which tracks contexts geometrically) for the separator case, and an extension of Track-and-Stop for the non-separator case. The work also shows that algorithms ignoring post-action context are sub-optimal, both theoretically and empirically.

Significance. If the lower-bound derivations and asymptotic optimality proofs hold, the paper meaningfully extends pure exploration theory to richer feedback models beyond rewards alone. The separator/non-separator distinction and the geometry-aware G-tracking rule are technically interesting contributions. Credit is due for providing matching upper and lower bounds rather than only algorithmic heuristics, and for showing both theoretically and empirically that context-ignoring baselines are sub-optimal.

minor comments (2)
  1. In the problem formulation, the precise measurability assumptions on the context space and the support of the context distribution should be stated explicitly to make the lower-bound constructions fully rigorous.
  2. The empirical section would benefit from reporting the number of independent runs and confidence intervals on the sample-complexity plots to allow direct comparison with the theoretical rates.
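The reporting asked for in the second comment could look like the following Python sketch, which summarizes stopping times from independent runs as a mean with a normal-approximation confidence interval. The function name `mean_with_ci` and the default `z=1.96` (roughly a 95% interval) are choices made for this example; with few runs a Student-t quantile would be more appropriate.

```python
import math
import statistics


def mean_with_ci(stopping_times, z=1.96):
    """Return (mean, lower, upper) for stopping times from independent runs.

    Uses the normal approximation mean +/- z * s / sqrt(n), where s is the
    sample standard deviation and n is the number of runs.
    """
    n = len(stopping_times)
    if n < 2:
        raise ValueError("need at least two independent runs")
    mean = statistics.fmean(stopping_times)
    sem = statistics.stdev(stopping_times) / math.sqrt(n)
    return mean, mean - z * sem, mean + z * sem
```

Reporting `n` alongside the interval lets a reader judge how much of the gap between algorithms is noise.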

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not raise any specific major comments, so we have no individual points to rebut. We will incorporate any minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines a new BAI problem variant with post-action context under explicit separator and non-separator models, derives instance-dependent lower bounds from those models, and shows asymptotic matching via G-tracking (new) and Track-and-Stop extension (standard). No step reduces a claimed prediction or optimality result to a fitted quantity from the same data, a self-citation chain, or a definitional tautology; the modeling assumptions are stated upfront and used consistently in the bounds and analyses without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are described in the provided text.

pith-pipeline@v0.9.0 · 5764 in / 1130 out tokens · 72812 ms · 2026-05-23T04:18:21.130866+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Active Context Selection Improves Simple Regret in Contextual Bandits

    cs.LG 2026-05 accept novelty 7.0

    Active sampling with allocation q_j proportional to p_j to the 2/3 achieves tight regret sqrt(n/T) times norm of p to the 2/3 for known context distribution p, with improvement up to Theta(k to the 1/4) over passive sampling.
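Reading "q_j proportional to p_j to the 2/3" as q_j ∝ p_j^(2/3), the allocation in that summary can be computed as in the Python sketch below. The function name and the input checks are assumptions made for illustration; nothing here is taken from the cited paper beyond the exponent.

```python
def two_thirds_allocation(p):
    """Normalize p_j ** (2/3) into a sampling distribution q over contexts."""
    if not p or any(pj < 0 for pj in p):
        raise ValueError("p must be a non-empty list of non-negative weights")
    weights = [pj ** (2.0 / 3.0) for pj in p]
    total = sum(weights)
    if total == 0:
        raise ValueError("p must have at least one positive entry")
    return [w / total for w in weights]
```

Because the exponent is below 1, the allocation flattens the context distribution: rare contexts receive a larger share of samples than their probability under p, and frequent ones a smaller share.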
