Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context
Pith reviewed 2026-05-23 04:18 UTC · model grok-4.3
The pith
Best arm identification with post-action context admits asymptotically optimal algorithms in both separator and non-separator models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the fixed-confidence pure-exploration setting, instance-dependent lower bounds on sample complexity hold for both the separator and non-separator post-action context models; the G-tracking algorithm achieves the separator bound by tracking context probabilities via the geometry of the context space, while an extension of Track-and-Stop achieves the non-separator bound.
What carries the argument
The separator versus non-separator distinction for post-action context, together with the G-tracking rule that directly tracks contexts rather than actions.
If this is right
- Methods that discard post-action context are provably suboptimal in sample complexity for both models.
- G-tracking attains the separator lower bound by exploiting the geometry of the context space.
- Track-and-Stop extends directly to the non-separator model while preserving asymptotic optimality.
- The derived lower bounds are tight, so no algorithm can improve the leading term of the sample complexity.
Where Pith is reading between the lines
- The geometry-tracking idea may extend to other pure-exploration tasks whose observation space has exploitable structure beyond scalar rewards.
- If context spaces are continuous rather than discrete, discretization or kernel methods could be needed to retain the same asymptotic guarantees.
- In applications where context is cheap to obtain but actions are costly, the sample-complexity savings translate directly into fewer expensive actions.
Load-bearing premise
After every action the learner always receives the post-action context, and the reward is generated exactly according to either the separator or the non-separator dependence on that context.
What would settle it
An instance of the separator or non-separator model in which any algorithm that ignores the post-action context matches the sample complexity of G-tracking or extended Track-and-Stop up to lower-order terms.
Figures
read the original abstract
We introduce the problem of best arm identification (BAI) with post-action context, a new BAI problem in a stochastic multi-armed bandit environment and the fixed-confidence setting. The problem addresses the scenarios in which the learner receives a post-action context in addition to the reward after playing each action. This post-action context provides additional information that can significantly facilitate the decision process. We analyze two different types of the post-action context: (i) separator, where the reward depends solely on the context, and (ii) non-separator, where the reward depends on both the action and the context. For both cases, we derive instance-dependent lower bounds on the sample complexity and propose algorithms that asymptotically achieve the optimal sample complexity. For the separator setting, we propose a novel sampling rule called G-tracking, which uses the geometry of the context space to directly track the contexts rather than the actions. For the non-separator setting, we do so by demonstrating that the Track-and-Stop algorithm can be extended to this setting. Moreover, in both settings, we theoretically and empirically show that algorithms that ignore the post-action context are sub-optimal. Finally, our empirical results showcase the advantage of our approaches compared to the state of the art.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces best arm identification (BAI) with post-action context in stochastic multi-armed bandits under fixed confidence. It distinguishes separator contexts (reward depends only on context) from non-separator contexts (reward depends on both action and context). For both settings, instance-dependent lower bounds on sample complexity are derived, and algorithms are proposed that asymptotically match these bounds: G-tracking (which tracks contexts geometrically) for the separator case, and an extension of Track-and-Stop for the non-separator case. The work also shows that algorithms ignoring post-action context are sub-optimal, both theoretically and empirically.
Significance. If the lower-bound derivations and asymptotic optimality proofs hold, the paper meaningfully extends pure exploration theory to richer feedback models beyond rewards alone. The separator/non-separator distinction and the geometry-aware G-tracking rule are technically interesting contributions. Credit is due for providing matching upper and lower bounds rather than only algorithmic heuristics, and for the empirical demonstration that context-ignoring baselines are provably sub-optimal.
minor comments (2)
- In the problem formulation, the precise measurability assumptions on the context space and the support of the context distribution should be stated explicitly to make the lower-bound constructions fully rigorous.
- The empirical section would benefit from reporting the number of independent runs and confidence intervals on the sample-complexity plots to allow direct comparison with the theoretical rates.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report does not raise any specific major comments, so we have no individual points to rebut. We will incorporate any minor suggestions during revision.
Circularity Check
No significant circularity detected
full rationale
The paper defines a new BAI problem variant with post-action context under explicit separator and non-separator models, derives instance-dependent lower bounds from those models, and shows asymptotic matching via G-tracking (new) and Track-and-Stop extension (standard). No step reduces a claimed prediction or optimality result to a fitted quantity from the same data, a self-citation chain, or a definitional tautology; the modeling assumptions are stated upfront and used consistently in the bounds and analyses without circular reduction.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 1 Pith paper
-
Active Context Selection Improves Simple Regret in Contextual Bandits
Active sampling with allocation q_j proportional to p_j to the 2/3 achieves tight regret sqrt(n/T) times norm of p to the 2/3 for known context distribution p, with improvement up to Theta(k to the 1/4) over passive sampling.
Reference graph
Works this paper leans on
-
[1]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...
-
[2]
Improved algorithms for linear stochastic bandits
Abbasi-Yadkori, Y., P \'a l, D., and Szepesv \'a ri, C. Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24, 2011
work page 2011
-
[3]
A rewriting system for convex optimization problems
Agrawal, A., Verschueren, R., Diamond, S., and Boyd, S. A rewriting system for convex optimization problems. Journal of Control and Decision, 5 0 (1): 0 42--60, 2018
work page 2018
-
[4]
Al Marjani, A. and Proutiere, A. Adaptive sampling for best policy identification in markov decision processes. In International Conference on Machine Learning, pp.\ 7459--7468. PMLR, 2021
work page 2021
-
[5]
Navigating to the best policy in markov decision processes
Al Marjani, A., Garivier, A., and Proutiere, A. Navigating to the best policy in markov decision processes. Advances in Neural Information Processing Systems, 34: 0 25852--25864, 2021
work page 2021
-
[6]
arXiv preprint arXiv:2311.05638 , year=
Al-Marjani, A., Tirinzoni, A., and Kaufmann, E. Towards instance-optimality in online pac reinforcement learning. arXiv preprint arXiv:2311.05638, 2023
-
[7]
Audibert, J.-Y. and Bubeck, S. Best arm identification in multi-armed bandits. In COLT-23th Conference on learning theory-2010, pp.\ 13--p, 2010
work page 2010
-
[8]
Using confidence bounds for exploitation-exploration trade-offs
Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3 0 (Nov): 0 397--422, 2002
work page 2002
-
[9]
A non-asymptotic approach to best-arm identification for gaussian bandits
Barrier, A., Garivier, A., and Koc \'a k, T. A non-asymptotic approach to best-arm identification for gaussian bandits. In International Conference on Artificial Intelligence and Statistics, pp.\ 10078--10109. PMLR, 2022
work page 2022
-
[10]
Topological spaces: including a triatment of mltivalued functions, vector spaces and convexity
Berge, C. Topological spaces: including a triatment of mltivalued functions, vector spaces and convexity. Oliver and Boyd, 1963
work page 1963
-
[11]
Besbes, O. and Zeevi, A. Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms. Operations research, 57 0 (6): 0 1407--1420, 2009
work page 2009
-
[12]
Adaptively exploiting d-separators with causal bandits
Bilodeau, B., Wang, L., and Roy, D. Adaptively exploiting d-separators with causal bandits. Advances in Neural Information Processing Systems, 35: 0 20381--20392, 2022
work page 2022
-
[13]
Camilleri, R., Wagenmaker, A., Morgenstern, J. H., Jain, L., and Jamieson, K. G. Active learning with safety constraints. Advances in Neural Information Processing Systems, 35: 0 33201--33214, 2022
work page 2022
-
[14]
Pure exploration in bandits with linear constraints
Carlsson, E., Basu, D., Johansson, F., and Dubhashi, D. Pure exploration in bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pp.\ 334--342. PMLR, 2024
work page 2024
-
[15]
Carpentier, A. and Locatelli, A. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp.\ 590--604. PMLR, 2016
work page 2016
-
[16]
Nearly optimal sampling algorithms for combinatorial pure exploration
Chen, L., Gupta, A., Li, J., Qiao, M., and Wang, R. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp.\ 482--534. PMLR, 2017
work page 2017
-
[17]
Degenne, R. and Koolen, W. M. Pure exploration with multiple correct answers. Advances in Neural Information Processing Systems, 32, 2019
work page 2019
-
[18]
Gamification of pure exploration for linear bandits
Degenne, R., M \'e nard, P., Shang, X., and Valko, M. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp.\ 2432--2442. PMLR, 2020
work page 2020
-
[19]
Diamond, S. and Boyd, S. CVXPY : A P ython-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17 0 (83): 0 1--5, 2016
work page 2016
-
[20]
Eldowa, K., Cesa-Bianchi, N., Metelli, A. M., and Restelli, M. Information capacity regret bounds for bandits with mediator feedback. arXiv preprint arXiv:2402.10282, 2024
- [21]
-
[22]
Fiez, T., Jain, L., Jamieson, K. G., and Ratliff, L. Sequential experimental design for transductive linear bandits. Advances in neural information processing systems, 32, 2019
work page 2019
-
[23]
Best arm identification: A unified approach to fixed budget and fixed confidence
Gabillon, V., Ghavamzadeh, M., and Lazaric, A. Best arm identification: A unified approach to fixed budget and fixed confidence. Advances in Neural Information Processing Systems, 25, 2012
work page 2012
-
[24]
Gai, Y., Krishnamachari, B., and Jain, R. Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations. IEEE/ACM Transactions on Networking, 20 0 (5): 0 1466--1478, 2012
work page 2012
-
[25]
Garivier, A. and Capp \'e , O. The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual conference on learning theory, pp.\ 359--376. JMLR Workshop and Conference Proceedings, 2011
work page 2011
-
[26]
Garivier, A. and Kaufmann, E. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.\ 998--1027. PMLR, 2016
work page 2016
-
[27]
Achieving counterfactual fairness for causal bandit
Huang, W., Zhang, L., and Wu, X. Achieving counterfactual fairness for causal bandit. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pp.\ 6952--6959, 2022
work page 2022
-
[28]
Revisiting Frank-Wolfe : Projection-free sparse convex optimization
Jaggi, M. Revisiting Frank-Wolfe : Projection-free sparse convex optimization. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp.\ 427--435, Atlanta, Georgia, USA, 17--19 Jun 2013. PMLR. URL https://proceedings.mlr.press/v28/jaggi13.html
work page 2013
-
[29]
Jamieson, K. and Nowak, R. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th annual conference on information sciences and systems (CISS), pp.\ 1--6. IEEE, 2014
work page 2014
-
[30]
Confounded budgeted causal bandits
Jamshidi, F., Etesami, J., and Kiyavash, N. Confounded budgeted causal bandits. In Causal Learning and Reasoning, pp.\ 423--461. PMLR, 2024
work page 2024
-
[31]
Jedra, Y. and Proutiere, A. Optimal best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 33: 0 10007--10017, 2020
work page 2020
-
[32]
Juneja, S. and Krishnasamy, S. Sample complexity of partition identification using multi-armed bandits. In Conference on Learning Theory, pp.\ 1824--1852. PMLR, 2019
work page 2019
-
[33]
Pac subset selection in stochastic multi-armed bandits
Kalyanakrishnan, S., Tewari, A., Auer, P., and Stone, P. Pac subset selection in stochastic multi-armed bandits. In ICML, volume 12, pp.\ 655--662, 2012
work page 2012
-
[34]
Almost optimal exploration in multi-armed bandits
Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In International conference on machine learning, pp.\ 1238--1246. PMLR, 2013
work page 2013
-
[35]
Kato, M. and Ariu, K. The role of contextual information in best arm identification. arXiv preprint arXiv:2106.14077, 2021
-
[36]
Contributions to the Optimal Solution of Several Bandit Problems
Kaufmann, E. Contributions to the Optimal Solution of Several Bandit Problems. PhD thesis, Universit \'e de Lille, 2020
work page 2020
-
[37]
Kaufmann, E. and Kalyanakrishnan, S. Information complexity in bandit subset selection. In Conference on Learning Theory, pp.\ 228--251. PMLR, 2013
work page 2013
-
[38]
Kaufmann, E. and Koolen, W. M. Mixture martingales revisited with applications to sequential tests and confidence intervals. Journal of Machine Learning Research, 22 0 (246): 0 1--44, 2021
work page 2021
-
[39]
On the complexity of best-arm identification in multi-armed bandit models
Kaufmann, E., Capp \'e , O., and Garivier, A. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17 0 (1): 0 1--42, 2016
work page 2016
-
[40]
Kaufmann, E., Koolen, W. M., and Garivier, A. Sequential test for the lowest mean: From thompson to murphy sampling. Advances in Neural Information Processing Systems, 31, 2018
work page 2018
-
[41]
Kleinberg, R. and Leighton, T. The value of knowing a demand curve: Bounds on regret for online posted-price auctions. In 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings., pp.\ 594--605. IEEE, 2003
work page 2003
-
[42]
Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20 0 (1): 0 96--1, 2007
work page 2007
-
[43]
Lattimore, F., Lattimore, T., and Reid, M. D. Causal bandits: Learning good interventions via causal inference. Advances in neural information processing systems, 29, 2016
work page 2016
-
[44]
Lattimore, T. and Szepesv \'a ri, C. Bandit algorithms. Cambridge University Press, 2020
work page 2020
-
[45]
Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp.\ 661--670, 2010
work page 2010
-
[46]
Li, Z., Ratliff, L., Jamieson, K. G., Jain, L., et al. Instance-optimal pac algorithms for contextual bandits. Advances in Neural Information Processing Systems, 35: 0 37590--37603, 2022
work page 2022
- [47]
-
[48]
Regret analysis of bandit problems with causal background knowledge
Lu, Y., Meisami, A., Tewari, A., and Yan, W. Regret analysis of bandit problems with causal background knowledge. In Conference on Uncertainty in Artificial Intelligence, pp.\ 141--150. PMLR, 2020
work page 2020
-
[49]
Causal contextual bandits with adaptive context
Madhavan, R., Maiti, A., Sinha, G., and Barman, S. Causal contextual bandits with adaptive context. arXiv preprint arXiv:2405.18626, 2024
-
[50]
Mannor, S. and Tsitsiklis, J. N. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5 0 (Jun): 0 623--648, 2004
work page 2004
-
[51]
Finding all -good arms in stochastic bandits
Mason, B., Jain, L., Tripathy, A., and Nowak, R. Finding all -good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33: 0 20707--20718, 2020
work page 2020
-
[52]
Gradient Ascent for Active Exploration in Bandit Problems
M \'e nard, P. Gradient ascent for active exploration in bandit problems. arXiv preprint arXiv:1905.08165, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[53]
D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M
M \'e nard, P., Domingues, O. D., Jonsson, A., Kaufmann, E., Leurent, E., and Valko, M. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp.\ 7599--7608. PMLR, 2021
work page 2021
-
[54]
Mohammadi Zaki, A. M. and Gopalan, A. Improved pure exploration in linear bandits with no-regret learning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp.\ 3709--3715, 2022
work page 2022
-
[55]
Sample complexity reduction via policy difference estimation in tabular reinforcement learning
Narang, A., Wagenmaker, A., Ratliff, L., and Jamieson, K. Sample complexity reduction via policy difference estimation in tabular reinforcement learning. arXiv preprint arXiv:2406.06856, 2024
-
[56]
Poiani, R., Metelli, A. M., and Restelli, M. Pure exploration under mediators' feedback. arXiv preprint arXiv:2308.15552, 2023
-
[57]
S., Karthik, P., Karamchandani, N., and Nair, J
Reddy, K. S., Karthik, P., Karamchandani, N., and Nair, J. Best arm identification in bandits with limited precision sampling. In 2023 IEEE International Symposium on Information Theory (ISIT), pp.\ 1466--1471. IEEE, 2023
work page 2023
-
[58]
Best-arm identification in linear bandits
Soare, M., Lazaric, A., and Munos, R. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014
work page 2014
-
[59]
Stephens, C. J. Pure exploration in multi-armed bandits. 2023
work page 2023
-
[60]
Tabata, K., Nakamura, A., Honda, J., and Komatsuzaki, T. A bad arm existence checking problem: How to utilize asymmetric problem structure? Machine learning, 109 0 (2): 0 327--372, 2020
work page 2020
-
[61]
Pure exploration for constrained best mixed arm identification with a fixed budget
Tang, D., Jain, R., Nayyar, A., and Nuzzo, P. Pure exploration for constrained best mixed arm identification with a fixed budget. arXiv preprint arXiv:2405.15090, 2024
-
[62]
Tewari, A. and Murphy, S. A. From ads to interventions: Contextual bandits in mobile health. Mobile health: sensors, analytic methods, and applications, pp.\ 495--517, 2017
work page 2017
-
[63]
Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25: 0 285--294, 1933
work page 1933
-
[64]
Causal bandits for linear structural equation models
Varici, B., Shanmugam, K., Sattigeri, P., and Tajer, A. Causal bandits for linear structural equation models. Journal of Machine Learning Research, 24 0 (297): 0 1--59, 2023
work page 2023
-
[65]
J., Simchowitz, M., and Jamieson, K
Wagenmaker, A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent pac reinforcement learning. In Conference on Learning Theory, pp.\ 358--418. PMLR, 2022
work page 2022
-
[66]
Fairness of exposure in stochastic bandits
Wang, L., Bai, Y., Sun, W., and Joachims, T. Fairness of exposure in stochastic bandits. In International Conference on Machine Learning, pp.\ 10686--10696. PMLR, 2021 a
work page 2021
-
[67]
Fast pure exploration via frank-wolfe
Wang, P.-A., Tzeng, R.-C., and Proutiere, A. Fast pure exploration via frank-wolfe. Advances in Neural Information Processing Systems, 34: 0 5810--5821, 2021 b
work page 2021
-
[68]
Xiong, N. and Chen, W. Combinatorial pure exploration of causal bandits. arXiv preprint arXiv:2206.07883, 2022
-
[69]
A fully adaptive algorithm for pure exploration in linear bandits
Xu, L., Honda, J., and Sugiyama, M. A fully adaptive algorithm for pure exploration in linear bandits. In International Conference on Artificial Intelligence and Statistics, pp.\ 843--851. PMLR, 2018
work page 2018
-
[70]
Robust causal bandits for linear models
Yan, Z., Mukherjee, A., Var c , B., and Tajer, A. Robust causal bandits for linear models. IEEE Journal on Selected Areas in Information Theory, 2024
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.