Constrained pure exploration multi-armed bandits with a fixed budget. arXiv preprint arXiv:2211.14768.

Faizal, F · 2022 · arXiv 2211.14768

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.

Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context

cs.LG · 2025-02-05 · unverdicted · novelty 6.0

Introduces BAI with post-action context in fixed-confidence stochastic bandits, derives instance-dependent lower bounds, and gives asymptotically optimal algorithms for separator and non-separator cases.

Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes

cs.LG · 2026-05-05 · unverdicted · novelty 5.0

A new posterior sampling algorithm for (ε, δ)-PAC policy identification in tabular MDPs achieves asymptotic optimality in sample complexity and posterior contraction rate with O(S²AH) runtime per episode.

citing papers explorer

Showing 3 of 3 citing papers.

Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
cs.LG · 2026-05-14 · unverdicted · none · ref 8
Pure Exploration Beyond Reward Feedback: The Role of Post-Action Context
cs.LG · 2025-02-05 · unverdicted · none · ref 21
Optimal Posterior Sampling for Policy Identification in Tabular Markov Decision Processes
cs.LG · 2026-05-05 · unverdicted · none · ref 88
