Adaptive Exploration for Latent-State Bandits
read the original abstract
We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner's actions. The optimal arm can change even though the learner observes only past actions and rewards. We propose algorithms that feed LinUCB with two summaries of the hidden state: a lagged action-reward pair and, when available, a probe fingerprint formed from rewards of multiple arms. The adaptive variants refresh the fingerprint using residual, margin, and staleness tests. In synthetic stress tests over state count, transition rate, noise, and horizon, these methods reduce dynamic regret relative to standard, adversarial, and non-stationary bandit baselines when the summaries distinguish states and are updated often enough. Ablations and misspecification tests identify the main failure modes: weak fingerprint separation, high noise, and state changes during sequential probes.
This paper has not been read by Pith yet.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.