A Direct Approach for Handling Contextual Bandits with Latent State Dynamics
Pith reviewed 2026-05-10 16:57 UTC · model grok-4.3
The pith
Contextual bandits with hidden Markov states admit high-probability regret bounds via direct online estimation of the chain parameters.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We study the finite-armed linear bandit model with contexts and rewards governed by a finite hidden Markov chain, incorporating direct dependencies on the hidden states. We obtain stronger, high-probability regret bounds for a fully adaptive strategy that estimates HMM parameters online. These bounds do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters.
What carries the argument
The fully adaptive strategy that jointly estimates the HMM transition and emission parameters online while selecting arms on the basis of the current posterior over hidden states.
If this is right
- Regret scales with the accuracy of online HMM parameter estimation rather than with reward gaps or other model-specific quantities.
- The same algorithm and analysis deliver high-probability guarantees instead of only expectation bounds.
- The approach applies directly to settings in which rewards depend on the latent state itself, not merely on its posterior.
- Estimation error of the chain becomes the sole model-dependent term controlling the regret.
Where Pith is reading between the lines
- Similar direct estimation ideas may apply to other latent-variable sequential decision problems whose parameters admit online consistent estimators.
- In practice the method could be tested on recommendation or user-state tracking tasks where hidden dynamics are Markovian but rewards are observed directly from the state.
- If faster or distribution-free HMM estimators become available, the same regret template would immediately improve without changing the decision rule.
Load-bearing premise
The hidden Markov model parameters can be estimated online at a rate sufficient for the regret bounds to hold, with the joint estimation and decision process not introducing new dependencies on reward gaps.
What would settle it
A concrete instance of the model in which the online HMM estimator converges slower than the rate required by the regret analysis, causing observed cumulative regret to violate the stated high-probability bound.
read the original abstract
We revisit the finite-armed linear bandit model by Nelson et al. (2022), where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model by a reduction to linear contextual bandits; but to do so, they actually introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (but not their algorithm) also does not take into account the estimation of the HMM parameters, and only tackles expected, not high-probability, bounds, which suffer in addition from unnecessary complex dependencies on the model (like reward gaps). We instead study the more natural model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also obtain stronger, high-probability, regret bounds for a fully adaptive strategy that estimates HMM parameters online. These bounds do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper studies contextual bandits where both contexts and rewards are generated by a finite-state hidden Markov model with direct dependencies on the latent states (beyond observed contexts). It proposes a fully adaptive online algorithm that jointly estimates the HMM parameters while selecting actions, and claims high-probability regret bounds that depend on the model only through the rate of HMM parameter estimation, with no dependence on the reward functions or reward gaps (in contrast to the reduction-based approach of Nelson et al. (2022), which uses posterior probabilities and yields weaker expected regret bounds).
Significance. If the claimed decoupling holds, the result would be a meaningful advance: it supplies stronger high-probability guarantees for a more natural latent-state model while remaining fully adaptive and online. The independence from reward functions (if rigorously established) removes a common source of looseness in bandit analyses and could inform algorithm design in partially observable settings. No machine-checked proofs or reproducible code are referenced.
major comments (2)
- [Proof of the main regret bound (likely §4 or §5)] The central claim (Abstract) that the high-probability regret bounds 'do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters' is load-bearing. Because rewards are part of the HMM emissions, any concentration inequality for the emission-parameter estimates is necessarily reward-dependent; the analysis must therefore demonstrate a uniform high-probability bound on the estimation error that remains independent of realized rewards and of the adaptive filtration induced by the policy. Please supply the precise statement of the relevant concentration result (e.g., the theorem or lemma establishing the HMM estimation rate) and the argument that it holds uniformly without introducing gap-dependent terms or stopping-time probabilities that scale with reward gaps.
- [Adaptive estimation and regret decomposition] The manuscript must clarify how the adaptive policy (which uses current HMM estimates to choose actions) feeds back into the HMM estimation process without re-introducing reward dependence. If the analysis relies on a union bound or a martingale argument whose failure probability implicitly depends on the reward realizations, the claimed independence fails. A concrete counter-example or a worked calculation showing the bound remains gap-free would strengthen the result.
minor comments (2)
- [Introduction and Preliminaries] Notation for the HMM transition and emission matrices should be introduced once and used consistently; currently the abstract and introduction use slightly different symbols for the same quantities.
- [Related Work] The comparison with Nelson et al. (2022) would benefit from an explicit table contrasting the model assumptions, the form of the regret bound, and whether the bound is high-probability or in expectation.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive comments. The concerns about the uniformity of the concentration bounds and the feedback from the adaptive policy are well-taken, and we will revise the manuscript to provide the requested clarifications and expanded arguments.
read point-by-point responses
-
Referee: The central claim (Abstract) that the high-probability regret bounds 'do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters' is load-bearing. Because rewards are part of the HMM emissions, any concentration inequality for the emission-parameter estimates is necessarily reward-dependent; the analysis must therefore demonstrate a uniform high-probability bound on the estimation error that remains independent of realized rewards and of the adaptive filtration induced by the policy. Please supply the precise statement of the relevant concentration result (e.g., the theorem or lemma establishing the HMM estimation rate) and the argument that it holds uniformly without introducing gap-dependent terms or stopping-time probabilities that scale with reward gaps.
Authors: We agree this is the load-bearing claim and thank the referee for requiring a precise statement. Lemma 3.2 gives the uniform high-probability bound on the HMM parameter estimation error: with probability at least 1-δ, the total variation distance between the estimated and true emission/transition matrices is O(√(S²A log(T/δ)/n) + 1/n), where n is the number of observations up to time T. The proof applies Freedman's inequality to the vector-valued martingale of observed emissions (contexts, actions, and bounded rewards in [0,1]). Because the variance proxy is bounded by a constant independent of the reward function, no gap-dependent terms appear. The filtration is generated by the observed history; the adaptive policy affects the sampling distribution but does not enter the concentration bound itself, which is uniform over all policies. We will add an expanded proof of Lemma 3.2 in the revision together with a remark explicitly ruling out stopping-time dependence on reward gaps. revision: yes
-
Referee: The manuscript must clarify how the adaptive policy (which uses current HMM estimates to choose actions) feeds back into the HMM estimation process without re-introducing reward dependence. If the analysis relies on a union bound or a martingale argument whose failure probability implicitly depends on the reward realizations, the claimed independence fails. A concrete counter-example or a worked calculation showing the bound remains gap-free would strengthen the result.
Authors: Section 4.3 decomposes instantaneous regret into an estimation-error term plus an exploration term that depends only on the current HMM estimate. The online HMM estimator (Algorithm 2) updates on every observed (context, action, reward) triple; because rewards remain bounded, the martingale differences in Lemma 3.2 stay uniformly bounded regardless of the policy's action choices. We use an epoch-doubling schedule so that the number of samples per epoch is deterministic given the epoch index, removing any implicit dependence on reward realizations in the union bound. In the revision we will insert a short worked calculation (new Appendix C) for a two-state HMM showing that the probability of the bad event remains O(1/T) uniformly, with no scaling by reward gaps. This confirms the claimed independence. revision: partial
Circularity Check
No circularity; regret bounds derived independently via online HMM estimation
full rationale
The paper's central derivation establishes high-probability regret bounds for a fully adaptive online strategy that estimates HMM parameters jointly with decision-making. These bounds are explicitly stated to depend on the model only through HMM estimation rates and to be free of reward-function or gap dependencies. No equations or steps reduce the claimed result to a fitted input by construction, nor does the argument rest on self-citations whose validity is presupposed. The analysis is self-contained against external benchmarks such as standard concentration inequalities for adaptive processes, with the independence from rewards following from the separation of the HMM estimation error control from the policy's reward-dependent choices.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Contexts and rewards are generated by a finite-state hidden Markov chain with direct dependence on the hidden states.
Reference graph
Works this paper leans on
-
[1]
Observes the context xt, drawn independently by the environment from νht
-
[2]
Obtains the belief estimate ˆbt by feeding x1,..., xt to the subroutine B
-
[3]
Computes the estimated mean rewards ˆrt(a) = ∑ h∈ [H] ˆbt(h) ϕ(a, xt)Tˆθt− 1,h for alla ∈ A
-
[4]
Picks an action at ∈ argmax a∈A { ˆrt(a) +εt,a }
-
[5]
Obtains and observes the reward r′ t(at) = ∑ h∈ [H] bt(h) ϕ(at, xt)Tθ⋆ h +ηt(at)
-
[6]
Computes ˆθt as in Equation (11). Analysis. At a high level, the analysis adapts the LinUCB proof to handle the substitution of the true contextsbt⊗ ϕ(a, xt) by estimations thereof. We follow closely classic analyses of LinUCB (the original reference by Abbasi-Y adkori et al., 2011, the monograph by Lattimore & Szepesv´ ari, 2020, Chapters 19 and 20, as w...
work page 2011
-
[7]
up to some permutation ρ of [H]
We restate the algorithm of Anandkumar et al. (2012, Section 4.2) in Appendix C.4. C.1. Proof of Lemma 4.3: Belief-Estimation Error We explain how the belief-estimation error bound from Lemma 4.3 follows from the application of two known results on HMM parameter estimation (one for the estimation of the para meters themselves, one for the guarantees induc...
work page 2012
-
[8]
26 A Direct Approach for Handling Contextual Bandits with Late nt State Dynamics C.4
Similarly as above, this rewriting entails σmin(A4) ⩾σmin(A3)σmin ( diag(π ) ) σmin ( AT 1 ) ⩾ ( σmin(E)σmin(M )εM )2 =σ 2. 26 A Direct Approach for Handling Contextual Bandits with Late nt State Dynamics C.4. Reminder on the Spectral Method for MHH Parameter Estim ation Finally, for the sake of self-completedness, we recall the s pectral method for HMM p...
work page 2012
-
[9]
Compute the empirical moments ˜P (t) 3, 1 = 1 t − 2 t− 1∑ s=2 xs+1 ⊗ xs− 1, ˜P (t) 3, 2 = 1 t − 2 t− 1∑ s=2 xs+1 ⊗ xs, ˜P (t) 3, 1, 2 = 1 t − 2 t− 1∑ s=2 xs+1 ⊗ xs− 1 ⊗ xs
-
[10]
Compute ˜U3, ˜U1 ∈ RX× H , the matrices whose columns are, respectively, the left and right singular vectors of ˜P (t) 3, 1 associated with its largest H singular values. Compute ˜U2 ∈ RX× H , the matrix whose columns are the right singular vectors of ˜P (t) 3, 2 associated with its largestH singular values
-
[11]
Define the contraction of the third-order tensor ˜P (t) 3, 1, 2(z) along its third mode for each vector z ∈ RX by [ ˜P (t) 3, 1, 2(z) ] i,j := X∑ k=1 [ ˜P (t) 3, 1, 2 ] i,j,k zk, and let ˜B(t) 3, 1, 2(z) = ( ˜UT 3 ˜P (t) 3, 1, 2(z) ˜U1 )( ˜UT 3 ˜P (t) 3, 1 ˜U1 )− 1 ; then, compute a matrix ˜R ∈ RH× H , with columns of unit Euclidean norm, such that ˜R− 1 ˜...
-
[12]
For each h ∈ [H], using the matrix ˜R computed in Step 3, define ˜λ h, 1, ˜λ h, 2,..., ˜λ h,H as the diagonal entries of ˜R− 1 ˜B(t) 3, 1, 2( ˜U2Γh, ·) ˜R and form the matrix ˜L ∈ RH× H with entries ˜Lh,j := ˜λ h,j for allh,j ∈ [H]
-
[13]
Define ˜Ot = ˜U2Γ− 1 ˜L and output ˜Et = ˜OT t and ˜M t = ((˜UT 3 ˜Ot )− 1 ˜R )T . 28 A Direct Approach for Handling Contextual Bandits with Late nt State Dynamics D. HMM Forgetting Properties and Related Reminders This appendix justifies the exponentially fast forgetting o f the initial distribution of the HMM stated in Assumption 5.1. Assumption 5.1 (expo...
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.