Thompson Sampling for Infinite-Horizon Discounted Decision Processes
Pith reviewed 2026-05-24 01:38 UTC · model grok-4.3
The pith
Thompson sampling achieves complete learning in infinite-horizon discounted MDPs with Borel spaces by making residual regret vanish exponentially.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a canonical probability space for these MDPs and decompose expected regret into expected finite-time regret, expected state regret, and expected residual regret. The probabilistic residual regret is defined as the conditional expected remaining loss given history. Under assumptions extending those used for finite state-action spaces, Thompson sampling makes the expected residual regret converge to zero exponentially fast. Under additional mild conditions ensuring the relevant limits exist, the probabilistic residual regret converges to zero almost surely, so Thompson sampling achieves complete learning.
What carries the argument
The decomposition of expected regret into finite-time, state, and residual components, together with the probabilistic residual regret as its sample-path counterpart.
If this is right
- Exponential decay of residual regret means the algorithm eventually incurs no additional loss relative to the optimal policy from any time onward.
- Almost-sure convergence of the probabilistic residual regret guarantees that complete learning holds on almost every sample path.
- The new regret decomposition supplies a way to separate irreversible state effects from ongoing policy suboptimality in infinite-horizon problems.
- Complete learning extends the notion of asymptotic optimality previously shown only for finite spaces to the continuous Borel case.
Where Pith is reading between the lines
- The same three-term decomposition could be used to study other posterior-sampling or bandit algorithms in continuous-state MDPs.
- In applied settings such as continuous inventory control, one could monitor the realized residual regret to test whether learning is occurring on observed trajectories.
- If the mild limit conditions hold in practice, the result suggests that Thompson sampling eventually behaves like the optimal policy even when states are drawn from uncountable spaces.
Load-bearing premise
The assumptions used for finite state and action spaces extend to the Borel setting in a manner that preserves the exponential convergence properties for residual regret.
What would settle it
A concrete Borel MDP satisfying the paper's extended assumptions in which the expected residual regret for Thompson sampling stays bounded away from zero for all large times.
Figures
read the original abstract
This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDPs with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling we introduce a general canonical probability space in this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret with respect to the optimal reward from the current period onward, disregarding the irreversible consequences of past decisions. We further disaggregate this term by introducing the probabilistic residual regret, a finer, sample-path version of (iii) that captures the remaining loss in future performance from the current period onward, conditional on the observed history. Its expectation coincides with (iii). We then focus on Thompson sampling (TS); under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast. We further show that, under mild conditions ensuring the existence of the relevant limits, its probabilistic counterpart converges to zero almost surely and TS achieves complete learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a notion of learning for sampling-based algorithms in discounted infinite-horizon MDPs with Borel state and action spaces. It introduces a canonical probability space and decomposes the standard expected regret into three components: (i) expected finite-time regret, (ii) expected state regret, and (iii) expected residual regret (with a finer probabilistic/sample-path version whose expectation matches (iii)). For Thompson sampling, under assumptions extending prior finite-space results to the Borel setting, it claims that component (iii) converges exponentially to zero; under mild conditions on the relevant limits, the probabilistic version converges almost surely to zero, implying complete learning.
Significance. If correct, the work meaningfully extends Thompson sampling analysis beyond finite spaces to general Borel MDPs, a setting common in continuous control and reinforcement learning. The regret decomposition separates irreversible state effects from ongoing performance gaps in a way that could support finer algorithm comparisons. The exponential decay and almost-sure convergence results are strong forms of learning guarantees. The canonical probability space construction is a technical contribution that may enable future sampling-based analyses in this setting.
major comments (1)
- [Abstract] Abstract: the central claim that component (iii) converges exponentially (and its probabilistic counterpart a.s.) rests on 'assumptions that extend those used in prior work on finite state and action spaces to the Borel setting,' yet the abstract provides no explicit verification that the Borel measurability conditions or existence of the relevant limits hold in the introduced canonical probability space; this extension is load-bearing for the headline result.
minor comments (1)
- The decomposition into (i)–(iii) would be clearer if accompanied by a short diagram or table showing how each term maps to standard finite-horizon regret notions.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the paper's significance and for identifying a point of clarification in the abstract. We address the major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that component (iii) converges exponentially (and its probabilistic counterpart a.s.) rests on 'assumptions that extend those used in prior work on finite state and action spaces to the Borel setting,' yet the abstract provides no explicit verification that the Borel measurability conditions or existence of the relevant limits hold in the introduced canonical probability space; this extension is load-bearing for the headline result.
Authors: The abstract is a concise summary; the full construction of the canonical probability space (Section 3) is explicitly designed to preserve Borel measurability of the relevant maps, and the assumptions in Section 4 are stated as direct extensions of the finite-space conditions from prior work, with the mild conditions for almost-sure convergence already noted in the abstract. We agree the abstract could signal this more clearly and will revise it to briefly reference that the canonical space satisfies the required measurability and that the limits exist under the stated conditions. The technical verification remains in the body, as is standard for such results. revision: yes
Circularity Check
No significant circularity
full rationale
The paper decomposes regret into three explicit components and proves exponential decay of the residual term (iii) for Thompson sampling under assumptions that extend finite-space results to Borel MDPs. No equation, definition, or limit statement reduces the claimed convergence to a fitted quantity, self-referential construction, or load-bearing self-citation. The derivation is presented as building on but independent of prior finite-space work, with new metrics and sample-path versions introduced explicitly. This is the most common honest finding for papers whose central claims remain externally falsifiable.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The MDP has Borel state and action spaces whose rewards and transitions depend on an unknown parameter.
- domain assumption Assumptions exist that extend the finite-state finite-action Thompson sampling analysis to the Borel case while preserving the required measurability and convergence properties.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Approximation of Discrete-Time Infinite-Horizon Mean-Field Equilibria via Finite-Horizon Mean-Field Equilibria
Finite-horizon mean-field equilibria accumulate to non-stationary infinite-horizon mean-field equilibria and converge to stationary ones under stated conditions, with explicit error bounds and a new uniqueness criterion.
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn issn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in "" FUNCTION format.date year ...
-
[3]
Conference on Learning Theory, 39--1
Agrawal S, Goyal N (2012) Analysis of thompson sampling for the multi-armed bandit problem. Conference on Learning Theory, 39--1
work page 2012
-
[4]
IEEE Transactions on Automatic Control 64(10):4137--4152
Banjevi \'c D, Kim MJ (2019) Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control 64(10):4137--4152
work page 2019
-
[5]
Chung KL (2001) A course in probability theory (Academic press)
work page 2001
-
[6]
The Annals of Mathematical Statistics 1386--1403
Freedman DA (1963) On the asymptotic behavior of bayes' estimates in the discrete case. The Annals of Mathematical Statistics 1386--1403
work page 1963
-
[7]
Conference on Learning Theory, 861--898
Gopalan A, Mannor S (2015) Thompson sampling for learning parameterized markov decision processes. Conference on Learning Theory, 861--898
work page 2015
-
[8]
Hern \'a ndez-Lerma O (2012) Adaptive Markov control processes, volume 79 (Springer Science & Business Media)
work page 2012
-
[9]
Hern \'a ndez-Lerma O, Lasserre JB (2012) Discrete-time Markov control processes: basic optimality criteria, volume 30 (Springer Science & Business Media)
work page 2012
-
[10]
Kalkanli C, Ozgur A (2020) Asymptotic convergence of thompson sampling. arXiv e-prints arXiv--2011
work page 2020
-
[11]
Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49:209--232
work page 2002
-
[12]
IEEE Transactions on Automatic Control 62(12):6415--6422
Kim MJ (2017) Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control 62(12):6415--6422
work page 2017
-
[13]
Advances in applied mathematics 6(1):4--22
Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4--22
work page 1985
-
[14]
Thompson Sampling is Asymptotically Optimal in General Environments
Leike J, Lattimore T, Orseau L, Hutter M (2016) Thompson sampling is asymptotically optimal in general environments. arXiv preprint arXiv:1602.07905
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[15]
Puterman ML (2014) Markov Decision Processes.: Discrete Stochastic Dynamic Programming (John Wiley & Sons)
work page 2014
-
[16]
Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71
Sch \"a l M (1987) Estimation and control in discounted stochastic dynamic programming. Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71
work page 1987
-
[17]
Shreve SE (1978) Stochastic optimal control: The discrete time case (Academic Press)
work page 1978
-
[18]
Sutton RS, Barto AG (2018) Reinforcement learning: An introduction (MIT press)
work page 2018
-
[19]
Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285--294
work page 1933
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.