Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Alba V. Olivares-Nadal; Cagla Keceli; Daniel Adelman

arxiv: 2405.08253 · v3 · submitted 2024-05-14 · 📊 stat.ML · cs.LG· math.OC

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Daniel Adelman , Cagla Keceli , Alba V. Olivares-Nadal This is my paper

Pith reviewed 2026-05-24 01:38 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC

keywords Thompson samplinginfinite-horizon MDPsBorel spacesregret decompositioncomplete learningdiscounted decision processesresidual regretadaptive learning

0 comments

The pith

Thompson sampling achieves complete learning in infinite-horizon discounted MDPs with Borel spaces by making residual regret vanish exponentially.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for evaluating learning in sampling-based algorithms applied to discounted infinite-horizon MDPs whose states and actions live in Borel spaces and whose dynamics depend on an unknown parameter. Standard regret definitions do not suffice for policy evaluation in this infinite setting, so the work decomposes expected regret into three parts: finite-time regret, state regret that captures irreversible movement to worse states, and residual regret that measures ongoing suboptimality from the current time onward. For Thompson sampling the analysis establishes that residual regret decays exponentially fast under assumptions that extend finite-space results to the Borel case, while the sample-path version of residual regret converges almost surely to zero under mild limit conditions, which the authors call complete learning.

Core claim

We introduce a canonical probability space for these MDPs and decompose expected regret into expected finite-time regret, expected state regret, and expected residual regret. The probabilistic residual regret is defined as the conditional expected remaining loss given history. Under assumptions extending those used for finite state-action spaces, Thompson sampling makes the expected residual regret converge to zero exponentially fast. Under additional mild conditions ensuring the relevant limits exist, the probabilistic residual regret converges to zero almost surely, so Thompson sampling achieves complete learning.

What carries the argument

The decomposition of expected regret into finite-time, state, and residual components, together with the probabilistic residual regret as its sample-path counterpart.

If this is right

Exponential decay of residual regret means the algorithm eventually incurs no additional loss relative to the optimal policy from any time onward.
Almost-sure convergence of the probabilistic residual regret guarantees that complete learning holds on almost every sample path.
The new regret decomposition supplies a way to separate irreversible state effects from ongoing policy suboptimality in infinite-horizon problems.
Complete learning extends the notion of asymptotic optimality previously shown only for finite spaces to the continuous Borel case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-term decomposition could be used to study other posterior-sampling or bandit algorithms in continuous-state MDPs.
In applied settings such as continuous inventory control, one could monitor the realized residual regret to test whether learning is occurring on observed trajectories.
If the mild limit conditions hold in practice, the result suggests that Thompson sampling eventually behaves like the optimal policy even when states are drawn from uncountable spaces.

Load-bearing premise

The assumptions used for finite state and action spaces extend to the Borel setting in a manner that preserves the exponential convergence properties for residual regret.

What would settle it

A concrete Borel MDP satisfying the paper's extended assumptions in which the expected residual regret for Thompson sampling stays bounded away from zero for all large times.

Figures

Figures reproduced from arXiv: 2405.08253 by Alba V. Olivares-Nadal, Cagla Keceli, Daniel Adelman.

**Figure 2.** Figure 2: Illustration of two different sample paths giving rise to the expected state regret. [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗

**Figure 3.** Figure 3: Deterministic reward depending on the control, where true parameter is B. [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: Stochastic rewards depending on the control, where the true parameter is B. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Evolution of expected posterior and expected residual regret in Example [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗

**Figure 6.** Figure 6: Constant reward depending only on the first control, picked at t=0. [PITH_FULL_IMAGE:figures/full_fig_p037_6.png] view at source ↗

read the original abstract

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDPs with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling we introduce a general canonical probability space in this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret with respect to the optimal reward from the current period onward, disregarding the irreversible consequences of past decisions. We further disaggregate this term by introducing the probabilistic residual regret, a finer, sample-path version of (iii) that captures the remaining loss in future performance from the current period onward, conditional on the observed history. Its expectation coincides with (iii). We then focus on Thompson sampling (TS); under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast. We further show that, under mild conditions ensuring the existence of the relevant limits, its probabilistic counterpart converges to zero almost surely and TS achieves complete learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's main advance is a three-part regret decomposition for discounted infinite-horizon Borel MDPs plus exponential decay of the residual term under Thompson sampling.

read the letter

The paper introduces a regret decomposition for discounted infinite-horizon MDPs with Borel state and action spaces: finite-time regret, state regret, and residual regret, along with a sample-path probabilistic version of the residual. It then shows that Thompson sampling drives the residual component to zero exponentially fast and that the probabilistic version converges almost surely to zero under mild limit conditions, yielding complete learning. This is the concrete new piece relative to finite-space Thompson sampling results. The decomposition separates irreversible state effects from ongoing performance in a way that standard regret definitions do not handle cleanly for infinite horizons. The setup of a canonical probability space and the focus on sampling algorithms in continuous spaces are direct and useful. The extension of finite-space assumptions to Borel is presented as the key modeling step, and the claims follow from that. The main soft spot is that the abstract only sketches the assumptions without the explicit statements or derivations, so it is not yet possible to judge how restrictive the Borel technical conditions turn out to be or whether the exponential rate survives in typical applied cases. If the proofs close without hidden gaps, the result is fine; the stress-test note finds no internal inconsistency in the decomposition itself. This work is for researchers in theoretical reinforcement learning who need regret tools beyond finite spaces. A reader working on adaptive policies in continuous-state MDPs would get value from the new metrics. It deserves serious referee time because the problem is well-posed and the approach is grounded in prior finite results.

Referee Report

1 major / 1 minor

Summary. The paper develops a notion of learning for sampling-based algorithms in discounted infinite-horizon MDPs with Borel state and action spaces. It introduces a canonical probability space and decomposes the standard expected regret into three components: (i) expected finite-time regret, (ii) expected state regret, and (iii) expected residual regret (with a finer probabilistic/sample-path version whose expectation matches (iii)). For Thompson sampling, under assumptions extending prior finite-space results to the Borel setting, it claims that component (iii) converges exponentially to zero; under mild conditions on the relevant limits, the probabilistic version converges almost surely to zero, implying complete learning.

Significance. If correct, the work meaningfully extends Thompson sampling analysis beyond finite spaces to general Borel MDPs, a setting common in continuous control and reinforcement learning. The regret decomposition separates irreversible state effects from ongoing performance gaps in a way that could support finer algorithm comparisons. The exponential decay and almost-sure convergence results are strong forms of learning guarantees. The canonical probability space construction is a technical contribution that may enable future sampling-based analyses in this setting.

major comments (1)

[Abstract] Abstract: the central claim that component (iii) converges exponentially (and its probabilistic counterpart a.s.) rests on 'assumptions that extend those used in prior work on finite state and action spaces to the Borel setting,' yet the abstract provides no explicit verification that the Borel measurability conditions or existence of the relevant limits hold in the introduced canonical probability space; this extension is load-bearing for the headline result.

minor comments (1)

The decomposition into (i)–(iii) would be clearer if accompanied by a short diagram or table showing how each term maps to standard finite-horizon regret notions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive evaluation of the paper's significance and for identifying a point of clarification in the abstract. We address the major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that component (iii) converges exponentially (and its probabilistic counterpart a.s.) rests on 'assumptions that extend those used in prior work on finite state and action spaces to the Borel setting,' yet the abstract provides no explicit verification that the Borel measurability conditions or existence of the relevant limits hold in the introduced canonical probability space; this extension is load-bearing for the headline result.

Authors: The abstract is a concise summary; the full construction of the canonical probability space (Section 3) is explicitly designed to preserve Borel measurability of the relevant maps, and the assumptions in Section 4 are stated as direct extensions of the finite-space conditions from prior work, with the mild conditions for almost-sure convergence already noted in the abstract. We agree the abstract could signal this more clearly and will revise it to briefly reference that the canonical space satisfies the required measurability and that the limits exist under the stated conditions. The technical verification remains in the body, as is standard for such results. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper decomposes regret into three explicit components and proves exponential decay of the residual term (iii) for Thompson sampling under assumptions that extend finite-space results to Borel MDPs. No equation, definition, or limit statement reduces the claimed convergence to a fitted quantity, self-referential construction, or load-bearing self-citation. The derivation is presented as building on but independent of prior finite-space work, with new metrics and sample-path versions introduced explicitly. This is the most common honest finding for papers whose central claims remain externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on unstated technical assumptions needed to extend finite-space Thompson sampling analysis to Borel spaces; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption The MDP has Borel state and action spaces whose rewards and transitions depend on an unknown parameter.
Stated directly in the abstract as the modeling setting.
domain assumption Assumptions exist that extend the finite-state finite-action Thompson sampling analysis to the Borel case while preserving the required measurability and convergence properties.
Invoked in the abstract when stating the convergence results for Thompson sampling.

pith-pipeline@v0.9.0 · 5848 in / 1364 out tokens · 24736 ms · 2026-05-24T01:38:27.427342+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Approximation of Discrete-Time Infinite-Horizon Mean-Field Equilibria via Finite-Horizon Mean-Field Equilibria
math.OC 2025-09 unverdicted novelty 5.0

Finite-horizon mean-field equilibria accumulate to non-stationary infinite-horizon mean-field equilibria and converge to stationary ones under stated conditions, with explicit error bounds and a new uniqueness criterion.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

, " * write output.state after.block = add.period write newline

ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn issn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in "" FUNCTION format.date year ...

work page
[3]

Conference on Learning Theory, 39--1

Agrawal S, Goyal N (2012) Analysis of thompson sampling for the multi-armed bandit problem. Conference on Learning Theory, 39--1

work page 2012
[4]

IEEE Transactions on Automatic Control 64(10):4137--4152

Banjevi \'c D, Kim MJ (2019) Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control 64(10):4137--4152

work page 2019
[5]

Chung KL (2001) A course in probability theory (Academic press)

work page 2001
[6]

The Annals of Mathematical Statistics 1386--1403

Freedman DA (1963) On the asymptotic behavior of bayes' estimates in the discrete case. The Annals of Mathematical Statistics 1386--1403

work page 1963
[7]

Conference on Learning Theory, 861--898

Gopalan A, Mannor S (2015) Thompson sampling for learning parameterized markov decision processes. Conference on Learning Theory, 861--898

work page 2015
[8]

Hern \'a ndez-Lerma O (2012) Adaptive Markov control processes, volume 79 (Springer Science & Business Media)

work page 2012
[9]

Hern \'a ndez-Lerma O, Lasserre JB (2012) Discrete-time Markov control processes: basic optimality criteria, volume 30 (Springer Science & Business Media)

work page 2012
[10]

arXiv e-prints arXiv--2011

Kalkanli C, Ozgur A (2020) Asymptotic convergence of thompson sampling. arXiv e-prints arXiv--2011

work page 2020
[11]

Machine learning 49:209--232

Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49:209--232

work page 2002
[12]

IEEE Transactions on Automatic Control 62(12):6415--6422

Kim MJ (2017) Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control 62(12):6415--6422

work page 2017
[13]

Advances in applied mathematics 6(1):4--22

Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4--22

work page 1985
[14]

Thompson Sampling is Asymptotically Optimal in General Environments

Leike J, Lattimore T, Orseau L, Hutter M (2016) Thompson sampling is asymptotically optimal in general environments. arXiv preprint arXiv:1602.07905

work page internal anchor Pith review Pith/arXiv arXiv 2016
[15]

Puterman ML (2014) Markov Decision Processes.: Discrete Stochastic Dynamic Programming (John Wiley & Sons)

work page 2014
[16]

Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71

Sch \"a l M (1987) Estimation and control in discounted stochastic dynamic programming. Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71

work page 1987
[17]

Shreve SE (1978) Stochastic optimal control: The discrete time case (Academic Press)

work page 1978
[18]

Sutton RS, Barto AG (2018) Reinforcement learning: An introduction (MIT press)

work page 2018
[19]

Biometrika 25(3/4):285--294

Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285--294

work page 1933

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn issn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in "" FUNCTION format.date year ...

work page

[3] [3]

Conference on Learning Theory, 39--1

Agrawal S, Goyal N (2012) Analysis of thompson sampling for the multi-armed bandit problem. Conference on Learning Theory, 39--1

work page 2012

[4] [4]

IEEE Transactions on Automatic Control 64(10):4137--4152

Banjevi \'c D, Kim MJ (2019) Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control 64(10):4137--4152

work page 2019

[5] [5]

Chung KL (2001) A course in probability theory (Academic press)

work page 2001

[6] [6]

The Annals of Mathematical Statistics 1386--1403

Freedman DA (1963) On the asymptotic behavior of bayes' estimates in the discrete case. The Annals of Mathematical Statistics 1386--1403

work page 1963

[7] [7]

Conference on Learning Theory, 861--898

Gopalan A, Mannor S (2015) Thompson sampling for learning parameterized markov decision processes. Conference on Learning Theory, 861--898

work page 2015

[8] [8]

Hern \'a ndez-Lerma O (2012) Adaptive Markov control processes, volume 79 (Springer Science & Business Media)

work page 2012

[9] [9]

Hern \'a ndez-Lerma O, Lasserre JB (2012) Discrete-time Markov control processes: basic optimality criteria, volume 30 (Springer Science & Business Media)

work page 2012

[10] [10]

arXiv e-prints arXiv--2011

Kalkanli C, Ozgur A (2020) Asymptotic convergence of thompson sampling. arXiv e-prints arXiv--2011

work page 2020

[11] [11]

Machine learning 49:209--232

Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49:209--232

work page 2002

[12] [12]

IEEE Transactions on Automatic Control 62(12):6415--6422

Kim MJ (2017) Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control 62(12):6415--6422

work page 2017

[13] [13]

Advances in applied mathematics 6(1):4--22

Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4--22

work page 1985

[14] [14]

Thompson Sampling is Asymptotically Optimal in General Environments

Leike J, Lattimore T, Orseau L, Hutter M (2016) Thompson sampling is asymptotically optimal in general environments. arXiv preprint arXiv:1602.07905

work page internal anchor Pith review Pith/arXiv arXiv 2016

[15] [15]

Puterman ML (2014) Markov Decision Processes.: Discrete Stochastic Dynamic Programming (John Wiley & Sons)

work page 2014

[16] [16]

Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71

Sch \"a l M (1987) Estimation and control in discounted stochastic dynamic programming. Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71

work page 1987

[17] [17]

Shreve SE (1978) Stochastic optimal control: The discrete time case (Academic Press)

work page 1978

[18] [18]

Sutton RS, Barto AG (2018) Reinforcement learning: An introduction (MIT press)

work page 2018

[19] [19]

Biometrika 25(3/4):285--294

Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285--294

work page 1933