pith. sign in

arxiv: 2405.08253 · v3 · submitted 2024-05-14 · 📊 stat.ML · cs.LG· math.OC

Thompson Sampling for Infinite-Horizon Discounted Decision Processes

Pith reviewed 2026-05-24 01:38 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC
keywords Thompson samplinginfinite-horizon MDPsBorel spacesregret decompositioncomplete learningdiscounted decision processesresidual regretadaptive learning
0
0 comments X

The pith

Thompson sampling achieves complete learning in infinite-horizon discounted MDPs with Borel spaces by making residual regret vanish exponentially.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for evaluating learning in sampling-based algorithms applied to discounted infinite-horizon MDPs whose states and actions live in Borel spaces and whose dynamics depend on an unknown parameter. Standard regret definitions do not suffice for policy evaluation in this infinite setting, so the work decomposes expected regret into three parts: finite-time regret, state regret that captures irreversible movement to worse states, and residual regret that measures ongoing suboptimality from the current time onward. For Thompson sampling the analysis establishes that residual regret decays exponentially fast under assumptions that extend finite-space results to the Borel case, while the sample-path version of residual regret converges almost surely to zero under mild limit conditions, which the authors call complete learning.

Core claim

We introduce a canonical probability space for these MDPs and decompose expected regret into expected finite-time regret, expected state regret, and expected residual regret. The probabilistic residual regret is defined as the conditional expected remaining loss given history. Under assumptions extending those used for finite state-action spaces, Thompson sampling makes the expected residual regret converge to zero exponentially fast. Under additional mild conditions ensuring the relevant limits exist, the probabilistic residual regret converges to zero almost surely, so Thompson sampling achieves complete learning.

What carries the argument

The decomposition of expected regret into finite-time, state, and residual components, together with the probabilistic residual regret as its sample-path counterpart.

If this is right

  • Exponential decay of residual regret means the algorithm eventually incurs no additional loss relative to the optimal policy from any time onward.
  • Almost-sure convergence of the probabilistic residual regret guarantees that complete learning holds on almost every sample path.
  • The new regret decomposition supplies a way to separate irreversible state effects from ongoing policy suboptimality in infinite-horizon problems.
  • Complete learning extends the notion of asymptotic optimality previously shown only for finite spaces to the continuous Borel case.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-term decomposition could be used to study other posterior-sampling or bandit algorithms in continuous-state MDPs.
  • In applied settings such as continuous inventory control, one could monitor the realized residual regret to test whether learning is occurring on observed trajectories.
  • If the mild limit conditions hold in practice, the result suggests that Thompson sampling eventually behaves like the optimal policy even when states are drawn from uncountable spaces.

Load-bearing premise

The assumptions used for finite state and action spaces extend to the Borel setting in a manner that preserves the exponential convergence properties for residual regret.

What would settle it

A concrete Borel MDP satisfying the paper's extended assumptions in which the expected residual regret for Thompson sampling stays bounded away from zero for all large times.

Figures

Figures reproduced from arXiv: 2405.08253 by Alba V. Olivares-Nadal, Cagla Keceli, Daniel Adelman.

Figure 1
Figure 1. Figure 1: Evolution of the stochastic process, in the case when [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of two different sample paths giving rise to the expected state regret. [PITH_FULL_IMAGE:figures/full_fig_p017_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Deterministic reward depending on the control, where true parameter is B. [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Stochastic rewards depending on the control, where the true parameter is B. [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Evolution of expected posterior and expected residual regret in Example [PITH_FULL_IMAGE:figures/full_fig_p021_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Constant reward depending only on the first control, picked at t=0. [PITH_FULL_IMAGE:figures/full_fig_p037_6.png] view at source ↗
read the original abstract

This paper develops a viable notion of learning for sampling-based algorithms that applies in broader settings than previously considered. More specifically, we model a discounted infinite-horizon MDPs with Borel state and action spaces, whose rewards and transitions depend on an unknown parameter. To analyze adaptive learning algorithms based on sampling we introduce a general canonical probability space in this setting. Since standard definitions of regret are inadequate for policy evaluation in this setting, we propose new metrics that arise from decomposing the standard expected regret in discounted infinite-horizon MDPs into three terms: (i) the expected finite-time regret, (ii) the expected state regret, and (iii) the expected residual regret. Component (i) translates into the traditional concept of expected regret over a finite horizon. Term (ii) reflects how much future performance is compromised at a given time because earlier decisions have led the system to a less favorable state than under an optimal policy. Finally, metric (iii) measures regret with respect to the optimal reward from the current period onward, disregarding the irreversible consequences of past decisions. We further disaggregate this term by introducing the probabilistic residual regret, a finer, sample-path version of (iii) that captures the remaining loss in future performance from the current period onward, conditional on the observed history. Its expectation coincides with (iii). We then focus on Thompson sampling (TS); under assumptions that extend those used in prior work on finite state and action spaces to the Borel setting, we show that component (iii) for TS converges to zero exponentially fast. We further show that, under mild conditions ensuring the existence of the relevant limits, its probabilistic counterpart converges to zero almost surely and TS achieves complete learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper develops a notion of learning for sampling-based algorithms in discounted infinite-horizon MDPs with Borel state and action spaces. It introduces a canonical probability space and decomposes the standard expected regret into three components: (i) expected finite-time regret, (ii) expected state regret, and (iii) expected residual regret (with a finer probabilistic/sample-path version whose expectation matches (iii)). For Thompson sampling, under assumptions extending prior finite-space results to the Borel setting, it claims that component (iii) converges exponentially to zero; under mild conditions on the relevant limits, the probabilistic version converges almost surely to zero, implying complete learning.

Significance. If correct, the work meaningfully extends Thompson sampling analysis beyond finite spaces to general Borel MDPs, a setting common in continuous control and reinforcement learning. The regret decomposition separates irreversible state effects from ongoing performance gaps in a way that could support finer algorithm comparisons. The exponential decay and almost-sure convergence results are strong forms of learning guarantees. The canonical probability space construction is a technical contribution that may enable future sampling-based analyses in this setting.

major comments (1)
  1. [Abstract] Abstract: the central claim that component (iii) converges exponentially (and its probabilistic counterpart a.s.) rests on 'assumptions that extend those used in prior work on finite state and action spaces to the Borel setting,' yet the abstract provides no explicit verification that the Borel measurability conditions or existence of the relevant limits hold in the introduced canonical probability space; this extension is load-bearing for the headline result.
minor comments (1)
  1. The decomposition into (i)–(iii) would be clearer if accompanied by a short diagram or table showing how each term maps to standard finite-horizon regret notions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their positive evaluation of the paper's significance and for identifying a point of clarification in the abstract. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that component (iii) converges exponentially (and its probabilistic counterpart a.s.) rests on 'assumptions that extend those used in prior work on finite state and action spaces to the Borel setting,' yet the abstract provides no explicit verification that the Borel measurability conditions or existence of the relevant limits hold in the introduced canonical probability space; this extension is load-bearing for the headline result.

    Authors: The abstract is a concise summary; the full construction of the canonical probability space (Section 3) is explicitly designed to preserve Borel measurability of the relevant maps, and the assumptions in Section 4 are stated as direct extensions of the finite-space conditions from prior work, with the mild conditions for almost-sure convergence already noted in the abstract. We agree the abstract could signal this more clearly and will revise it to briefly reference that the canonical space satisfies the required measurability and that the limits exist under the stated conditions. The technical verification remains in the body, as is standard for such results. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper decomposes regret into three explicit components and proves exponential decay of the residual term (iii) for Thompson sampling under assumptions that extend finite-space results to Borel MDPs. No equation, definition, or limit statement reduces the claimed convergence to a fitted quantity, self-referential construction, or load-bearing self-citation. The derivation is presented as building on but independent of prior finite-space work, with new metrics and sample-path versions introduced explicitly. This is the most common honest finding for papers whose central claims remain externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on unstated technical assumptions needed to extend finite-space Thompson sampling analysis to Borel spaces; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption The MDP has Borel state and action spaces whose rewards and transitions depend on an unknown parameter.
    Stated directly in the abstract as the modeling setting.
  • domain assumption Assumptions exist that extend the finite-state finite-action Thompson sampling analysis to the Borel case while preserving the required measurability and convergence properties.
    Invoked in the abstract when stating the convergence results for Thompson sampling.

pith-pipeline@v0.9.0 · 5848 in / 1364 out tokens · 24736 ms · 2026-05-24T01:38:27.427342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Approximation of Discrete-Time Infinite-Horizon Mean-Field Equilibria via Finite-Horizon Mean-Field Equilibria

    math.OC 2025-09 unverdicted novelty 5.0

    Finite-horizon mean-field equilibria accumulate to non-stationary infinite-horizon mean-field equilibria and converge to stationary ones under stated conditions, with explicit error bounds and a new uniqueness criterion.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address author booktitle chapter doi edition editor eid howpublished institution isbn issn journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in "" FUNCTION format.date year ...

  3. [3]

    Conference on Learning Theory, 39--1

    Agrawal S, Goyal N (2012) Analysis of thompson sampling for the multi-armed bandit problem. Conference on Learning Theory, 39--1

  4. [4]

    IEEE Transactions on Automatic Control 64(10):4137--4152

    Banjevi \'c D, Kim MJ (2019) Thompson sampling for stochastic control: The continuous parameter case. IEEE Transactions on Automatic Control 64(10):4137--4152

  5. [5]

    Chung KL (2001) A course in probability theory (Academic press)

  6. [6]

    The Annals of Mathematical Statistics 1386--1403

    Freedman DA (1963) On the asymptotic behavior of bayes' estimates in the discrete case. The Annals of Mathematical Statistics 1386--1403

  7. [7]

    Conference on Learning Theory, 861--898

    Gopalan A, Mannor S (2015) Thompson sampling for learning parameterized markov decision processes. Conference on Learning Theory, 861--898

  8. [8]

    Hern \'a ndez-Lerma O (2012) Adaptive Markov control processes, volume 79 (Springer Science & Business Media)

  9. [9]

    Hern \'a ndez-Lerma O, Lasserre JB (2012) Discrete-time Markov control processes: basic optimality criteria, volume 30 (Springer Science & Business Media)

  10. [10]

    arXiv e-prints arXiv--2011

    Kalkanli C, Ozgur A (2020) Asymptotic convergence of thompson sampling. arXiv e-prints arXiv--2011

  11. [11]

    Machine learning 49:209--232

    Kearns M, Singh S (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49:209--232

  12. [12]

    IEEE Transactions on Automatic Control 62(12):6415--6422

    Kim MJ (2017) Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control 62(12):6415--6422

  13. [13]

    Advances in applied mathematics 6(1):4--22

    Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1):4--22

  14. [14]

    Thompson Sampling is Asymptotically Optimal in General Environments

    Leike J, Lattimore T, Orseau L, Hutter M (2016) Thompson sampling is asymptotically optimal in general environments. arXiv preprint arXiv:1602.07905

  15. [15]

    Puterman ML (2014) Markov Decision Processes.: Discrete Stochastic Dynamic Programming (John Wiley & Sons)

  16. [16]

    Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71

    Sch \"a l M (1987) Estimation and control in discounted stochastic dynamic programming. Stochastics: An International Journal of Probability and Stochastic Processes 20(1):51--71

  17. [17]

    Shreve SE (1978) Stochastic optimal control: The discrete time case (Academic Press)

  18. [18]

    Sutton RS, Barto AG (2018) Reinforcement learning: An introduction (MIT press)

  19. [19]

    Biometrika 25(3/4):285--294

    Thompson WR (1933) On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4):285--294