Restless dependent bandits with fading memory

Alexandra Carpentier; Gilles Blanchard; Oleksandr Zadorozhnyi

arxiv: 1906.10454 · v1 · pith:5EZYYC4Anew · submitted 2019-06-25 · 📊 stat.ML · cs.LG

Restless dependent bandits with fading memory

Oleksandr Zadorozhnyi , Gilles Blanchard , Alexandra Carpentier This is my paper

Pith reviewed 2026-05-25 16:07 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords multi-armed banditsdependent observationsC-mixing processesregret analysisUCB algorithmpseudo-regretmixing dependence

0 comments

The pith

A C-Mix Improved UCB algorithm achieves pseudo-regret bounds comparable to the i.i.d. case for bandits whose arm samples follow weak C-mixing processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the stochastic multi-armed bandit setting when successive rewards from each arm are drawn from a weak C-mixing process instead of being independent. It introduces the C-Mix Improved UCB algorithm and derives both problem-dependent and problem-independent regret bounds under two regimes of the mixing coefficients. For fast-mixing processes the pseudo-regret matches the i.i.d. upper bound up to a constant factor; for slow-mixing processes the bound equals the independent-case expression plus an additive term that does not scale with the number of arms, and this is shown to be tight up to a log(T) factor by a matching minimax lower bound.

Core claim

We establish a C-Mix Improved UCB algorithm and provide both problem-dependent and independent regret analysis in two different scenarios. In the first, so-called fast-mixing scenario, we show that pseudo-regret enjoys the same upper bound (up to a factor) as for i.i.d. observations; whereas in the second, slow mixing scenario, we discover a surprising effect, that the regret upper bound is similar to the independent case, with an incremental additive term which does not depend on the number of arms. The analysis of slow mixing scenario is supported with a minmax lower bound, which (up to a log(T) factor) matches the obtained upper bound.

What carries the argument

The C-Mix Improved UCB algorithm, which modifies the usual UCB confidence intervals by incorporating a priori bounds on the C-mixing coefficients to control the dependence between successive observations from the same arm.

If this is right

In the fast-mixing regime the pseudo-regret upper bound is the same order as the i.i.d. case up to a multiplicative constant.
In the slow-mixing regime the upper bound equals the independent-case bound plus an additive term that does not depend on the number of arms.
A minimax lower bound matches the slow-mixing upper bound up to a log(T) factor.
The algorithm and analysis apply to both problem-dependent and problem-independent regret measures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The additive penalty in the slow-mixing case suggests that long but decaying temporal correlations impose a fixed overhead rather than scaling with problem size.
The same mixing-coefficient technique could be applied to other sequential decision problems such as contextual bandits or reinforcement learning with fading memory.
If mixing rates can be estimated online, the method might extend to settings where the dependence structure is unknown in advance.

Load-bearing premise

The arm samples are generated from weak C-mixing processes whose mixing coefficients are known or can be bounded a priori.

What would settle it

A simulation or explicit construction with known slow-mixing coefficients in which the observed regret grows with the number of arms beyond the claimed additive term that is independent of arm count.

read the original abstract

We study the stochastic multi-armed bandit problem in the case when the arm samples are dependent over time and generated from so-called weak $\cC$-mixing processes. We establish a $\cC-$Mix Improved UCB agorithm and provide both problem-dependent and independent regret analysis in two different scenarios. In the first, so-called fast-mixing scenario, we show that pseudo-regret enjoys the same upper bound (up to a factor) as for i.i.d. observations; whereas in the second, slow mixing scenario, we discover a surprising effect, that the regret upper bound is similar to the independent case, with an incremental {\em additive} term which does not depend on the number of arms. The analysis of slow mixing scenario is supported with a minmax lower bound, which (up to a $\log(T)$ factor) matches the obtained upper bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper extends UCB to weak C-mixing bandits with matching upper and lower bounds in the slow-mixing regime and an arm-independent additive term, but everything rests on knowing the mixing rates in advance.

read the letter

The core contribution is a C-Mix Improved UCB algorithm plus regret bounds that recover the usual i.i.d. rates (up to constants) under fast mixing and, under slow mixing, add only an arm-independent term while matching a minimax lower bound up to a log T factor. That separation and the form of the slow-mixing bound look new relative to the mixing-bandit papers cited in the abstract. The analysis is carried out under standard weak-mixing assumptions and appears to use the quantitative decay rates in a direct way to control dependence in the concentration inequalities. The lower bound is a useful check that the upper bound is not loose in the slow case. The main limitation is exactly what the abstract and stress-test note: the algorithm and all deviation bounds require the mixing coefficients to be known or bounded a priori. There is no estimator for those rates, no robustness result under misspecification, and no discussion of how an analyst would obtain them in practice. That assumption is load-bearing; without it the indices cannot be formed and the proofs do not go through. The paper is therefore a clean theoretical extension for people already working inside the mixing-bandit model, but it does not yet address how to make the method operational when the dependence structure is unknown. I would bring this to a reading group focused on non-i.i.d. bandits. It is worth sending to peer review because the algorithmic construction and the matching lower bound are concrete and falsifiable once the mixing rates are given; a referee can check the proofs and ask for estimation extensions if needed.

Referee Report

1 major / 0 minor

Summary. The paper studies the stochastic multi-armed bandit problem where arm samples are generated from weak C-mixing processes with dependence over time. It introduces a C-Mix Improved UCB algorithm and derives both problem-dependent and problem-independent pseudo-regret bounds in two regimes: fast-mixing (where the bound matches the i.i.d. case up to a constant factor) and slow-mixing (where the bound is similar to the independent case plus an additive term independent of the number of arms). The slow-mixing upper bound is supported by a matching minimax lower bound up to a log(T) factor.

Significance. If the derivations hold under the stated assumptions, the work would meaningfully extend bandit regret analysis to dependent observations with quantifiable mixing rates. The slow-mixing result is notable for isolating an additive penalty independent of K, and the provision of both upper and lower bounds in that regime strengthens the contribution. The paper does not ship machine-checked proofs or reproducible code.

major comments (1)

[Abstract / Assumptions] Abstract and weakest-assumption paragraph: both the C-Mix Improved UCB index construction and the deviation inequalities controlling temporal dependence explicitly require the C-mixing coefficients to be known or bounded a priori. No estimation procedure from data, no robustness analysis under rate misspecification, and no discussion of how the algorithm would be implemented when rates are unknown are provided. This assumption is load-bearing for the claimed regret bounds in both scenarios.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the single major comment below, clarifying the role of the known mixing-rate assumption in our theoretical analysis.

read point-by-point responses

Referee: [Abstract / Assumptions] Abstract and weakest-assumption paragraph: both the C-Mix Improved UCB index construction and the deviation inequalities controlling temporal dependence explicitly require the C-mixing coefficients to be known or bounded a priori. No estimation procedure from data, no robustness analysis under rate misspecification, and no discussion of how the algorithm would be implemented when rates are unknown are provided. This assumption is load-bearing for the claimed regret bounds in both scenarios.

Authors: We agree that the analysis requires the C-mixing coefficients (or an upper bound on them) to be known a priori; this is explicitly stated in the problem formulation and is used both to tune the UCB index and to apply the concentration inequalities for C-mixing processes. The paper’s contribution is a theoretical characterization of regret under this known-rate model, analogous to how many analyses of dependent bandits or time-series bandits begin with known dependence parameters before considering estimation. We do not provide an estimator for the mixing coefficients, nor a robustness analysis under misspecification, because these are orthogonal and technically demanding questions (consistent estimation of mixing rates typically requires additional ergodicity or parametric assumptions). We will revise the abstract, introduction, and the paragraph on weakest assumptions to (i) restate the known-rate requirement more prominently and (ii) add a short remark that practical deployment would require either prior knowledge or a separate estimation step, which we leave for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; regret bounds derived from external mixing assumptions

full rationale

The paper assumes mixing coefficients are known or bounded a priori as an input to both the C-Mix Improved UCB algorithm and the deviation inequalities in the regret proofs. This is a modeling assumption external to the derivation, not a quantity fitted or defined from the paper's own outputs. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the provided abstract or reader's summary. The fast- and slow-mixing regret bounds follow from applying standard concentration tools scaled by the given mixing rates; the derivation chain remains self-contained against those external rates rather than reducing to its own results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the quantitative weak C-mixing assumption and on standard concentration inequalities adapted to mixing sequences; no new entities are postulated and no parameters appear to be fitted inside the regret expressions.

axioms (1)

domain assumption Arm reward sequences are generated by weak C-mixing processes whose mixing coefficients decay at a known rate.
Invoked in the abstract to separate fast-mixing and slow-mixing scenarios and to control dependence in the regret analysis.

pith-pipeline@v0.9.0 · 5676 in / 1276 out tokens · 34277 ms · 2026-05-25T16:07:55.199658+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We establish a C-Mix Improved UCB algorithm... pseudo-regret... fast-mixing scenario... slow mixing scenario... Φ_C-mixing coefficients
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Φ_C(k) = sup... lim Φ_C(k)=0... polynomially mixing Φ_C(t)≤c0 t^{-α}

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1]

Audiffren and L

J. Audiffren and L. Ralaivola. Cornering stationary and restless mixing bandits with remix-ucb. In NIPS, 2015

work page 2015
[2]

Auer and R

P. Auer and R. Ortner. Ucb revisited: improved upper bounds for the stochastic multi-armed bandit problem. Peroodica Mathematica Hungarica, 61 0 (1-2): 0 55--65, 2010

work page 2010
[3]

Using confidence bounds for exploration-exploitation trade-offs

Peter Auer. Using confidence bounds for exploration-exploitation trade-offs. Journal of Machine Learning Research, 3: 0 397--422, 2002

work page 2002
[4]

Bubeck and N

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 0 (1): 0 1--122, 2012

work page 2012
[5]

Bubeck, N

S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 2013

work page 2013
[6]

Bandit Games and Clustering Foundations

Sebastien Bubeck. Bandit Games and Clustering Foundations. PhD thesis, Universite des Sciences et technologie de Lille - Lille I, 2010

work page 2010
[7]

K. Canda. Strong mixing properties of linear stochastic processes. Journal of Applied Pobability, 11: 0 401--408, 1974

work page 1974
[8]

Dedecker and F

J. Dedecker and F. Merlevede. Moment bounds for dependent sequences in smooth banach spaces. Stochastic Processes and their Applications, 125 0 (9): 0 3401--3429, 2015

work page 2015
[9]

Dedecker, P

J. Dedecker, P. Doukhan, G. Lang, R. Leon, S. Louhichi, and C. Prieur. Weak dependence with examples and applications. Springer, New York, 2007

work page 2007
[10]

Desautels, A

T. Desautels, A. Krause, and John W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, pages 4053--4103, 2014

work page 2014
[11]

Gr\"unew\"alder and A

S. Gr\"unew\"alder and A. Khaleghi. Approximations of the restless bandit problem. Preprint, 2017

work page 2017
[12]

S. Guha, K. Munagala, and Pal M. Multi-armed bandit problems with delayed feedback. Arxiv. Preprint, 2010 a

work page 2010
[13]

S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58 0 (1), 2010 b

work page 2010
[14]

Jarner and G.O

S.F. Jarner and G.O. Roberts. Polynomial convergence rates of markov chains. The Annals of Applied Probability, 12 0 (1): 0 224--247, 2002

work page 2002
[15]

Joulani, A

P. Joulani, A. Gyorgy, and Szeperavi C. Online learning under delayed feedback. In ICML 2013, volume 28, pages 1453--1461, 2013

work page 2013
[16]

Kallenberg

O. Kallenberg. Commutativity properties of conditional distributions and P alm measures. Communications on Stochastic Analysis, 4 0 (1): 0 Article 3, 2010

work page 2010
[17]

Kallenberg

O. Kallenberg. Random measures. Theory and Applications. 2nd edition. Springer, 2017

work page 2017
[18]

Maume-Deschamps

V. Maume-Deschamps. Exponential inequalities and functional estimations for weak dependent data; applications to dynamical systems. Stoch. Dyn., 6 0 (4): 0 535--560, 2006

work page 2006
[19]

G. Neu, A. Gy\"orgy, C. Szepesvari, and A. Antos. Online markov decision processes under bandit feedback. In IEEE Transactions on Automatic Control, volume 59, pages 676--691, 2013

work page 2013
[20]

Ortner, D

R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless markov bandits. Theoretical Computer Science, pages 62--76, 2014

work page 2014
[21]

Peligrad, S

M. Peligrad, S. Utev, and Wu W.B. A maximal l_ p -inequality for stationary sequences and its applications. Proceedings of the Mathematical Society, 135 0 (2): 0 541--550, 2007

work page 2007
[22]

E. Rio. Sur le th\'eoreme de berry-esseen pour les suites faiblement d\'ependantes. Probab. Th. Rel. Fields, 104: 0 255--282, 1996

work page 1996
[23]

E. Rio. Th\'eorie asymptotique des processus al\'eatoires faibalement d\'ependants. Springer, Berlin, 2000

work page 2000
[24]

H. Robbins. Some aspects of the sequential design of the experiments. Bulletin of the American mathematical Society, 58 0 (5): 0 527--535, 1952

work page 1952
[25]

Rosenblatt

M. Rosenblatt. Gaussian and Non-gaussian linear time series and random fields. Springer, New York, 2000

work page 2000
[26]

P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 1988

work page 1988
[27]

Wintenberger

O. Wintenberger. Deviation inequalities for sums of weakly dependent time series. Electronic Communications in Probability, 0 (15): 0 489--503, 2010

work page 2010

[1] [1]

Audiffren and L

J. Audiffren and L. Ralaivola. Cornering stationary and restless mixing bandits with remix-ucb. In NIPS, 2015

work page 2015

[2] [2]

Auer and R

P. Auer and R. Ortner. Ucb revisited: improved upper bounds for the stochastic multi-armed bandit problem. Peroodica Mathematica Hungarica, 61 0 (1-2): 0 55--65, 2010

work page 2010

[3] [3]

Using confidence bounds for exploration-exploitation trade-offs

Peter Auer. Using confidence bounds for exploration-exploitation trade-offs. Journal of Machine Learning Research, 3: 0 397--422, 2002

work page 2002

[4] [4]

Bubeck and N

S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 0 (1): 0 1--122, 2012

work page 2012

[5] [5]

Bubeck, N

S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 2013

work page 2013

[6] [6]

Bandit Games and Clustering Foundations

Sebastien Bubeck. Bandit Games and Clustering Foundations. PhD thesis, Universite des Sciences et technologie de Lille - Lille I, 2010

work page 2010

[7] [7]

K. Canda. Strong mixing properties of linear stochastic processes. Journal of Applied Pobability, 11: 0 401--408, 1974

work page 1974

[8] [8]

Dedecker and F

J. Dedecker and F. Merlevede. Moment bounds for dependent sequences in smooth banach spaces. Stochastic Processes and their Applications, 125 0 (9): 0 3401--3429, 2015

work page 2015

[9] [9]

Dedecker, P

J. Dedecker, P. Doukhan, G. Lang, R. Leon, S. Louhichi, and C. Prieur. Weak dependence with examples and applications. Springer, New York, 2007

work page 2007

[10] [10]

Desautels, A

T. Desautels, A. Krause, and John W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, pages 4053--4103, 2014

work page 2014

[11] [11]

Gr\"unew\"alder and A

S. Gr\"unew\"alder and A. Khaleghi. Approximations of the restless bandit problem. Preprint, 2017

work page 2017

[12] [12]

S. Guha, K. Munagala, and Pal M. Multi-armed bandit problems with delayed feedback. Arxiv. Preprint, 2010 a

work page 2010

[13] [13]

S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58 0 (1), 2010 b

work page 2010

[14] [14]

Jarner and G.O

S.F. Jarner and G.O. Roberts. Polynomial convergence rates of markov chains. The Annals of Applied Probability, 12 0 (1): 0 224--247, 2002

work page 2002

[15] [15]

Joulani, A

P. Joulani, A. Gyorgy, and Szeperavi C. Online learning under delayed feedback. In ICML 2013, volume 28, pages 1453--1461, 2013

work page 2013

[16] [16]

Kallenberg

O. Kallenberg. Commutativity properties of conditional distributions and P alm measures. Communications on Stochastic Analysis, 4 0 (1): 0 Article 3, 2010

work page 2010

[17] [17]

Kallenberg

O. Kallenberg. Random measures. Theory and Applications. 2nd edition. Springer, 2017

work page 2017

[18] [18]

Maume-Deschamps

V. Maume-Deschamps. Exponential inequalities and functional estimations for weak dependent data; applications to dynamical systems. Stoch. Dyn., 6 0 (4): 0 535--560, 2006

work page 2006

[19] [19]

G. Neu, A. Gy\"orgy, C. Szepesvari, and A. Antos. Online markov decision processes under bandit feedback. In IEEE Transactions on Automatic Control, volume 59, pages 676--691, 2013

work page 2013

[20] [20]

Ortner, D

R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless markov bandits. Theoretical Computer Science, pages 62--76, 2014

work page 2014

[21] [21]

Peligrad, S

M. Peligrad, S. Utev, and Wu W.B. A maximal l_ p -inequality for stationary sequences and its applications. Proceedings of the Mathematical Society, 135 0 (2): 0 541--550, 2007

work page 2007

[22] [22]

E. Rio. Sur le th\'eoreme de berry-esseen pour les suites faiblement d\'ependantes. Probab. Th. Rel. Fields, 104: 0 255--282, 1996

work page 1996

[23] [23]

E. Rio. Th\'eorie asymptotique des processus al\'eatoires faibalement d\'ependants. Springer, Berlin, 2000

work page 2000

[24] [24]

H. Robbins. Some aspects of the sequential design of the experiments. Bulletin of the American mathematical Society, 58 0 (5): 0 527--535, 1952

work page 1952

[25] [25]

Rosenblatt

M. Rosenblatt. Gaussian and Non-gaussian linear time series and random fields. Springer, New York, 2000

work page 2000

[26] [26]

P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 1988

work page 1988

[27] [27]

Wintenberger

O. Wintenberger. Deviation inequalities for sums of weakly dependent time series. Electronic Communications in Probability, 0 (15): 0 489--503, 2010

work page 2010