pith. sign in

arxiv: 1906.10454 · v1 · pith:5EZYYC4Anew · submitted 2019-06-25 · 📊 stat.ML · cs.LG

Restless dependent bandits with fading memory

Pith reviewed 2026-05-25 16:07 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords multi-armed banditsdependent observationsC-mixing processesregret analysisUCB algorithmpseudo-regretmixing dependence
0
0 comments X

The pith

A C-Mix Improved UCB algorithm achieves pseudo-regret bounds comparable to the i.i.d. case for bandits whose arm samples follow weak C-mixing processes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the stochastic multi-armed bandit setting when successive rewards from each arm are drawn from a weak C-mixing process instead of being independent. It introduces the C-Mix Improved UCB algorithm and derives both problem-dependent and problem-independent regret bounds under two regimes of the mixing coefficients. For fast-mixing processes the pseudo-regret matches the i.i.d. upper bound up to a constant factor; for slow-mixing processes the bound equals the independent-case expression plus an additive term that does not scale with the number of arms, and this is shown to be tight up to a log(T) factor by a matching minimax lower bound.

Core claim

We establish a C-Mix Improved UCB algorithm and provide both problem-dependent and independent regret analysis in two different scenarios. In the first, so-called fast-mixing scenario, we show that pseudo-regret enjoys the same upper bound (up to a factor) as for i.i.d. observations; whereas in the second, slow mixing scenario, we discover a surprising effect, that the regret upper bound is similar to the independent case, with an incremental additive term which does not depend on the number of arms. The analysis of slow mixing scenario is supported with a minmax lower bound, which (up to a log(T) factor) matches the obtained upper bound.

What carries the argument

The C-Mix Improved UCB algorithm, which modifies the usual UCB confidence intervals by incorporating a priori bounds on the C-mixing coefficients to control the dependence between successive observations from the same arm.

If this is right

  • In the fast-mixing regime the pseudo-regret upper bound is the same order as the i.i.d. case up to a multiplicative constant.
  • In the slow-mixing regime the upper bound equals the independent-case bound plus an additive term that does not depend on the number of arms.
  • A minimax lower bound matches the slow-mixing upper bound up to a log(T) factor.
  • The algorithm and analysis apply to both problem-dependent and problem-independent regret measures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The additive penalty in the slow-mixing case suggests that long but decaying temporal correlations impose a fixed overhead rather than scaling with problem size.
  • The same mixing-coefficient technique could be applied to other sequential decision problems such as contextual bandits or reinforcement learning with fading memory.
  • If mixing rates can be estimated online, the method might extend to settings where the dependence structure is unknown in advance.

Load-bearing premise

The arm samples are generated from weak C-mixing processes whose mixing coefficients are known or can be bounded a priori.

What would settle it

A simulation or explicit construction with known slow-mixing coefficients in which the observed regret grows with the number of arms beyond the claimed additive term that is independent of arm count.

read the original abstract

We study the stochastic multi-armed bandit problem in the case when the arm samples are dependent over time and generated from so-called weak $\cC$-mixing processes. We establish a $\cC-$Mix Improved UCB agorithm and provide both problem-dependent and independent regret analysis in two different scenarios. In the first, so-called fast-mixing scenario, we show that pseudo-regret enjoys the same upper bound (up to a factor) as for i.i.d. observations; whereas in the second, slow mixing scenario, we discover a surprising effect, that the regret upper bound is similar to the independent case, with an incremental {\em additive} term which does not depend on the number of arms. The analysis of slow mixing scenario is supported with a minmax lower bound, which (up to a $\log(T)$ factor) matches the obtained upper bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper studies the stochastic multi-armed bandit problem where arm samples are generated from weak C-mixing processes with dependence over time. It introduces a C-Mix Improved UCB algorithm and derives both problem-dependent and problem-independent pseudo-regret bounds in two regimes: fast-mixing (where the bound matches the i.i.d. case up to a constant factor) and slow-mixing (where the bound is similar to the independent case plus an additive term independent of the number of arms). The slow-mixing upper bound is supported by a matching minimax lower bound up to a log(T) factor.

Significance. If the derivations hold under the stated assumptions, the work would meaningfully extend bandit regret analysis to dependent observations with quantifiable mixing rates. The slow-mixing result is notable for isolating an additive penalty independent of K, and the provision of both upper and lower bounds in that regime strengthens the contribution. The paper does not ship machine-checked proofs or reproducible code.

major comments (1)
  1. [Abstract / Assumptions] Abstract and weakest-assumption paragraph: both the C-Mix Improved UCB index construction and the deviation inequalities controlling temporal dependence explicitly require the C-mixing coefficients to be known or bounded a priori. No estimation procedure from data, no robustness analysis under rate misspecification, and no discussion of how the algorithm would be implemented when rates are unknown are provided. This assumption is load-bearing for the claimed regret bounds in both scenarios.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the single major comment below, clarifying the role of the known mixing-rate assumption in our theoretical analysis.

read point-by-point responses
  1. Referee: [Abstract / Assumptions] Abstract and weakest-assumption paragraph: both the C-Mix Improved UCB index construction and the deviation inequalities controlling temporal dependence explicitly require the C-mixing coefficients to be known or bounded a priori. No estimation procedure from data, no robustness analysis under rate misspecification, and no discussion of how the algorithm would be implemented when rates are unknown are provided. This assumption is load-bearing for the claimed regret bounds in both scenarios.

    Authors: We agree that the analysis requires the C-mixing coefficients (or an upper bound on them) to be known a priori; this is explicitly stated in the problem formulation and is used both to tune the UCB index and to apply the concentration inequalities for C-mixing processes. The paper’s contribution is a theoretical characterization of regret under this known-rate model, analogous to how many analyses of dependent bandits or time-series bandits begin with known dependence parameters before considering estimation. We do not provide an estimator for the mixing coefficients, nor a robustness analysis under misspecification, because these are orthogonal and technically demanding questions (consistent estimation of mixing rates typically requires additional ergodicity or parametric assumptions). We will revise the abstract, introduction, and the paragraph on weakest assumptions to (i) restate the known-rate requirement more prominently and (ii) add a short remark that practical deployment would require either prior knowledge or a separate estimation step, which we leave for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; regret bounds derived from external mixing assumptions

full rationale

The paper assumes mixing coefficients are known or bounded a priori as an input to both the C-Mix Improved UCB algorithm and the deviation inequalities in the regret proofs. This is a modeling assumption external to the derivation, not a quantity fitted or defined from the paper's own outputs. No self-definitional steps, fitted-input predictions, load-bearing self-citations, or ansatz smuggling appear in the provided abstract or reader's summary. The fast- and slow-mixing regret bounds follow from applying standard concentration tools scaled by the given mixing rates; the derivation chain remains self-contained against those external rates rather than reducing to its own results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the quantitative weak C-mixing assumption and on standard concentration inequalities adapted to mixing sequences; no new entities are postulated and no parameters appear to be fitted inside the regret expressions.

axioms (1)
  • domain assumption Arm reward sequences are generated by weak C-mixing processes whose mixing coefficients decay at a known rate.
    Invoked in the abstract to separate fast-mixing and slow-mixing scenarios and to control dependence in the regret analysis.

pith-pipeline@v0.9.0 · 5676 in / 1276 out tokens · 34277 ms · 2026-05-25T16:07:55.199658+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

  1. [1]

    Audiffren and L

    J. Audiffren and L. Ralaivola. Cornering stationary and restless mixing bandits with remix-ucb. In NIPS, 2015

  2. [2]

    Auer and R

    P. Auer and R. Ortner. Ucb revisited: improved upper bounds for the stochastic multi-armed bandit problem. Peroodica Mathematica Hungarica, 61 0 (1-2): 0 55--65, 2010

  3. [3]

    Using confidence bounds for exploration-exploitation trade-offs

    Peter Auer. Using confidence bounds for exploration-exploitation trade-offs. Journal of Machine Learning Research, 3: 0 397--422, 2002

  4. [4]

    Bubeck and N

    S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5 0 (1): 0 1--122, 2012

  5. [5]

    Bubeck, N

    S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 2013

  6. [6]

    Bandit Games and Clustering Foundations

    Sebastien Bubeck. Bandit Games and Clustering Foundations. PhD thesis, Universite des Sciences et technologie de Lille - Lille I, 2010

  7. [7]

    K. Canda. Strong mixing properties of linear stochastic processes. Journal of Applied Pobability, 11: 0 401--408, 1974

  8. [8]

    Dedecker and F

    J. Dedecker and F. Merlevede. Moment bounds for dependent sequences in smooth banach spaces. Stochastic Processes and their Applications, 125 0 (9): 0 3401--3429, 2015

  9. [9]

    Dedecker, P

    J. Dedecker, P. Doukhan, G. Lang, R. Leon, S. Louhichi, and C. Prieur. Weak dependence with examples and applications. Springer, New York, 2007

  10. [10]

    Desautels, A

    T. Desautels, A. Krause, and John W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, pages 4053--4103, 2014

  11. [11]

    Gr\"unew\"alder and A

    S. Gr\"unew\"alder and A. Khaleghi. Approximations of the restless bandit problem. Preprint, 2017

  12. [12]

    S. Guha, K. Munagala, and Pal M. Multi-armed bandit problems with delayed feedback. Arxiv. Preprint, 2010 a

  13. [13]

    S. Guha, K. Munagala, and P. Shi. Approximation algorithms for restless bandit problems. Journal of the ACM (JACM), 58 0 (1), 2010 b

  14. [14]

    Jarner and G.O

    S.F. Jarner and G.O. Roberts. Polynomial convergence rates of markov chains. The Annals of Applied Probability, 12 0 (1): 0 224--247, 2002

  15. [15]

    Joulani, A

    P. Joulani, A. Gyorgy, and Szeperavi C. Online learning under delayed feedback. In ICML 2013, volume 28, pages 1453--1461, 2013

  16. [16]

    Kallenberg

    O. Kallenberg. Commutativity properties of conditional distributions and P alm measures. Communications on Stochastic Analysis, 4 0 (1): 0 Article 3, 2010

  17. [17]

    Kallenberg

    O. Kallenberg. Random measures. Theory and Applications. 2nd edition. Springer, 2017

  18. [18]

    Maume-Deschamps

    V. Maume-Deschamps. Exponential inequalities and functional estimations for weak dependent data; applications to dynamical systems. Stoch. Dyn., 6 0 (4): 0 535--560, 2006

  19. [19]

    G. Neu, A. Gy\"orgy, C. Szepesvari, and A. Antos. Online markov decision processes under bandit feedback. In IEEE Transactions on Automatic Control, volume 59, pages 676--691, 2013

  20. [20]

    Ortner, D

    R. Ortner, D. Ryabko, P. Auer, and R. Munos. Regret bounds for restless markov bandits. Theoretical Computer Science, pages 62--76, 2014

  21. [21]

    Peligrad, S

    M. Peligrad, S. Utev, and Wu W.B. A maximal l_ p -inequality for stationary sequences and its applications. Proceedings of the Mathematical Society, 135 0 (2): 0 541--550, 2007

  22. [22]

    E. Rio. Sur le th\'eoreme de berry-esseen pour les suites faiblement d\'ependantes. Probab. Th. Rel. Fields, 104: 0 255--282, 1996

  23. [23]

    E. Rio. Th\'eorie asymptotique des processus al\'eatoires faibalement d\'ependants. Springer, Berlin, 2000

  24. [24]

    H. Robbins. Some aspects of the sequential design of the experiments. Bulletin of the American mathematical Society, 58 0 (5): 0 527--535, 1952

  25. [25]

    Rosenblatt

    M. Rosenblatt. Gaussian and Non-gaussian linear time series and random fields. Springer, New York, 2000

  26. [26]

    P. Whittle. Restless bandits: Activity allocation in a changing world. Journal of Applied Probability, 1988

  27. [27]

    Wintenberger

    O. Wintenberger. Deviation inequalities for sums of weakly dependent time series. Electronic Communications in Probability, 0 (15): 0 489--503, 2010