pith. machine review for the scientific record. sign in

arxiv: 2604.21432 · v1 · submitted 2026-04-23 · 📊 stat.ML · cs.LG

Recognition: unknown

A single algorithm for both restless and rested rotting bandits

Authors on Pith no claims yet

Pith reviewed 2026-05-09 20:39 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords rotting banditsrested banditsrestless banditsnon-stationary banditsadaptive UCBregret boundsmonotonic decay
0
0 comments X

The pith

A single algorithm achieves near-optimal regret for both rested and restless rotting bandits without knowing which setting applies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one algorithm, RAW-UCB, can handle rotting bandits in both the rested case where pulling an arm causes its reward to decay and the restless case where decay occurs regardless of pulls. It delivers near-optimal regret bounds while remaining ignorant of the specific setting and the form of non-stationarity, such as piece-wise constant drops or bounded total variation. This matters because earlier methods designed for restless rotting bandits performed poorly when applied to rested ones, restricting their use in settings like recommender systems where item value fades over repeated exposure or time. The approach centers on an adaptive window that focuses UCB estimates on recent observations to track the current decaying value of each arm.

Core claim

We introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandit, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase.

What carries the argument

Rotting Adaptive Window UCB (RAW-UCB), an adaptive-window variant of UCB that recomputes indices using only recent rewards to track decaying arm values in either rested or restless rotting environments.

If this is right

  • A single method suffices for both rested and restless rotting bandits.
  • No prior knowledge of the setting or non-stationarity type is required.
  • Near-optimal regret holds for both piece-wise constant and bounded variation decay.
  • The result does not extend to cases where rewards are permitted to increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rested versus restless distinction becomes secondary once monotonic decay is enforced.
  • Deployment in recommendation or tutoring systems becomes simpler because the source of decay need not be detected in advance.
  • The explicit contrast with increasing rewards points to monotonicity as the property enabling a universal algorithm.

Load-bearing premise

Rewards decrease monotonically over time with non-stationarity limited to piece-wise constant or bounded variation forms.

What would settle it

An instance of a rested rotting bandit with piece-wise constant decay where RAW-UCB's cumulative regret grows linearly instead of matching the known lower bound would disprove the near-optimality claim.

Figures

Figures reproduced from arXiv: 2604.21432 by Alessandro Lazaric, Julien Seznec, Michal Valko, Pierre M\'enard.

Figure 1
Figure 1. Figure 1: Left: rewards from the Yahoo! dataset for two days. Right: average regret over 500 runs. RAW-UCB vs FEWA. The two algorithms compute the same statistics and share most of their analysis. Yet, RAW-UCB consistently outperforms FEWA on the full (rested and restless) benchmark. The difference be￾tween the two is even more significant in the restless case. Moreover, RAW-UCB is also simpler to implement and fast… view at source ↗
Figure 2
Figure 2. Figure 2: Left: Regret at the end of the game for different values of L. Middle, Right: Regret across time for two values of L. Average over 2000 runs. We highlight the [10%, 90%] confidence region. performances of the two algorithms are much closer. We also notice that there is a larger variance in FEWA’s result compared to RAW-UCB. This is not surprising because we had to drastically reduce the confidence bounds t… view at source ↗
Figure 3
Figure 3. Figure 3: Left: reward functions on from the Yahoo! dataset Right: average regret of policies over 500 runs 0 25000 50000 75000 100000 125000 150000 175000 Round (t) 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 Arms Average Reward Day 2 - K=11 0 25000 50000 75000 100000 125000 150000 175000 Round (t) 0 2000 4000 6000 8000 10000 A v e r a g e r e g r e t R t EFF_RAW-UCB(α=1.4, m =1.1) EFF_RAW-UCB(α=1.4, m =2) EFF_RAW-UCB(… view at source ↗
read the original abstract

In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated to the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decrease over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandit perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandit, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces the Rotting Adaptive Window UCB (RAW-UCB) algorithm, which uses an adaptive window that resets or shrinks based on observed reward drops to achieve a unified near-optimal regret bound of the form O(sqrt(K T log T) + V) for both rested and restless rotting bandits, without requiring knowledge of the setting (rested vs. restless) or the non-stationarity type (e.g., piecewise constant or bounded variation). The analysis applies the same window rule and UCB index in both cases, with experiments on synthetic and real datasets supporting the rates.

Significance. If the central construction and unified bound hold, the result is significant: it provides the first single algorithm that matches the best-known rates in each rotting setting separately while removing the need for prior knowledge or case-specific tuning, directly addressing the separation shown in prior work (Levine et al. 2017). The machine-checked-style analysis (same concentration and union-bound arguments across cases) and reproducible experimental corroboration strengthen the contribution for applications such as recommender systems.

major comments (2)
  1. [§4] §4 (Regret Analysis): the claim that the identical adaptive-window rule yields the O(sqrt(K T log T) + V) bound in both settings requires an explicit lemma showing that the restless-case time-based decay is absorbed into the same concentration inequality used for the rested case; without this, the 'no case-specific tuning' assertion rests on an unverified step.
  2. [Theorem 1] Theorem 1 (Unified Regret Bound): the variation term V is defined differently in the rested (only pulled arms) versus restless (all arms) cases; the proof must verify that the window adaptation rule does not inadvertently require knowledge of which definition applies, or the 'without any prior knowledge' guarantee is weakened.
minor comments (3)
  1. [Figure 2] Figure 2 (Regret plots): the y-axis scaling and legend placement make it difficult to compare the RAW-UCB curve against the rested-only and restless-only baselines at small T; add a zoomed inset or separate panels.
  2. [Notation] Notation section: the symbol V is introduced without an immediate equation reference; add 'where V = sum_t |r_t(a) - r_{t-1}(a)|' (or equivalent) at first use.
  3. [§5.2] §5.2 (Real-data experiments): the description of how rotting is induced from the dataset is brief; state the exact decay model (e.g., linear in number of plays or time) so readers can reproduce the setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive recommendation. The comments help clarify the presentation of the unified analysis. We address each major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Regret Analysis): the claim that the identical adaptive-window rule yields the O(sqrt(K T log T) + V) bound in both settings requires an explicit lemma showing that the restless-case time-based decay is absorbed into the same concentration inequality used for the rested case; without this, the 'no case-specific tuning' assertion rests on an unverified step.

    Authors: We agree that an explicit lemma would improve clarity. The adaptive window resets or shrinks upon detecting a reward drop in the observed sequence for the pulled arm. This mechanism bounds the effective variation inside each window identically in both settings: in the restless case, external time-based decay manifests as an observed drop (when the arm is selected), triggering the same reset/shrink as in the rested case. The concentration and union-bound arguments therefore carry over directly. To make this transparent, we will insert a new lemma (Lemma 4.3) that isolates the absorption step for the restless decay using the identical window rule and concentration inequality. The revised manuscript will include this addition. revision: yes

  2. Referee: [Theorem 1] Theorem 1 (Unified Regret Bound): the variation term V is defined differently in the rested (only pulled arms) versus restless (all arms) cases; the proof must verify that the window adaptation rule does not inadvertently require knowledge of which definition applies, or the 'without any prior knowledge' guarantee is weakened.

    Authors: The window rule operates exclusively on the sequence of rewards observed for the arm chosen at each round; it never inspects or depends on changes to unpulled arms or on the global definition of V. Consequently, the algorithm itself requires no knowledge of whether V is measured over pulled arms only or over all arms. In the analysis we instantiate the bound with the appropriate V for each setting while keeping the algorithmic rule and its proof unchanged. We will add a short clarifying paragraph immediately after the statement of Theorem 1 that explicitly notes the observation-based nature of the rule and confirms that the same rule and proof steps apply regardless of the V definition. This is a clarification; the result and its proof are unaffected. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces RAW-UCB as a new algorithm with an adaptive window rule that is defined directly from observed reward drops, then derives a unified regret bound O(sqrt(K T log T) + V) by applying standard concentration inequalities separately to the rested and restless cases. The analysis uses the same index and window logic in both regimes without fitting parameters to the target regret or importing uniqueness results from prior self-citations; the bound follows from union bounds and variation terms that are external to the algorithm definition. No step reduces by construction to a fitted input or self-referential definition, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling assumption that rewards are non-increasing (rotting) and that the non-stationarity belongs to standard classes such as piecewise constant or bounded variation; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Rewards decrease over time in both rested and restless settings
    Stated in the abstract as the core modeling choice for rotting bandits
  • domain assumption Non-stationarity is of types such as piece-wise constant or bounded variation
    Mentioned as examples the algorithm handles without prior knowledge

pith-pipeline@v0.9.0 · 5525 in / 1332 out tokens · 41598 ms · 2026-05-09T20:39:02.025868+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 4 canonical work pages

  1. [1]

    and Bubeck, S

    Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits . In Proceedings of the Conference on Learning Theory (COLT), 2009 , pages 217--226

  2. [2]

    Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem . Machine Learning , 47(2-3):235--256

  3. [3]

    Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2003). The nonstochastic multiarmed bandit problem . SIAM Journal on Computing , 32(1):48--77

  4. [4]

    Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes . In Proceedings of the Conference on Learning Theory (COLT) , pages 138--158

  5. [5]

    C., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., L \` e bre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L

    Balouek, D., Amarie, A. C., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., L \` e bre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L. (2013). Adding Virtualization Capabilities to the Grid'5000 Testbed . In Communications in Computer and Information Science , volume 367 CCIS, pages 3-...

  6. [6]

    Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards . In Advances in Neural Information Processing Systems (NIPS) , pages 199--207

  7. [7]

    Besson, L. (2018). SMPyBandits: an Open-Source Research Framework for Single and Multi-Players Multi-Arms Bandits (MAB) Algorithms in Python . Online at: https://GitHub.com/SMPyBandits/SMPyBandits

  8. [8]

    arXiv preprint arXiv:1803.06971 , year=

    Besson, L. and Kaufmann, E. (2018). What Doubling Tricks Can and Can't Do for Multi-Armed Bandits . arXiv preprint arXiv:1803.06971

  9. [9]

    and Kaufmann, E

    Besson, L. and Kaufmann, E. (2019). The Generalized Likelihood Ratio Test meets klUCB: an Improved Algorithm for Piece-Wise Non-Stationary Bandits . arXiv preprint arXiv:1902.01575

  10. [10]

    and Gavald \` a , R

    Bifet, A. and Gavald \` a , R. (2007). Learning from time-changing data with adaptive windowing . In Proceedings of the 7th SIAM International Conference on Data Mining , pages 443--448

  11. [11]

    Cao, Y., Wen, Z., Kveton, B., and Xie, Y. (2019). Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 418--427

  12. [12]

    and Li, L

    Chapelle, O. and Li, L. (2011). An Empirical Evaluation of Thompson Sampling . In Advances in Neural Information Processing Systems (NIPS) , pages 2249--2257

  13. [13]

    Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free . In Proceedings of the Conference on Learning Theory (COLT) , pages 696--726

  14. [14]

    C., Simchi-Levi, D., and Zhu, R

    Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2019). Learning to Optimize under Non-Stationarity . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 1079--1087

  15. [15]

    Chow, Y. S. and Teicher, H. (1997). Probability theory : independence, interchangeability, martingales . Springer

  16. [16]

    Garivier, A., M \' e nard, P., and Stoltz, G. (2019). Explore first, exploit next: The true shape of regret in bandit problems . Mathematics of Operations Research , 44(2):377--399

  17. [17]

    and Moulines, E

    Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems . In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT), 2011, Espoo, Finland. , volume 6925 LNAI, pages 174--188. Springer, Berlin, Heidelberg

  18. [18]

    Heidari, H., Kearns, M., and Roth, A. (2016). Tight Policy Regret Bounds for Improving and Decaying Bandits . Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) , pages 1562--1570

  19. [19]

    and Kleinberg, R

    Immorlica, N. and Kleinberg, R. (2018). Recharging bandits . In Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS , volume 2018-Octob, pages 309--319. IEEE Computer Society

  20. [20]

    and Qin, T

    Komiyama, J. and Qin, T. (2014). Time-Decaying Bandits for Non-stationary Systems . In Liu, T.-Y., Qi, Q., and Ye, Y., editors, Web and Internet Economics (WINE) , pages 460--466, Cham. Springer International Publishing

  21. [21]

    Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules . Advances in Applied Mathematics , 6(1):4--22

  22. [22]

    and Szepesv \' a ri, C

    Lattimore, T. and Szepesv \' a ri, C. (2020). Bandit Algorithms . Cambridge University Press UK

  23. [23]

    Levine, N., Crammer, K., and Mannor, S. (2017). Rotting Bandits . In Advances in Neural Information Processing Systems (NIPS) , pages 3074--3083

  24. [24]

    Liu, F., Lee, J., and Shroff, N. B. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , pages 3651--3658

  25. [25]

    Lou \" e dec, J., Rossi, L., Chevalier, M., Garivier, A., and Mothe, J. (2016). Algorithme de bandit et obsolescence : un mod \` e le pour la recommandation (regular paper) . In Conf \' e rence francophone sur l'Apprentissage Automatique, Marseille, 05/07/2016-07/07/2016 , page (en ligne), http://www.lif.univ-mrs.fr. Laboratoire d'Informatique Fondamental...

  26. [26]

    and Maillard, O.-A

    Mukherjee, S. and Maillard, O.-A. (2019). Distribution-dependent and Time-uniform Bounds for Piecewise i.i.d Bandits . arXiv preprint arXiv:1905.13159

  27. [27]

    and Grunewalder, S

    Pike-Burke, C. and Grunewalder, S. (2019). Recovering Bandits . In Advances in Neural Information Processing Systems (NeurIPS) , pages 14122--14131

  28. [28]

    Russac, Y., Vernade, C., and Capp \' e , O. (2019). Weighted Linear Bandits for Non-Stationary Environments . In Advances in Neural Information Processing Systems (NeurIPS) , pages 12040--12049

  29. [29]

    Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. (2019). Rotting bandits are no harder than stochastic ones . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 2564--2572

  30. [30]

    Thompson, W. R. (1933). On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples . Biometrika , 25(3/4):285--294

  31. [31]

    and Rudin, C

    Trac \` a , S. and Rudin, C. (2015). Regulating Greed Over Time . arXiv preprint arXiv:1505.05629

  32. [32]

    Warlop, R., Lazaric, A., and Mary, J. (2018). Fighting Boredom in Recommender Systems with Linear Reinforcement Learning . In Advances in Neural Information Processing Systems (NeurIPS) , pages 1757--1768

  33. [33]

    Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2016). Tracking the Best Expert in Non-stationary Stochastic Environments . In Advances in Neural Information Processing Systems (NIPS) , pages 3972--3980

  34. [34]

    Whittle, P. (1980). Multi-Armed Bandits and the Gittins Index . Journal of the Royal Statistical Society. Series B (Methodological) , 42:143--149

  35. [35]

    Whittle, P. (1988). Restless bandits: activity allocation in a changing world . Journal of Applied Probability , 25(A):287--298

  36. [36]

    and Seldin, Y

    Zimmert, J. and Seldin, Y. (2019). An Optimal Algorithm for Stochastic and Adversarial Bandits . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 467--475