A single algorithm for both restless and rested rotting bandits
Pith reviewed 2026-05-09 20:39 UTC · model grok-4.3
The pith
A single algorithm achieves near-optimal regret for both rested and restless rotting bandits without knowing which setting applies.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both the rested and restless rotting bandit settings, without any prior knowledge of the setting (rested or restless) or the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar guarantees as soon as rewards are allowed to increase.
What carries the argument
Rotting Adaptive Window UCB (RAW-UCB), an adaptive-window variant of UCB that recomputes indices using only recent rewards to track decaying arm values in either rested or restless rotting environments.
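The paper's pseudocode is not reproduced on this page. As a rough illustration of the adaptive-window idea described above, the index of one arm can be sketched as the tightest upper confidence bound over all trailing windows of that arm's observed rewards; the confidence constants `sigma` and `alpha` below are illustrative assumptions, not the paper's values.

```python
import math

def raw_ucb_index(rewards, horizon, sigma=1.0, alpha=1.4):
    """Adaptive-window UCB index for one arm (sketch, not the paper's
    exact statement): for every window size h, average the h most
    recent rewards and add a confidence radius shrinking in h; the
    index is the smallest of these upper bounds, so recent decay
    tightens the estimate without any manual window tuning."""
    n = len(rewards)
    if n == 0:
        return float("inf")  # unpulled arms are tried first
    best = float("inf")
    for h in range(1, n + 1):
        mean_h = sum(rewards[-h:]) / h  # mean of the last h pulls
        radius = sigma * math.sqrt(2 * alpha * math.log(horizon) / h)
        best = min(best, mean_h + radius)
    return best
```

At each round the learner would pull the arm maximizing this index; because only observed rewards of the pulled arm enter the computation, the same rule applies whether decay is rested or restless.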
If this is right
- A single method suffices for both rested and restless rotting bandits.
- No prior knowledge of the setting or non-stationarity type is required.
- Near-optimal regret holds for both piece-wise constant and bounded variation decay.
- The result does not extend to cases where rewards are permitted to increase.
Where Pith is reading between the lines
- The rested versus restless distinction becomes secondary once monotonic decay is enforced.
- Deployment in recommendation or tutoring systems becomes simpler because the source of decay need not be detected in advance.
- The explicit contrast with increasing rewards points to monotonicity as the property enabling a universal algorithm.
Load-bearing premise
Rewards decrease monotonically over time with non-stationarity limited to piece-wise constant or bounded variation forms.
What would settle it
An instance of a rested rotting bandit with piece-wise constant decay where RAW-UCB's cumulative regret grows linearly instead of matching the known lower bound would disprove the near-optimality claim.
Original abstract
In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated with the actions tend to decrease over time. This decay is caused either by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decreases over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandits perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both the rested and restless rotting bandit settings, without any prior knowledge of the setting (rested or restless) or the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar guarantees as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Rotting Adaptive Window UCB (RAW-UCB) algorithm, which uses an adaptive window that resets or shrinks based on observed reward drops to achieve a unified near-optimal regret bound of the form O(sqrt(K T log T) + V) for both rested and restless rotting bandits, without requiring knowledge of the setting (rested vs. restless) or the non-stationarity type (e.g., piecewise constant or bounded variation). The analysis applies the same window rule and UCB index in both cases, with experiments on synthetic and real datasets supporting the rates.
Significance. If the central construction and unified bound hold, the result is significant: it provides the first single algorithm that matches the best-known rates in each rotting setting separately while removing the need for prior knowledge or case-specific tuning, directly addressing the separation shown in prior work (Levine et al. 2017). The machine-checked-style analysis (same concentration and union-bound arguments across cases) and reproducible experimental corroboration strengthen the contribution for applications such as recommender systems.
major comments (2)
- [§4] §4 (Regret Analysis): the claim that the identical adaptive-window rule yields the O(sqrt(K T log T) + V) bound in both settings requires an explicit lemma showing that the restless-case time-based decay is absorbed into the same concentration inequality used for the rested case; without this, the 'no case-specific tuning' assertion rests on an unverified step.
- [Theorem 1] Theorem 1 (Unified Regret Bound): the variation term V is defined differently in the rested (only pulled arms) versus restless (all arms) cases; the proof must verify that the window adaptation rule does not inadvertently require knowledge of which definition applies, or the 'without any prior knowledge' guarantee is weakened.
minor comments (3)
- [Figure 2] Figure 2 (Regret plots): the y-axis scaling and legend placement make it difficult to compare the RAW-UCB curve against the rested-only and restless-only baselines at small T; add a zoomed inset or separate panels.
- [Notation] Notation section: the symbol V is introduced without an immediate equation reference; add 'where V = sum_t |r_t(a) - r_{t-1}(a)|' (or equivalent) at first use.
- [§5.2] §5.2 (Real-data experiments): the description of how rotting is induced from the dataset is brief; state the exact decay model (e.g., linear in number of plays or time) so readers can reproduce the setup.
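If the definition proposed in the notation comment is adopted, the variation budget is straightforward to compute from the mean-reward sequences. The sketch below assumes a dictionary-of-sequences interface, which is an illustrative choice rather than anything specified in the paper.

```python
def variation_budget(reward_seqs):
    """Total variation V = sum over arms a and rounds t of
    |r_t(a) - r_{t-1}(a)|, the form the notation comment proposes.
    `reward_seqs` maps each arm to its mean-reward sequence over time."""
    return sum(
        abs(seq[t] - seq[t - 1])
        for seq in reward_seqs.values()
        for t in range(1, len(seq))
    )
```

Under the rested/restless distinction raised in the second major comment, the caller would restrict `reward_seqs` to pulled arms only (rested) or include all arms (restless); the computation itself is unchanged.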
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive recommendation. The comments help clarify the presentation of the unified analysis. We address each major comment below.
Point-by-point responses
Referee: [§4] §4 (Regret Analysis): the claim that the identical adaptive-window rule yields the O(sqrt(K T log T) + V) bound in both settings requires an explicit lemma showing that the restless-case time-based decay is absorbed into the same concentration inequality used for the rested case; without this, the 'no case-specific tuning' assertion rests on an unverified step.
Authors: We agree that an explicit lemma would improve clarity. The adaptive window resets or shrinks upon detecting a reward drop in the observed sequence for the pulled arm. This mechanism bounds the effective variation inside each window identically in both settings: in the restless case, external time-based decay manifests as an observed drop (when the arm is selected), triggering the same reset/shrink as in the rested case. The concentration and union-bound arguments therefore carry over directly. To make this transparent, we will insert a new lemma (Lemma 4.3) that isolates the absorption step for the restless decay using the identical window rule and concentration inequality. The revised manuscript will include this addition. revision: yes
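A minimal sketch of the drop-detection step described in this response, assuming a split-window mean comparison; the function name, the split-window statistic, and the threshold rule are hypothetical stand-ins, not the paper's exact test.

```python
def detect_drop(rewards, window, threshold):
    """Sketch of the reset/shrink trigger: compare the mean of the
    older half of the current window against the mean of the newer
    half; a gap beyond `threshold` signals an observed reward drop,
    so the caller would reset or shrink the window. Only rewards
    observed for the pulled arm are inspected, which is why the same
    rule fires for rested and restless decay alike."""
    recent = rewards[-window:]
    half = len(recent) // 2
    if half == 0:
        return False  # too few observations to compare halves
    old_mean = sum(recent[:half]) / half
    new_mean = sum(recent[-half:]) / half
    return old_mean - new_mean > threshold
```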
Referee: [Theorem 1] Theorem 1 (Unified Regret Bound): the variation term V is defined differently in the rested (only pulled arms) versus restless (all arms) cases; the proof must verify that the window adaptation rule does not inadvertently require knowledge of which definition applies, or the 'without any prior knowledge' guarantee is weakened.
Authors: The window rule operates exclusively on the sequence of rewards observed for the arm chosen at each round; it never inspects or depends on changes to unpulled arms or on the global definition of V. Consequently, the algorithm itself requires no knowledge of whether V is measured over pulled arms only or over all arms. In the analysis we instantiate the bound with the appropriate V for each setting while keeping the algorithmic rule and its proof unchanged. We will add a short clarifying paragraph immediately after the statement of Theorem 1 that explicitly notes the observation-based nature of the rule and confirms that the same rule and proof steps apply regardless of the V definition. This is a clarification; the result and its proof are unaffected. revision: partial
Circularity Check
No significant circularity detected
Full rationale
The paper introduces RAW-UCB as a new algorithm with an adaptive window rule that is defined directly from observed reward drops, then derives a unified regret bound O(sqrt(K T log T) + V) by applying standard concentration inequalities separately to the rested and restless cases. The analysis uses the same index and window logic in both regimes without fitting parameters to the target regret or importing uniqueness results from prior self-citations; the bound follows from union bounds and variation terms that are external to the algorithm definition. No step reduces by construction to a fitted input or self-referential definition, and the derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: rewards decrease over time in both rested and restless settings.
- Domain assumption: non-stationarity takes a form such as piece-wise constant or bounded variation.
Reference graph
Works this paper leans on
- [1] Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217--226.
- [2] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235--256.
- [3] Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2003). The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48--77.
- [4] Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively tracking the best bandit arm with an unknown number of distribution changes. In Proceedings of the Conference on Learning Theory (COLT), pages 138--158.
- [5] Balouek, D., Amarie, A. C., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., Lèbre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L. (2013). Adding virtualization capabilities to the Grid'5000 testbed. In Communications in Computer and Information Science, volume 367 CCIS, pages 3-...
- [6] Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic multi-armed-bandit problem with non-stationary rewards. In Advances in Neural Information Processing Systems (NIPS), pages 199--207.
- [7] Besson, L. (2018). SMPyBandits: an open-source research framework for single and multi-players multi-arms bandits (MAB) algorithms in Python. Online at: https://GitHub.com/SMPyBandits/SMPyBandits.
- [8] Besson, L. and Kaufmann, E. (2018). What doubling tricks can and can't do for multi-armed bandits. arXiv preprint arXiv:1803.06971.
- [9] Besson, L. and Kaufmann, E. (2019). The generalized likelihood ratio test meets klUCB: an improved algorithm for piece-wise non-stationary bandits. arXiv preprint arXiv:1902.01575.
- [10] Bifet, A. and Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the 7th SIAM International Conference on Data Mining, pages 443--448.
- [11] Cao, Y., Wen, Z., Kveton, B., and Xie, Y. (2019). Nearly optimal adaptive procedure with change detection for piecewise-stationary bandit. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 418--427.
- [12] Chapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems (NIPS), pages 2249--2257.
- [13] Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A new algorithm for non-stationary contextual bandits: efficient, optimal and parameter-free. In Proceedings of the Conference on Learning Theory (COLT), pages 696--726.
- [14] Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2019). Learning to optimize under non-stationarity. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1079--1087.
- [15] Chow, Y. S. and Teicher, H. (1997). Probability Theory: Independence, Interchangeability, Martingales. Springer.
- [16] Garivier, A., Ménard, P., and Stoltz, G. (2019). Explore first, exploit next: the true shape of regret in bandit problems. Mathematics of Operations Research, 44(2):377--399.
- [17] Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT), volume 6925 LNAI, pages 174--188. Springer.
- [18] Heidari, H., Kearns, M., and Roth, A. (2016). Tight policy regret bounds for improving and decaying bandits. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pages 1562--1570.
- [19] Immorlica, N. and Kleinberg, R. (2018). Recharging bandits. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 309--319. IEEE Computer Society.
- [20] Komiyama, J. and Qin, T. (2014). Time-decaying bandits for non-stationary systems. In Web and Internet Economics (WINE), pages 460--466. Springer International Publishing.
- [21] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4--22.
- [22] Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
- [23] Levine, N., Crammer, K., and Mannor, S. (2017). Rotting bandits. In Advances in Neural Information Processing Systems (NIPS), pages 3074--3083.
- [24] Liu, F., Lee, J., and Shroff, N. B. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 3651--3658.
- [25] Louëdec, J., Rossi, L., Chevalier, M., Garivier, A., and Mothe, J. (2016). Algorithme de bandit et obsolescence : un modèle pour la recommandation. In Conférence francophone sur l'Apprentissage Automatique, Marseille.
- [26] Mukherjee, S. and Maillard, O.-A. (2019). Distribution-dependent and time-uniform bounds for piecewise i.i.d bandits. arXiv preprint arXiv:1905.13159.
- [27] Pike-Burke, C. and Grunewalder, S. (2019). Recovering bandits. In Advances in Neural Information Processing Systems (NeurIPS), pages 14122--14131.
- [28] Russac, Y., Vernade, C., and Cappé, O. (2019). Weighted linear bandits for non-stationary environments. In Advances in Neural Information Processing Systems (NeurIPS), pages 12040--12049.
- [29] Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. (2019). Rotting bandits are no harder than stochastic ones. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2564--2572.
- [30] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285--294.
- [31] Tracà, S. and Rudin, C. (2015). Regulating greed over time. arXiv preprint arXiv:1505.05629.
- [32] Warlop, R., Lazaric, A., and Mary, J. (2018). Fighting boredom in recommender systems with linear reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 1757--1768.
- [33] Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2016). Tracking the best expert in non-stationary stochastic environments. In Advances in Neural Information Processing Systems (NIPS), pages 3972--3980.
- [34] Whittle, P. (1980). Multi-armed bandits and the Gittins index. Journal of the Royal Statistical Society, Series B (Methodological), 42:143--149.
- [35] Whittle, P. (1988). Restless bandits: activity allocation in a changing world. Journal of Applied Probability, 25(A):287--298.
- [36] Zimmert, J. and Seldin, Y. (2019). An optimal algorithm for stochastic and adversarial bandits. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 467--475.