arxiv: 2604.21432 · v1 · submitted 2026-04-23 · 📊 stat.ML · cs.LG

Recognition: unknown

A single algorithm for both restless and rested rotting bandits

Julien Seznec , Pierre M\'enard , Alessandro Lazaric , Michal Valko

Authors on Pith no claims yet

Pith reviewed 2026-05-09 20:39 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords rotting banditsrested banditsrestless banditsnon-stationary banditsadaptive UCBregret boundsmonotonic decay

0 comments

The pith

A single algorithm achieves near-optimal regret for both rested and restless rotting bandits without knowing which setting applies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that one algorithm, RAW-UCB, can handle rotting bandits in both the rested case where pulling an arm causes its reward to decay and the restless case where decay occurs regardless of pulls. It delivers near-optimal regret bounds while remaining ignorant of the specific setting and the form of non-stationarity, such as piece-wise constant drops or bounded total variation. This matters because earlier methods designed for restless rotting bandits performed poorly when applied to rested ones, restricting their use in settings like recommender systems where item value fades over repeated exposure or time. The approach centers on an adaptive window that focuses UCB estimates on recent observations to track the current decaying value of each arm.

Core claim

We introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandit, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase.

What carries the argument

Rotting Adaptive Window UCB (RAW-UCB), an adaptive-window variant of UCB that recomputes indices using only recent rewards to track decaying arm values in either rested or restless rotting environments.

If this is right

A single method suffices for both rested and restless rotting bandits.
No prior knowledge of the setting or non-stationarity type is required.
Near-optimal regret holds for both piece-wise constant and bounded variation decay.
The result does not extend to cases where rewards are permitted to increase.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The rested versus restless distinction becomes secondary once monotonic decay is enforced.
Deployment in recommendation or tutoring systems becomes simpler because the source of decay need not be detected in advance.
The explicit contrast with increasing rewards points to monotonicity as the property enabling a universal algorithm.

Load-bearing premise

Rewards decrease monotonically over time with non-stationarity limited to piece-wise constant or bounded variation forms.

What would settle it

An instance of a rested rotting bandit with piece-wise constant decay where RAW-UCB's cumulative regret grows linearly instead of matching the known lower bound would disprove the near-optimality claim.

Figures

Figures reproduced from arXiv: 2604.21432 by Alessandro Lazaric, Julien Seznec, Michal Valko, Pierre M\'enard.

**Figure 1.** Figure 1: Left: rewards from the Yahoo! dataset for two days. Right: average regret over 500 runs. RAW-UCB vs FEWA. The two algorithms compute the same statistics and share most of their analysis. Yet, RAW-UCB consistently outperforms FEWA on the full (rested and restless) benchmark. The difference between the two is even more significant in the restless case. Moreover, RAW-UCB is also simpler to implement and fast… view at source ↗

**Figure 2.** Figure 2: Left: Regret at the end of the game for different values of L. Middle, Right: Regret across time for two values of L. Average over 2000 runs. We highlight the [10%, 90%] confidence region. performances of the two algorithms are much closer. We also notice that there is a larger variance in FEWA’s result compared to RAW-UCB. This is not surprising because we had to drastically reduce the confidence bounds t… view at source ↗

**Figure 3.** Figure 3: Left: reward functions on from the Yahoo! dataset Right: average regret of policies over 500 runs 0 25000 50000 75000 100000 125000 150000 175000 Round (t) 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 Arms Average Reward Day 2 - K=11 0 25000 50000 75000 100000 125000 150000 175000 Round (t) 0 2000 4000 6000 8000 10000 A v e r a g e r e g r e t R t EFF_RAW-UCB(α=1.4, m =1.1) EFF_RAW-UCB(α=1.4, m =2) EFF_RAW-UCB(… view at source ↗

read the original abstract

In many application domains (e.g., recommender systems, intelligent tutoring systems), the rewards associated to the actions tend to decrease over time. This decay is either caused by the actions executed in the past (e.g., a user may get bored when songs of the same genre are recommended over and over) or by an external factor (e.g., content becomes outdated). These two situations can be modeled as specific instances of the rested and restless bandit settings, where arms are rotting (i.e., their value decrease over time). These problems were thought to be significantly different, since Levine et al. (2017) showed that state-of-the-art algorithms for restless bandit perform poorly in the rested rotting setting. In this paper, we introduce a novel algorithm, Rotting Adaptive Window UCB (RAW-UCB), that achieves near-optimal regret in both rotting rested and restless bandit, without any prior knowledge of the setting (rested or restless) and the type of non-stationarity (e.g., piece-wise constant, bounded variation). This is in striking contrast with previous negative results showing that no algorithm can achieve similar results as soon as rewards are allowed to increase. We confirm our theoretical findings on a number of synthetic and dataset-based experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RAW-UCB gives one adaptive-window rule that covers both rested and restless rotting bandits with near-optimal regret and no prior knowledge of the setting.

read the letter

RAW-UCB is a single algorithm that achieves near-optimal regret for both rested and restless rotting bandits without any prior knowledge of the setting or the non-stationarity type. That's the core advance here. The paper shows how an adaptive window UCB can reset or shrink the window when rewards drop, and the same mechanism works whether the decay comes from pulling the arm or from time passing. The regret bound splits into cases but uses the same concentration arguments, leading to something like O(sqrt(K T log T) + V) where V measures total variation. This directly addresses the negative result from Levine et al. that no single approach worked before when rewards could increase, but here the rotting assumption lets it succeed. What stands out is that the algorithm needs no tuning for rested versus restless, and it handles piece-wise constant or bounded variation without knowing which. The experiments on synthetic rotting processes and real datasets match the predicted rates. On the soft side, the log T factor in the bound might not be the absolute tightest possible, but it is still near-optimal as stated and the analysis assumes only decreasing rewards, which is explicit. I did not see circularity or hidden parameters in the construction. This is useful for anyone working on bandits with decaying rewards, like in recommendation systems where content gets stale or users get bored. A reader interested in non-stationary bandits will get value from seeing how one window rule covers both models. It deserves a serious referee because the unification is clean and the evidence from theory and experiments lines up. I'd send it to review.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces the Rotting Adaptive Window UCB (RAW-UCB) algorithm, which uses an adaptive window that resets or shrinks based on observed reward drops to achieve a unified near-optimal regret bound of the form O(sqrt(K T log T) + V) for both rested and restless rotting bandits, without requiring knowledge of the setting (rested vs. restless) or the non-stationarity type (e.g., piecewise constant or bounded variation). The analysis applies the same window rule and UCB index in both cases, with experiments on synthetic and real datasets supporting the rates.

Significance. If the central construction and unified bound hold, the result is significant: it provides the first single algorithm that matches the best-known rates in each rotting setting separately while removing the need for prior knowledge or case-specific tuning, directly addressing the separation shown in prior work (Levine et al. 2017). The machine-checked-style analysis (same concentration and union-bound arguments across cases) and reproducible experimental corroboration strengthen the contribution for applications such as recommender systems.

major comments (2)

[§4] §4 (Regret Analysis): the claim that the identical adaptive-window rule yields the O(sqrt(K T log T) + V) bound in both settings requires an explicit lemma showing that the restless-case time-based decay is absorbed into the same concentration inequality used for the rested case; without this, the 'no case-specific tuning' assertion rests on an unverified step.
[Theorem 1] Theorem 1 (Unified Regret Bound): the variation term V is defined differently in the rested (only pulled arms) versus restless (all arms) cases; the proof must verify that the window adaptation rule does not inadvertently require knowledge of which definition applies, or the 'without any prior knowledge' guarantee is weakened.

minor comments (3)

[Figure 2] Figure 2 (Regret plots): the y-axis scaling and legend placement make it difficult to compare the RAW-UCB curve against the rested-only and restless-only baselines at small T; add a zoomed inset or separate panels.
[Notation] Notation section: the symbol V is introduced without an immediate equation reference; add 'where V = sum_t |r_t(a) - r_{t-1}(a)|' (or equivalent) at first use.
[§5.2] §5.2 (Real-data experiments): the description of how rotting is induced from the dataset is brief; state the exact decay model (e.g., linear in number of plays or time) so readers can reproduce the setup.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive recommendation. The comments help clarify the presentation of the unified analysis. We address each major comment below.

read point-by-point responses

Referee: [§4] §4 (Regret Analysis): the claim that the identical adaptive-window rule yields the O(sqrt(K T log T) + V) bound in both settings requires an explicit lemma showing that the restless-case time-based decay is absorbed into the same concentration inequality used for the rested case; without this, the 'no case-specific tuning' assertion rests on an unverified step.

Authors: We agree that an explicit lemma would improve clarity. The adaptive window resets or shrinks upon detecting a reward drop in the observed sequence for the pulled arm. This mechanism bounds the effective variation inside each window identically in both settings: in the restless case, external time-based decay manifests as an observed drop (when the arm is selected), triggering the same reset/shrink as in the rested case. The concentration and union-bound arguments therefore carry over directly. To make this transparent, we will insert a new lemma (Lemma 4.3) that isolates the absorption step for the restless decay using the identical window rule and concentration inequality. The revised manuscript will include this addition. revision: yes
Referee: [Theorem 1] Theorem 1 (Unified Regret Bound): the variation term V is defined differently in the rested (only pulled arms) versus restless (all arms) cases; the proof must verify that the window adaptation rule does not inadvertently require knowledge of which definition applies, or the 'without any prior knowledge' guarantee is weakened.

Authors: The window rule operates exclusively on the sequence of rewards observed for the arm chosen at each round; it never inspects or depends on changes to unpulled arms or on the global definition of V. Consequently, the algorithm itself requires no knowledge of whether V is measured over pulled arms only or over all arms. In the analysis we instantiate the bound with the appropriate V for each setting while keeping the algorithmic rule and its proof unchanged. We will add a short clarifying paragraph immediately after the statement of Theorem 1 that explicitly notes the observation-based nature of the rule and confirms that the same rule and proof steps apply regardless of the V definition. This is a clarification; the result and its proof are unaffected. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces RAW-UCB as a new algorithm with an adaptive window rule that is defined directly from observed reward drops, then derives a unified regret bound O(sqrt(K T log T) + V) by applying standard concentration inequalities separately to the rested and restless cases. The analysis uses the same index and window logic in both regimes without fitting parameters to the target regret or importing uniqueness results from prior self-citations; the bound follows from union bounds and variation terms that are external to the algorithm definition. No step reduces by construction to a fitted input or self-referential definition, and the derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the modeling assumption that rewards are non-increasing (rotting) and that the non-stationarity belongs to standard classes such as piecewise constant or bounded variation; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption Rewards decrease over time in both rested and restless settings
Stated in the abstract as the core modeling choice for rotting bandits
domain assumption Non-stationarity is of types such as piece-wise constant or bounded variation
Mentioned as examples the algorithm handles without prior knowledge

pith-pipeline@v0.9.0 · 5525 in / 1332 out tokens · 41598 ms · 2026-05-09T20:39:02.025868+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

36 extracted references · 4 canonical work pages

[1]

and Bubeck, S

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits . In Proceedings of the Conference on Learning Theory (COLT), 2009 , pages 217--226

2009
[2]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem . Machine Learning , 47(2-3):235--256

2002
[3]

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2003). The nonstochastic multiarmed bandit problem . SIAM Journal on Computing , 32(1):48--77

2003
[4]

Auer, P., Gajane, P., and Ortner, R. (2019). Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes . In Proceedings of the Conference on Learning Theory (COLT) , pages 138--158

2019
[5]

C., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., L \` e bre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L

Balouek, D., Amarie, A. C., Charrier, G., Desprez, F., Jeannot, E., Jeanvoine, E., L \` e bre, A., Margery, D., Niclausse, N., Nussbaum, L., Richard, O., Perez, C., Quesnel, F., Rohr, C., and Sarzyniec, L. (2013). Adding Virtualization Capabilities to the Grid'5000 Testbed . In Communications in Computer and Information Science , volume 367 CCIS, pages 3-...

2013
[6]

Besbes, O., Gur, Y., and Zeevi, A. (2014). Stochastic Multi-Armed-Bandit Problem with Non-stationary Rewards . In Advances in Neural Information Processing Systems (NIPS) , pages 199--207

2014
[7]

Besson, L. (2018). SMPyBandits: an Open-Source Research Framework for Single and Multi-Players Multi-Arms Bandits (MAB) Algorithms in Python . Online at: https://GitHub.com/SMPyBandits/SMPyBandits

2018
[8]

arXiv preprint arXiv:1803.06971 , year=

Besson, L. and Kaufmann, E. (2018). What Doubling Tricks Can and Can't Do for Multi-Armed Bandits . arXiv preprint arXiv:1803.06971

work page arXiv 2018
[9]

and Kaufmann, E

Besson, L. and Kaufmann, E. (2019). The Generalized Likelihood Ratio Test meets klUCB: an Improved Algorithm for Piece-Wise Non-Stationary Bandits . arXiv preprint arXiv:1902.01575

work page arXiv 2019
[10]

and Gavald \` a , R

Bifet, A. and Gavald \` a , R. (2007). Learning from time-changing data with adaptive windowing . In Proceedings of the 7th SIAM International Conference on Data Mining , pages 443--448

2007
[11]

Cao, Y., Wen, Z., Kveton, B., and Xie, Y. (2019). Nearly Optimal Adaptive Procedure with Change Detection for Piecewise-Stationary Bandit . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 418--427

2019
[12]

and Li, L

Chapelle, O. and Li, L. (2011). An Empirical Evaluation of Thompson Sampling . In Advances in Neural Information Processing Systems (NIPS) , pages 2249--2257

2011
[13]

Chen, Y., Lee, C.-W., Luo, H., and Wei, C.-Y. (2019). A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free . In Proceedings of the Conference on Learning Theory (COLT) , pages 696--726

2019
[14]

C., Simchi-Levi, D., and Zhu, R

Cheung, W. C., Simchi-Levi, D., and Zhu, R. (2019). Learning to Optimize under Non-Stationarity . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 1079--1087

2019
[15]

Chow, Y. S. and Teicher, H. (1997). Probability theory : independence, interchangeability, martingales . Springer

1997
[16]

Garivier, A., M \' e nard, P., and Stoltz, G. (2019). Explore first, exploit next: The true shape of regret in bandit problems . Mathematics of Operations Research , 44(2):377--399

2019
[17]

and Moulines, E

Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems . In Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT), 2011, Espoo, Finland. , volume 6925 LNAI, pages 174--188. Springer, Berlin, Heidelberg

2011
[18]

Heidari, H., Kearns, M., and Roth, A. (2016). Tight Policy Regret Bounds for Improving and Decaying Bandits . Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) , pages 1562--1570

2016
[19]

and Kleinberg, R

Immorlica, N. and Kleinberg, R. (2018). Recharging bandits . In Proceedings - Annual IEEE Symposium on Foundations of Computer Science, FOCS , volume 2018-Octob, pages 309--319. IEEE Computer Society

2018
[20]

and Qin, T

Komiyama, J. and Qin, T. (2014). Time-Decaying Bandits for Non-stationary Systems . In Liu, T.-Y., Qi, Q., and Ye, Y., editors, Web and Internet Economics (WINE) , pages 460--466, Cham. Springer International Publishing

2014
[21]

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules . Advances in Applied Mathematics , 6(1):4--22

1985
[22]

and Szepesv \' a ri, C

Lattimore, T. and Szepesv \' a ri, C. (2020). Bandit Algorithms . Cambridge University Press UK

2020
[23]

Levine, N., Crammer, K., and Mannor, S. (2017). Rotting Bandits . In Advances in Neural Information Processing Systems (NIPS) , pages 3074--3083

2017
[24]

Liu, F., Lee, J., and Shroff, N. B. (2018). A change-detection based framework for piecewise-stationary multi-armed bandit problem. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , pages 3651--3658

2018
[25]

Lou \" e dec, J., Rossi, L., Chevalier, M., Garivier, A., and Mothe, J. (2016). Algorithme de bandit et obsolescence : un mod \` e le pour la recommandation (regular paper) . In Conf \' e rence francophone sur l'Apprentissage Automatique, Marseille, 05/07/2016-07/07/2016 , page (en ligne), http://www.lif.univ-mrs.fr. Laboratoire d'Informatique Fondamental...

2016
[26]

and Maillard, O.-A

Mukherjee, S. and Maillard, O.-A. (2019). Distribution-dependent and Time-uniform Bounds for Piecewise i.i.d Bandits . arXiv preprint arXiv:1905.13159

work page arXiv 2019
[27]

and Grunewalder, S

Pike-Burke, C. and Grunewalder, S. (2019). Recovering Bandits . In Advances in Neural Information Processing Systems (NeurIPS) , pages 14122--14131

2019
[28]

Russac, Y., Vernade, C., and Capp \' e , O. (2019). Weighted Linear Bandits for Non-Stationary Environments . In Advances in Neural Information Processing Systems (NeurIPS) , pages 12040--12049

2019
[29]

Seznec, J., Locatelli, A., Carpentier, A., Lazaric, A., and Valko, M. (2019). Rotting bandits are no harder than stochastic ones . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 2564--2572

2019
[30]

Thompson, W. R. (1933). On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples . Biometrika , 25(3/4):285--294

1933
[31]

and Rudin, C

Trac \` a , S. and Rudin, C. (2015). Regulating Greed Over Time . arXiv preprint arXiv:1505.05629

work page arXiv 2015
[32]

Warlop, R., Lazaric, A., and Mary, J. (2018). Fighting Boredom in Recommender Systems with Linear Reinforcement Learning . In Advances in Neural Information Processing Systems (NeurIPS) , pages 1757--1768

2018
[33]

Wei, C.-Y., Hong, Y.-T., and Lu, C.-J. (2016). Tracking the Best Expert in Non-stationary Stochastic Environments . In Advances in Neural Information Processing Systems (NIPS) , pages 3972--3980

2016
[34]

Whittle, P. (1980). Multi-Armed Bandits and the Gittins Index . Journal of the Royal Statistical Society. Series B (Methodological) , 42:143--149

1980
[35]

Whittle, P. (1988). Restless bandits: activity allocation in a changing world . Journal of Applied Probability , 25(A):287--298

1988
[36]

and Seldin, Y

Zimmert, J. and Seldin, Y. (2019). An Optimal Algorithm for Stochastic and Adversarial Bandits . In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) , pages 467--475

2019