Robust Restless Multi-Armed Bandit for Data Center Flexibility Services Through Virtual Machine Scheduling
Pith reviewed 2026-05-20 07:15 UTC · model grok-4.3
The pith
A mixed-strategy restless bandit lets grid operators request load reductions from data centers without learning their internal job schedules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating each data center as a restless arm in a Markov decision process whose transitions are estimated from open VM datasets, the authors derive Whittle-index policies via Thompson sampling and then embed a global upper-confidence-bound term with trust indices; the resulting mixed-strategy algorithm produces robust load-reduction decisions that remain effective when state spaces enlarge and when contextual information is noisy, outperforming both the pure Thompson-Whittle method and the EXP4 framework.
What carries the argument
The mixed-strategy Whittle-index policy that augments Thompson sampling with a global upper-confidence bound encoded by trust indices to accelerate and stabilize learning in large state spaces.
If this is right
- Grid operators obtain real-time load-reduction decisions without needing access to proprietary rescheduling procedures inside each data center.
- The policy remains effective when the number of states per data-center arm increases.
- Performance advantage over pure Thompson-Whittle grows when contextual signals about job patterns become noisier.
- The approach yields better regret than the EXP4 contextual-bandit baseline on the same VM traces.
Where Pith is reading between the lines
- The same restless-arm modeling could be applied to other privacy-sensitive flexible loads such as electric-vehicle charging fleets.
- Replacing the trust-index component with an adaptive mechanism for non-stationary workloads might shorten learning time when data-center utilization patterns shift rapidly.
- Because the code is open-sourced, independent groups can test the method on additional public VM traces or on synthetic traces that violate the current Markov assumption.
Load-bearing premise
Job arrivals and rescheduling at each data center can be captured accurately as a restless arm whose transition function is learnable from open virtual-machine datasets.
What would settle it
Re-running the mixed-strategy algorithm on traces whose state space is doubled or whose context noise is deliberately increased and checking whether it still outperforms the pure Thompson-Whittle baseline would directly test the robustness claim.
Figures
read the original abstract
Energy demands from data centers have surged and stressed the grid in recent years. Electric grids require balancing supply and demand every second, motivating demand response (reduction) from large loads, including data centers. This can be achieved by rescheduling jobs on physical machines. Its real-time implementation is uncertain due to fluctuating resource utilization, and rescheduling incurs quality-of-service (QoS) losses that providers are unwilling to disclose. We propose a restless multi-arm bandit (RMAB) framework in which the grid operator requests load reductions without access to detailed job-rescheduling procedures. Using the open-source virtual machine (VM) datasets, we model job arrivals and rescheduling at each data center as a restless arm in a Markov decision process (MDP), and derive Whittle-index-based policies based on the learned transition function via Thompson sampling. To overcome the weakness of an increasingly long learning process due to an enlarged state space, we used a mixed strategy that included a global upper confidence bound (UCB) encoded with trust indices to enhance robustness and accelerate learning. Results show that the proposed mixed-strategy algorithm remains robust across varying state-space sizes and consistently outperforms the pure Thompson-Whittle (TW) algorithm, especially when contextual information is noisy. It also demonstrates superior performance compared to the state-of-the-art EXP4 framework. We provided an open-sourced code for reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a restless multi-armed bandit (RMAB) framework to enable data-center demand response for grid balancing via VM job rescheduling. Each data center is modeled as a restless arm whose job-arrival and rescheduling dynamics form an MDP; transition functions are learned from open-source VM traces via Thompson sampling, Whittle indices are computed for the resulting policies, and a mixed strategy augments the pure Thompson-Whittle (TW) policy with a global UCB term encoded by trust indices. The authors claim that the mixed policy remains robust across state-space sizes, outperforms pure TW especially under noisy context, and also beats the EXP4 baseline, with open-source code provided for reproducibility.
Significance. If the MDP modeling step accurately reflects real utilization and QoS dynamics and the reported gains are reproducible, the work could supply a practical, privacy-preserving mechanism for grid operators to elicit load reductions from data centers. The explicit release of code is a clear strength that supports verification and extension.
major comments (2)
- [Abstract / Modeling section] Abstract and Modeling section: The central claim that Whittle-index policies derived from Thompson-sampled transitions produce reliable real-time decisions rests on the assumption that the chosen open-source VM datasets yield an MDP whose transition function accurately captures fluctuating utilization and QoS losses. The manuscript supplies no description of the state variables, action space, or estimation procedure (empirical frequencies, parametric fit, or direct sampling on raw traces). Without these details it is impossible to verify that the reported outperformance versus pure TW and EXP4 is attributable to the mixed strategy rather than to an artifact of the particular dataset encoding.
- [Results section] Results section: The abstract asserts that the mixed-strategy algorithm 'remains robust across varying state-space sizes' and 'consistently outperforms' TW and EXP4 'especially when contextual information is noisy,' yet no experimental details—dataset statistics, number of independent runs, error bars, exact state-space cardinalities tested, or noise levels—are supplied. This absence prevents assessment of whether the robustness and superiority claims are load-bearing or merely suggestive.
minor comments (2)
- [Algorithm description] The notation for the trust-index scaling factor inside the global UCB term is introduced without an explicit equation or pseudocode block; adding one would improve clarity of the mixed strategy.
- [Abstract] A brief statement of the precise open-source repository or DOI for the released code would strengthen the reproducibility claim already noted in the abstract.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We have prepared point-by-point responses below and will revise the manuscript accordingly to improve clarity on the modeling and experimental details.
read point-by-point responses
-
Referee: [Abstract / Modeling section] Abstract and Modeling section: The central claim that Whittle-index policies derived from Thompson-sampled transitions produce reliable real-time decisions rests on the assumption that the chosen open-source VM datasets yield an MDP whose transition function accurately captures fluctuating utilization and QoS losses. The manuscript supplies no description of the state variables, action space, or estimation procedure (empirical frequencies, parametric fit, or direct sampling on raw traces). Without these details it is impossible to verify that the reported outperformance versus pure TW and EXP4 is attributable to the mixed strategy rather than to an artifact of the particular dataset encoding.
Authors: We agree that the original manuscript did not include sufficient explicit description of the MDP components. In the revised version we will expand the Modeling section to specify the state variables (utilization level, queue length, and QoS penalty state), the action space (binary rescheduling decision per VM with associated QoS cost), and the transition estimation procedure (Thompson sampling over empirical frequencies derived directly from the open-source VM traces). These additions will make clear that the transition model is learned from real trace statistics rather than an arbitrary encoding, thereby supporting that the mixed-strategy gains are not dataset artifacts. revision: yes
-
Referee: [Results section] Results section: The abstract asserts that the mixed-strategy algorithm 'remains robust across varying state-space sizes' and 'consistently outperforms' TW and EXP4 'especially when contextual information is noisy,' yet no experimental details—dataset statistics, number of independent runs, error bars, exact state-space cardinalities tested, or noise levels—are supplied. This absence prevents assessment of whether the robustness and superiority claims are load-bearing or merely suggestive.
Authors: We acknowledge the need for fuller experimental reporting. The revised manuscript will add a dedicated experimental-setup subsection that reports: (i) summary statistics of the VM traces (number of jobs, arrival rates, utilization distributions), (ii) the exact state-space cardinalities tested (e.g., 10, 20, 50 states per arm), (iii) number of independent runs (20) with error bars showing standard deviation, and (iv) the precise noise levels injected into contextual features (Gaussian noise with variances 0.1 and 0.3). These details will allow readers to evaluate the robustness claims directly. revision: yes
Circularity Check
No circularity: derivation uses external datasets and standard RMAB methods as independent inputs
full rationale
The paper models job arrivals and rescheduling from open-source VM datasets as restless arms in an MDP, learns transition functions via Thompson sampling, and applies Whittle-index policies plus a mixed global-UCB/trust-index strategy. Performance claims are empirical comparisons against pure TW and EXP4 baselines on those external datasets. No quoted step reduces a reported prediction or result to a fitted parameter or self-citation by construction; the chain remains self-contained against external benchmarks with no load-bearing self-referential definitions or renamings.
Axiom & Free-Parameter Ledger
free parameters (1)
- Trust-index scaling factor in global UCB
axioms (1)
- domain assumption Job arrivals and rescheduling dynamics at each data center can be represented as a restless Markov arm whose state evolves independently of the scheduling decision except when the arm is pulled.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we model job arrivals and rescheduling at each data center as a restless arm in a Markov decision process (MDP), and derive Whittle-index-based policies based on the learned transition function via Thompson sampling
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
mixed strategy that included a global upper confidence bound (UCB) encoded with trust indices
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
AI data centres as grid-interactive assets,
P. Colangelo, A. K. Coskun, J. Megrue, C. Roberts, S. Sengupta, V . Sivaram, E. Tiao, A. Vijaykar, C. Williams, D. C. Wilson, B. Records, Z. MacFarland, D. Dreiling, N. Morey, A. Ratnayake, and B. Vairamohan, “AI data centres as grid-interactive assets,”Nature Energy, vol. 11, no. 2, pp. 254–261, Dec. 2025. [Online]. Available: https://www.nature.com/arti...
work page 2025
-
[2]
Deadline scheduling as restless bandits,
Z. Yu, Y . Xu, and L. Tong, “Deadline scheduling as restless bandits,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). Monticello, IL, USA: IEEE, Sep. 2016, pp. 733–
work page 2016
-
[3]
Available: http://ieeexplore.ieee.org/document/7852304/
[Online]. Available: http://ieeexplore.ieee.org/document/7852304/
-
[4]
Restless bandits: activity allocation in a changing world,
P. Whittle, “Restless bandits: activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988. [Online]. Available: https://doi.org/10.2307/3214163
-
[5]
Index Policies for Demand Response,
J. A. Taylor and J. L. Mathieu, “Index Policies for Demand Response,” IEEE Transactions on Power Systems, vol. 29, no. 3, pp. 1287–1295, May
-
[6]
Available: http://ieeexplore.ieee.org/document/6674094/
[Online]. Available: http://ieeexplore.ieee.org/document/6674094/
-
[7]
Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making,
X. Chen and I.-H. Hou, “Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making,” in2024 IEEE 63rd Conference on Decision and Control (CDC). Milan, Italy: IEEE, Dec. 2024, pp. 2652–2657. [Online]. Available: https: //ieeexplore.ieee.org/document/10886713/
-
[8]
E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini, “Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms,” inProceedings of the 26th Symposium on Operating Systems Principles. Shanghai China: ACM, Oct. 2017, pp. 153–167. [Online]. Available: https://dl.acm.org/doi/10....
-
[9]
Contextual Bandit Algorithms with Supervised Learning Guarantees,
A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire, “Contextual Bandit Algorithms with Supervised Learning Guarantees,” vol. PMLR 15:19-26, 2011. [Online]. Available: https: //proceedings.mlr.press/v15/beygelzimer11a.html
work page 2011
-
[10]
NVIDIA, “NVIDIA A100 datasheet.” [Online]. Avail- able: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/ a100/pdf/nvidia-a100-datasheet.pdf V. APPENDIX A. Power consumption of VM workloads The power consumption of each VM job per core hour is modeled as a piecewise linear function of CPU/GPU uti- lization, which consist of a static componen...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.