Robust Restless Multi-Armed Bandit for Data Center Flexibility Services Through Virtual Machine Scheduling

Thomas Magnanti; Yifu Ding; Zixi Chen

arxiv: 2605.19116 · v1 · pith:UFGLHDC7new · submitted 2026-05-18 · 💻 cs.CE

Robust Restless Multi-Armed Bandit for Data Center Flexibility Services Through Virtual Machine Scheduling

Yifu Ding , Zixi Chen , Thomas Magnanti This is my paper

Pith reviewed 2026-05-20 07:15 UTC · model grok-4.3

classification 💻 cs.CE

keywords restless multi-armed banditdemand responsedata centervirtual machine schedulingWhittle indexThompson samplinggrid flexibility

0 comments

The pith

A mixed-strategy restless bandit lets grid operators request load reductions from data centers without learning their internal job schedules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to model each data center as a restless arm whose job arrivals and rescheduling transitions are learned from public virtual-machine traces. A Whittle-index policy is then derived via Thompson sampling, but the learning step is augmented with a global upper-confidence-bound term that uses trust indices to keep the policy stable when the state space grows or when context signals are noisy. If the claim holds, operators could run real-time demand-response programs at scale while data centers keep their quality-of-service loss functions private. The reported experiments indicate that the mixed strategy stays effective across different state-space sizes and beats both the pure Thompson-Whittle baseline and the EXP4 contextual bandit.

Core claim

By treating each data center as a restless arm in a Markov decision process whose transitions are estimated from open VM datasets, the authors derive Whittle-index policies via Thompson sampling and then embed a global upper-confidence-bound term with trust indices; the resulting mixed-strategy algorithm produces robust load-reduction decisions that remain effective when state spaces enlarge and when contextual information is noisy, outperforming both the pure Thompson-Whittle method and the EXP4 framework.

What carries the argument

The mixed-strategy Whittle-index policy that augments Thompson sampling with a global upper-confidence bound encoded by trust indices to accelerate and stabilize learning in large state spaces.

If this is right

Grid operators obtain real-time load-reduction decisions without needing access to proprietary rescheduling procedures inside each data center.
The policy remains effective when the number of states per data-center arm increases.
Performance advantage over pure Thompson-Whittle grows when contextual signals about job patterns become noisier.
The approach yields better regret than the EXP4 contextual-bandit baseline on the same VM traces.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same restless-arm modeling could be applied to other privacy-sensitive flexible loads such as electric-vehicle charging fleets.
Replacing the trust-index component with an adaptive mechanism for non-stationary workloads might shorten learning time when data-center utilization patterns shift rapidly.
Because the code is open-sourced, independent groups can test the method on additional public VM traces or on synthetic traces that violate the current Markov assumption.

Load-bearing premise

Job arrivals and rescheduling at each data center can be captured accurately as a restless arm whose transition function is learnable from open virtual-machine datasets.

What would settle it

Re-running the mixed-strategy algorithm on traces whose state space is doubled or whose context noise is deliberately increased and checking whether it still outperforms the pure Thompson-Whittle baseline would directly test the robustness claim.

Figures

Figures reproduced from arXiv: 2605.19116 by Thomas Magnanti, Yifu Ding, Zixi Chen.

**Figure 3.** Figure 3: Running average reward over 1000 rounds for EXP4, State [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 5.** Figure 5: Sensitivity analysis results: (a) the average reward and (b) the average [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

Energy demands from data centers have surged and stressed the grid in recent years. Electric grids require balancing supply and demand every second, motivating demand response (reduction) from large loads, including data centers. This can be achieved by rescheduling jobs on physical machines. Its real-time implementation is uncertain due to fluctuating resource utilization, and rescheduling incurs quality-of-service (QoS) losses that providers are unwilling to disclose. We propose a restless multi-arm bandit (RMAB) framework in which the grid operator requests load reductions without access to detailed job-rescheduling procedures. Using the open-source virtual machine (VM) datasets, we model job arrivals and rescheduling at each data center as a restless arm in a Markov decision process (MDP), and derive Whittle-index-based policies based on the learned transition function via Thompson sampling. To overcome the weakness of an increasingly long learning process due to an enlarged state space, we used a mixed strategy that included a global upper confidence bound (UCB) encoded with trust indices to enhance robustness and accelerate learning. Results show that the proposed mixed-strategy algorithm remains robust across varying state-space sizes and consistently outperforms the pure Thompson-Whittle (TW) algorithm, especially when contextual information is noisy. It also demonstrates superior performance compared to the state-of-the-art EXP4 framework. We provided an open-sourced code for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This applies standard restless bandits plus a mixed UCB-trust tweak to VM scheduling for grid demand response, but the robustness claims rest on an unexamined modeling step.

read the letter

The main takeaway is that the authors take the established Whittle-index approach for restless bandits and add a mixed strategy that folds in a global upper-confidence bound encoded with trust indices. They use this to schedule virtual-machine rescheduling so data centers can provide load reductions to the grid without the operator seeing internal job details. The open code and the focus on noisy context and growing state spaces are the concrete additions here.

Referee Report

2 major / 2 minor

Summary. The paper proposes a restless multi-armed bandit (RMAB) framework to enable data-center demand response for grid balancing via VM job rescheduling. Each data center is modeled as a restless arm whose job-arrival and rescheduling dynamics form an MDP; transition functions are learned from open-source VM traces via Thompson sampling, Whittle indices are computed for the resulting policies, and a mixed strategy augments the pure Thompson-Whittle (TW) policy with a global UCB term encoded by trust indices. The authors claim that the mixed policy remains robust across state-space sizes, outperforms pure TW especially under noisy context, and also beats the EXP4 baseline, with open-source code provided for reproducibility.

Significance. If the MDP modeling step accurately reflects real utilization and QoS dynamics and the reported gains are reproducible, the work could supply a practical, privacy-preserving mechanism for grid operators to elicit load reductions from data centers. The explicit release of code is a clear strength that supports verification and extension.

major comments (2)

[Abstract / Modeling section] Abstract and Modeling section: The central claim that Whittle-index policies derived from Thompson-sampled transitions produce reliable real-time decisions rests on the assumption that the chosen open-source VM datasets yield an MDP whose transition function accurately captures fluctuating utilization and QoS losses. The manuscript supplies no description of the state variables, action space, or estimation procedure (empirical frequencies, parametric fit, or direct sampling on raw traces). Without these details it is impossible to verify that the reported outperformance versus pure TW and EXP4 is attributable to the mixed strategy rather than to an artifact of the particular dataset encoding.
[Results section] Results section: The abstract asserts that the mixed-strategy algorithm 'remains robust across varying state-space sizes' and 'consistently outperforms' TW and EXP4 'especially when contextual information is noisy,' yet no experimental details—dataset statistics, number of independent runs, error bars, exact state-space cardinalities tested, or noise levels—are supplied. This absence prevents assessment of whether the robustness and superiority claims are load-bearing or merely suggestive.

minor comments (2)

[Algorithm description] The notation for the trust-index scaling factor inside the global UCB term is introduced without an explicit equation or pseudocode block; adding one would improve clarity of the mixed strategy.
[Abstract] A brief statement of the precise open-source repository or DOI for the released code would strengthen the reproducibility claim already noted in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We have prepared point-by-point responses below and will revise the manuscript accordingly to improve clarity on the modeling and experimental details.

read point-by-point responses

Referee: [Abstract / Modeling section] Abstract and Modeling section: The central claim that Whittle-index policies derived from Thompson-sampled transitions produce reliable real-time decisions rests on the assumption that the chosen open-source VM datasets yield an MDP whose transition function accurately captures fluctuating utilization and QoS losses. The manuscript supplies no description of the state variables, action space, or estimation procedure (empirical frequencies, parametric fit, or direct sampling on raw traces). Without these details it is impossible to verify that the reported outperformance versus pure TW and EXP4 is attributable to the mixed strategy rather than to an artifact of the particular dataset encoding.

Authors: We agree that the original manuscript did not include sufficient explicit description of the MDP components. In the revised version we will expand the Modeling section to specify the state variables (utilization level, queue length, and QoS penalty state), the action space (binary rescheduling decision per VM with associated QoS cost), and the transition estimation procedure (Thompson sampling over empirical frequencies derived directly from the open-source VM traces). These additions will make clear that the transition model is learned from real trace statistics rather than an arbitrary encoding, thereby supporting that the mixed-strategy gains are not dataset artifacts. revision: yes
Referee: [Results section] Results section: The abstract asserts that the mixed-strategy algorithm 'remains robust across varying state-space sizes' and 'consistently outperforms' TW and EXP4 'especially when contextual information is noisy,' yet no experimental details—dataset statistics, number of independent runs, error bars, exact state-space cardinalities tested, or noise levels—are supplied. This absence prevents assessment of whether the robustness and superiority claims are load-bearing or merely suggestive.

Authors: We acknowledge the need for fuller experimental reporting. The revised manuscript will add a dedicated experimental-setup subsection that reports: (i) summary statistics of the VM traces (number of jobs, arrival rates, utilization distributions), (ii) the exact state-space cardinalities tested (e.g., 10, 20, 50 states per arm), (iii) number of independent runs (20) with error bars showing standard deviation, and (iv) the precise noise levels injected into contextual features (Gaussian noise with variances 0.1 and 0.3). These details will allow readers to evaluate the robustness claims directly. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation uses external datasets and standard RMAB methods as independent inputs

full rationale

The paper models job arrivals and rescheduling from open-source VM datasets as restless arms in an MDP, learns transition functions via Thompson sampling, and applies Whittle-index policies plus a mixed global-UCB/trust-index strategy. Performance claims are empirical comparisons against pure TW and EXP4 baselines on those external datasets. No quoted step reduces a reported prediction or result to a fitted parameter or self-citation by construction; the chain remains self-contained against external benchmarks with no load-bearing self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on standard MDP modeling assumptions and learned transitions from public datasets; no major free parameters or invented entities are introduced beyond the algorithmic mixing weights.

free parameters (1)

Trust-index scaling factor in global UCB
Weight that blends the UCB term with Thompson sampling; chosen to accelerate learning and improve robustness.

axioms (1)

domain assumption Job arrivals and rescheduling dynamics at each data center can be represented as a restless Markov arm whose state evolves independently of the scheduling decision except when the arm is pulled.
Invoked when framing the problem as an RMAB and deriving Whittle indices.

pith-pipeline@v0.9.0 · 5779 in / 1406 out tokens · 65472 ms · 2026-05-20T07:15:52.269891+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we model job arrivals and rescheduling at each data center as a restless arm in a Markov decision process (MDP), and derive Whittle-index-based policies based on the learned transition function via Thompson sampling
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

mixed strategy that included a global upper confidence bound (UCB) encoded with trust indices

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

AI data centres as grid-interactive assets,

P. Colangelo, A. K. Coskun, J. Megrue, C. Roberts, S. Sengupta, V . Sivaram, E. Tiao, A. Vijaykar, C. Williams, D. C. Wilson, B. Records, Z. MacFarland, D. Dreiling, N. Morey, A. Ratnayake, and B. Vairamohan, “AI data centres as grid-interactive assets,”Nature Energy, vol. 11, no. 2, pp. 254–261, Dec. 2025. [Online]. Available: https://www.nature.com/arti...

work page 2025
[2]

Deadline scheduling as restless bandits,

Z. Yu, Y . Xu, and L. Tong, “Deadline scheduling as restless bandits,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). Monticello, IL, USA: IEEE, Sep. 2016, pp. 733–

work page 2016
[3]

Available: http://ieeexplore.ieee.org/document/7852304/

[Online]. Available: http://ieeexplore.ieee.org/document/7852304/

work page arXiv
[4]

Restless bandits: activity allocation in a changing world,

P. Whittle, “Restless bandits: activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988. [Online]. Available: https://doi.org/10.2307/3214163

work page doi:10.2307/3214163 1988
[5]

Index Policies for Demand Response,

J. A. Taylor and J. L. Mathieu, “Index Policies for Demand Response,” IEEE Transactions on Power Systems, vol. 29, no. 3, pp. 1287–1295, May

work page
[6]

Available: http://ieeexplore.ieee.org/document/6674094/

[Online]. Available: http://ieeexplore.ieee.org/document/6674094/

work page arXiv
[7]

Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making,

X. Chen and I.-H. Hou, “Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making,” in2024 IEEE 63rd Conference on Decision and Control (CDC). Milan, Italy: IEEE, Dec. 2024, pp. 2652–2657. [Online]. Available: https: //ieeexplore.ieee.org/document/10886713/

work page arXiv 2024
[8]

Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms,

E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini, “Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms,” inProceedings of the 26th Symposium on Operating Systems Principles. Shanghai China: ACM, Oct. 2017, pp. 153–167. [Online]. Available: https://dl.acm.org/doi/10....

work page doi:10.1145/3132747.3132772 2017
[9]

Contextual Bandit Algorithms with Supervised Learning Guarantees,

A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire, “Contextual Bandit Algorithms with Supervised Learning Guarantees,” vol. PMLR 15:19-26, 2011. [Online]. Available: https: //proceedings.mlr.press/v15/beygelzimer11a.html

work page 2011
[10]

NVIDIA A100 datasheet

NVIDIA, “NVIDIA A100 datasheet.” [Online]. Avail- able: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/ a100/pdf/nvidia-a100-datasheet.pdf V. APPENDIX A. Power consumption of VM workloads The power consumption of each VM job per core hour is modeled as a piecewise linear function of CPU/GPU uti- lization, which consist of a static componen...

work page

[1] [1]

AI data centres as grid-interactive assets,

P. Colangelo, A. K. Coskun, J. Megrue, C. Roberts, S. Sengupta, V . Sivaram, E. Tiao, A. Vijaykar, C. Williams, D. C. Wilson, B. Records, Z. MacFarland, D. Dreiling, N. Morey, A. Ratnayake, and B. Vairamohan, “AI data centres as grid-interactive assets,”Nature Energy, vol. 11, no. 2, pp. 254–261, Dec. 2025. [Online]. Available: https://www.nature.com/arti...

work page 2025

[2] [2]

Deadline scheduling as restless bandits,

Z. Yu, Y . Xu, and L. Tong, “Deadline scheduling as restless bandits,” in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). Monticello, IL, USA: IEEE, Sep. 2016, pp. 733–

work page 2016

[3] [3]

Available: http://ieeexplore.ieee.org/document/7852304/

[Online]. Available: http://ieeexplore.ieee.org/document/7852304/

work page arXiv

[4] [4]

Restless bandits: activity allocation in a changing world,

P. Whittle, “Restless bandits: activity allocation in a changing world,” Journal of Applied Probability, vol. 25, no. A, pp. 287–298, 1988. [Online]. Available: https://doi.org/10.2307/3214163

work page doi:10.2307/3214163 1988

[5] [5]

Index Policies for Demand Response,

J. A. Taylor and J. L. Mathieu, “Index Policies for Demand Response,” IEEE Transactions on Power Systems, vol. 29, no. 3, pp. 1287–1295, May

work page

[6] [6]

Available: http://ieeexplore.ieee.org/document/6674094/

[Online]. Available: http://ieeexplore.ieee.org/document/6674094/

work page arXiv

[7] [7]

Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making,

X. Chen and I.-H. Hou, “Contextual Restless Multi-Armed Bandits with Application to Demand Response Decision-Making,” in2024 IEEE 63rd Conference on Decision and Control (CDC). Milan, Italy: IEEE, Dec. 2024, pp. 2652–2657. [Online]. Available: https: //ieeexplore.ieee.org/document/10886713/

work page arXiv 2024

[8] [8]

Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms,

E. Cortez, A. Bonde, A. Muzio, M. Russinovich, M. Fontoura, and R. Bianchini, “Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms,” inProceedings of the 26th Symposium on Operating Systems Principles. Shanghai China: ACM, Oct. 2017, pp. 153–167. [Online]. Available: https://dl.acm.org/doi/10....

work page doi:10.1145/3132747.3132772 2017

[9] [9]

Contextual Bandit Algorithms with Supervised Learning Guarantees,

A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R. E. Schapire, “Contextual Bandit Algorithms with Supervised Learning Guarantees,” vol. PMLR 15:19-26, 2011. [Online]. Available: https: //proceedings.mlr.press/v15/beygelzimer11a.html

work page 2011

[10] [10]

NVIDIA A100 datasheet

NVIDIA, “NVIDIA A100 datasheet.” [Online]. Avail- able: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/ a100/pdf/nvidia-a100-datasheet.pdf V. APPENDIX A. Power consumption of VM workloads The power consumption of each VM job per core hour is modeled as a piecewise linear function of CPU/GPU uti- lization, which consist of a static componen...

work page