When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control
Pith reviewed 2026-06-29 19:06 UTC · model grok-4.3
The pith
A calibrated rule-based autoscaler beats six deep RL algorithms on cost for every tested workload in adaptive resource control.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A properly calibrated rule-based autoscaler achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck is therefore baseline calibration, reward engineering, and realistic evaluation protocols rather than algorithm selection.
What carries the argument
RLScale-Bench evaluation protocol that matches architectures, training budgets, and reward functions while comparing DRL agents directly to a calibrated rule-based baseline under cost and service-level constraints in Kubernetes HPA.
If this is right
- Baseline calibration and reward engineering matter more for performance than selecting among the six tested DRL algorithms.
- Discrete action spaces should be preferred for resource control tasks to keep constraint violations low.
- Workload-specific ranking shifts mean that algorithm choice must be revalidated when traffic patterns change.
- Distribution-shift generalization tests become essential because performance gaps appear only under certain workloads.
Where Pith is reading between the lines
- Teams facing similar control problems may obtain better results by first investing effort in tuning rule-based systems before training RL agents.
- The same calibration emphasis could apply to other continuous-control domains where simple policies are easy to adjust.
- Extending the benchmark to production traces with longer horizons would test whether the cost advantage persists outside the six synthetic patterns.
Load-bearing premise
The rule-based baseline is calibrated without any knowledge or data that the RL agents are denied, and the six workload patterns plus the Kubernetes setup represent real production environments.
What would settle it
A new workload pattern where at least one of the six DRL algorithms records both lower total cost and fewer constraint violations than the calibrated rule-based controller when both are run under identical Kubernetes HPA conditions and training budgets.
Figures
read the original abstract
A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RLScale-Bench, a reproducible benchmark for evaluating deep RL on adaptive resource control under cost and SLA constraints. It reports results from 240 runs (six DRL algorithms—PPO, DQN, A2C, SAC, TD3, DDPG—plus a rule-based baseline, six workloads, five seeds) with matched architectures and reward functions, instantiated on Kubernetes HPA. The central claims are that a properly calibrated rule-based autoscaler achieves the lowest cost on every workload, that discrete-action algorithms produce far fewer constraint violations than continuous-action ones, and that algorithm rankings vary substantially across workloads, implying that baseline calibration rather than algorithm choice is the key bottleneck.
Significance. If the calibration fairness and workload representativeness hold, the work supplies concrete evidence that strong rule-based controllers can outperform mainstream DRL on cost in this domain, together with a public benchmark and evaluation protocol that future studies can adopt. The matched training budgets, multiple seeds, and distribution-shift probes are concrete strengths that increase the reliability of the comparative claims.
major comments (2)
- [Methods (baseline calibration)] The description of the rule-based baseline calibration procedure (Methods section) supplies no information on whether thresholds or parameters were derived from the identical workload traces later used for RL evaluation, from separate held-out data, or from expert knowledge of the six patterns. Because the headline result—that the baseline beats all six DRL algorithms on cost—rests on the claim of fair calibration, this omission is load-bearing.
- [Results (cost table)] Table reporting per-workload costs (Results) states that the baseline is lowest on all six workloads, yet no statistical test or confidence interval across the five seeds is provided to support that the observed differences are reliable rather than within-run variance.
minor comments (2)
- [Abstract] The abstract states '240 runs' but the arithmetic (6 algorithms imes 6 workloads imes 5 seeds) yields 180; clarify whether the baseline contributes additional runs or whether the count includes something else.
- [Experimental setup] Workload patterns are labeled 'bursty and flash traffic' without quantitative definitions or pointers to the exact trace files; adding these would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their insightful comments on methodological transparency and statistical support. We address each major comment below and will update the manuscript accordingly.
read point-by-point responses
-
Referee: [Methods (baseline calibration)] The description of the rule-based baseline calibration procedure (Methods section) supplies no information on whether thresholds or parameters were derived from the identical workload traces later used for RL evaluation, from separate held-out data, or from expert knowledge of the six patterns. Because the headline result—that the baseline beats all six DRL algorithms on cost—rests on the claim of fair calibration, this omission is load-bearing.
Authors: We agree with the referee that the calibration procedure should be explicitly documented. We will revise the Methods section to provide a complete description of how the rule-based baseline parameters were determined, including the data sources used for calibration. revision: yes
-
Referee: [Results (cost table)] Table reporting per-workload costs (Results) states that the baseline is lowest on all six workloads, yet no statistical test or confidence interval across the five seeds is provided to support that the observed differences are reliable rather than within-run variance.
Authors: We concur that providing measures of variability and statistical significance would strengthen the results. In the revised version, we will augment the cost table with 95% confidence intervals across the five seeds and include the results of paired statistical tests (e.g., t-tests) comparing the baseline to each DRL algorithm per workload. revision: yes
Circularity Check
No circularity: empirical benchmark comparison stands on independent calibration and experimental runs
full rationale
The paper reports results from 240 experimental runs comparing six DRL algorithms against a rule-based baseline on six workloads. The central claim (baseline achieves lowest cost) is an empirical outcome from those runs rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to force the result; the calibration step is described as external to the RL loops, and the evaluation protocol is presented as reproducible and matched across methods.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The six workload patterns and the Kubernetes HPA environment are representative of production adaptive resource control tasks.
- domain assumption The rule-based baseline calibration is performed without access to the same training or evaluation data used by the RL agents.
Reference graph
Works this paper leans on
-
[1]
Brockman, G., Cheung, V ., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym. arXiv preprint arXiv:1606.01540,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S
doi: 10.1016/j.engappai.2021.104288. Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor- critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational Con- ference on Machine Learning, pp. 1861–1870. PMLR,
-
[3]
Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control
arXiv:1708.04133. 7 When Does Deep RL Beat Calibrated Baselines? KEDA Authors. KEDA: Kubernetes-based event driven autoscaling. https://keda.sh/,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Kubernetes Authors
Accessed: 2026-03-15. Kubernetes Authors. Kubernetes horizontal pod autoscaler. https://kubernetes. io/docs/tasks/run-application/ horizontal-pod-autoscale/,
2026
-
[5]
Continuous control with deep reinforcement learning
Accessed: 2026-03-15. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y ., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning.arXiv preprint arXiv:1509.02971,
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[6]
Tor Lattimore, Csaba Szepesvari, and Gellert Weisz
doi: 10.1038/nature14236. Mnih, V ., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn- chronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–
-
[7]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. In arXiv preprint arXiv:1707.06347,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
A sampling-neighborhood- regularized latent factorization of tensor for dynamic qos estimation,
doi: 10.1109/TNSM. 2021.3052837. Wang, Z., Zhu, S., Li, J., Jiang, W., Ramakrishnan, K. K., Zheng, Y ., Yan, M., Zhang, X., and Liu, A. X. Deep- Scaling: Microservices autoscaling for stable CPU uti- lization in large scale cloud systems. InACM Sympo- sium on Cloud Computing (SoCC), pp. 16–30,
-
[9]
Zhang, G., Guo, W., Tan, Z., Guan, Q., and Jiang, H
doi: 10.1145/3542929.3563469. Zhang, G., Guo, W., Tan, Z., Guan, Q., and Jiang, H. KIS-S: A GPU-aware Kubernetes inference simulator with RL- based auto-scaling. InIEEE International Performance, Computing, and Communications Conference (IPCCC), 2025a. doi: 10.1109/IPCCC62082.2025.11304654. Zhang, G., Guo, W., Tan, Z., and Jiang, H. AMP4EC: Adaptive model...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.