When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

Chuanyi Sun; Guilin Zhang; John Fossaceca; Kai Zhao; Shahryar Sarkani

arXiv: 2605.26418 · v2 · pith:6THIRS4Qnew · submitted 2026-05-26 · 💻 cs.LG · cs.AI · cs.DC

Guilin Zhang, Chuanyi Sun, Kai Zhao, Shahryar Sarkani, John Fossaceca

Pith reviewed 2026-06-29 19:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DC

keywords: deep reinforcement learning, autoscaling, resource allocation, benchmark, rule-based control, Kubernetes, adaptive systems, cost optimization

The pith

A calibrated rule-based autoscaler beats six deep RL algorithms on cost for every tested workload in adaptive resource control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks when deep reinforcement learning actually improves adaptive resource allocation over simpler methods. It sets up RLScale-Bench to run six DRL algorithms against a tuned rule-based controller on Kubernetes Horizontal Pod Autoscaling, using six workload patterns and five random seeds. Results show the rule-based approach records the lowest cost in all cases while discrete-action RL methods violate constraints far less often than continuous ones. Rankings among algorithms shift with workload type, so no single DRL choice wins across the board. Readers care because the work points to calibration and evaluation protocol as the real limits rather than which RL variant is picked.

Core claim

A properly calibrated rule-based autoscaler achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck is therefore baseline calibration, reward engineering, and realistic evaluation protocols rather than algorithm selection.
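
The rank-shift statement can be made concrete with a small sketch. The violation counts below are hypothetical placeholders, not the paper's results; only the procedure is illustrated: rank the algorithms within each workload, then take each algorithm's spread between its best and worst rank. The function names are likewise illustrative.

```python
# Hypothetical SLO-violation counts per workload (NOT the paper's data).
violations = {
    "variable": {"HPA": 12, "PPO": 9, "DQN": 20, "A2C": 15, "SAC": 5},
    "ramp":     {"HPA": 4,  "PPO": 8, "DQN": 18, "A2C": 6,  "SAC": 14},
}


def ranks(scores):
    """Rank 1 = fewest violations; ties are broken by name for determinism."""
    ordered = sorted(scores, key=lambda name: (scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def max_rank_shift(per_workload):
    """For each algorithm, worst rank minus best rank across all workloads."""
    per_workload_ranks = [ranks(scores) for scores in per_workload.values()]
    algorithms = per_workload_ranks[0].keys()
    return {
        algo: max(r[algo] for r in per_workload_ranks)
        - min(r[algo] for r in per_workload_ranks)
        for algo in algorithms
    }


shift = max_rank_shift(violations)
```

With these placeholder numbers one algorithm moves three positions between the two workloads while another does not move at all, which is the kind of instability the claim describes.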

What carries the argument

RLScale-Bench evaluation protocol that matches architectures, training budgets, and reward functions while comparing DRL agents directly to a calibrated rule-based baseline under cost and service-level constraints in Kubernetes HPA.
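
For orientation, the stock Kubernetes Horizontal Pod Autoscaler rule can be sketched as below: desired replicas are the current count scaled by the ratio of observed to target metric, rounded up, with no change when the ratio is within a tolerance (0.1 by default in Kubernetes). This is the general HPA rule, not necessarily the paper's calibrated controller, whose thresholds are not given here. The function name and the `min_replicas`/`max_replicas` defaults are assumptions chosen for illustration.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10, tolerance=0.1):
    """Sketch of the standard HPA scaling rule (bounds are illustrative)."""
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the deployment alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

Under this reading, "calibration" would amount to choosing values such as `target_metric` and the replica bounds. How the authors actually chose them is the open question raised later in the review.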

If this is right

Baseline calibration and reward engineering matter more for performance than selecting among the six tested DRL algorithms.
Discrete action spaces should be preferred for resource control tasks to keep constraint violations low.
Workload-specific ranking shifts mean that algorithm choice must be revalidated when traffic patterns change.
Distribution-shift generalization tests become essential because performance gaps appear only under certain workloads.
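
One way to picture the action-space mismatch behind the second point: replica counts are integers, so a continuous-action policy's output must be mapped onto them. The two mappings below are assumptions made for illustration. The source does not describe how the benchmark converts actions, and the names and replica bounds are hypothetical.

```python
def apply_discrete(replicas, action, lo=1, hi=10):
    """Discrete action in {0, 1, 2}: scale down, hold, or scale up by one."""
    delta = (-1, 0, 1)[action]
    return max(lo, min(hi, replicas + delta))


def apply_continuous(action, lo=1, hi=10):
    """Continuous action in [-1, 1]: affine map to a replica count, then round."""
    clipped = max(-1.0, min(1.0, action))
    return lo + round((clipped + 1.0) / 2.0 * (hi - lo))
```

In this sketch a discrete policy can only move one replica per step, whereas a continuous output that saturates near its lower bound pins the deployment at `lo` replicas regardless of load. That is one plausible picture of a degenerate single-replica policy, not a confirmed account of the paper's agents.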

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams facing similar control problems may obtain better results by first investing effort in tuning rule-based systems before training RL agents.
The same calibration emphasis could apply to other continuous-control domains where simple policies are easy to adjust.
Extending the benchmark to production traces with longer horizons would test whether the cost advantage persists outside the six synthetic patterns.

Load-bearing premise

The rule-based baseline is calibrated without any knowledge or data that the RL agents are denied, and the six workload patterns plus the Kubernetes setup represent real production environments.

What would settle it

A new workload pattern where at least one of the six DRL algorithms records both lower total cost and fewer constraint violations than the calibrated rule-based controller when both are run under identical Kubernetes HPA conditions and training budgets.

Figures

Figures reproduced from arXiv: 2605.26418 by Chuanyi Sun, Guilin Zhang, John Fossaceca, Kai Zhao, Shahryar Sarkani.

**Figure 1.** RLSCALE-BENCH pipeline. The six stages realize our contributions: matched RL agents (C1), calibrated HPA baseline (C2), 5-seed training and 240-run evaluation (C3), deployment on five shifted workloads (C4), and the three counter-intuitive findings (C5). Adjacent text from the paper (truncated): "… deployments. We address these gaps with RLSCALE-BENCH, a benchmark that follows the reproducible-evaluation principles advocated by Agarwal et al. (2021) …"

**Figure 2.** Discrete-action algorithms (PPO, DQN, A2C) achieve dramatically lower SLO violations than continuous-action algorithms (SAC, TD3, DDPG). The continuous family's median SLO count is >100× higher. Adjacent text from the paper (truncated): "… on cost-dominated ones, together forming the cost–SLO frontier analyzed in Appendix D. Notably, DDPG achieves a perfect 1.00 (worst) on every non-constant workload due to its degenerate single-replica policy, conf…"

**Figure 3.** Composite performance heatmap (0 = best, 1 = worst). HPA and SAC form the Pareto frontier among viable algorithms. DDPG's degenerate policy scores 1.00 on all dynamic workloads. Adjacent text from the paper (truncated): "… ranking (rank 2–3); DQN is consistently the weakest viable algorithm (rank 4–5); SAC shows the widest rank variance, excelling on variable workloads (rank 1) but performing poorly on ramp traffic (rank 4). This instability carries …"

**Figure 4.** Algorithm ranking (by SLO violations) shifts across workloads. No algorithm maintains rank 1 on all patterns. Only viable algorithms shown (TD3, DDPG excluded). Adjacent text from the paper (truncated): "… training distribution systematically overestimate deployed performance. 4.5. Distribution-Shift Generalization. All agents were trained on the variable workload (a random-walk trace designed to expose the policy to diverse conditions) and deployed on …"

**Figure 6.** Cost-SLO scatter for all algorithms. Each point represents one (algorithm, workload) pair. The Pareto front connects methods that are not dominated on both cost and SLO simultaneously. Panel (a): All Algorithms, with Total Cost (USD) on the x-axis and SLO Violations (symlog) on the y-axis; a second panel plots the same axes on a linear scale (panel title truncated in source).

**Figure 5.** Complete cost and SLO comparison for viable algorithms (HPA, PPO, DQN, A2C, SAC) across all six workloads (Constant, Periodic, Variable, Bursty, Ramp, Flash) with 95% confidence intervals. Panel (a): Total Cost (USD). Panel (b): SLO Violations.
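
The Pareto front mentioned in the Figure 6 caption (methods not dominated on both cost and SLO violations) can be computed with a short dominance check. The sketch below uses hypothetical (cost, violations) pairs, chosen only so that the outcome qualitatively matches the Figure 3 statement that HPA and SAC form the frontier among viable algorithms. They are not values from the paper.

```python
# Hypothetical (total cost in USD, SLO violations) per algorithm; NOT the paper's data.
points = {
    "HPA": (0.0020, 12),
    "PPO": (0.0036, 9),
    "DQN": (0.0040, 20),
    "A2C": (0.0038, 15),
    "SAC": (0.0035, 5),
}


def pareto_front(points):
    """Return names not dominated by any other point (lower is better on both axes)."""
    front = []
    for name, (cost, slo) in points.items():
        dominated = any(
            # The other point is no worse on both axes and strictly better on one.
            (c2 <= cost and s2 <= slo) and (c2 < cost or s2 < slo)
            for other, (c2, s2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the front by increasing cost for plotting.
    return sorted(front, key=lambda n: points[n])


front = pareto_front(points)  # -> ['HPA', 'SAC']
```

The same check expresses the "What would settle it" criterion: a DRL agent that beats the rule-based controller on both cost and violations for some workload would dominate it and remove it from that workload's front.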

Original abstract

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper finds a calibrated rule-based baseline beating six DRL algorithms on cost across workloads in a new benchmark, with the main open issue being whether that calibration was done independently.

The main thing to know is that this paper claims a properly tuned rule-based autoscaler beats PPO, DQN, A2C, SAC, TD3, and DDPG on cost in every one of their six workloads, while also showing that discrete-action RL methods avoid constraint violations much better than continuous ones.

They introduce RLScale-Bench with matched architectures, training budgets, reward functions, five seeds, and 240 total runs on Kubernetes HPA. The results include workload-specific ranking shifts and some distribution-shift tests. This setup is new and directly addresses the common assumption that DRL will outperform once you have decent baselines. The action-space mismatch finding is concrete and not something I recall from the prior work they cite.

The paper does a solid job on reproducibility and controlled comparison. Running multiple algorithms under the same conditions and reporting the baseline wins on cost gives a clear counter-example to the default narrative.

The soft spot is the baseline calibration. The abstract calls the baseline properly calibrated but does not describe how the thresholds were chosen or whether that process used the same workload traces later used for RL evaluation. If the calibration effectively baked in workload-specific knowledge unavailable to the agents, the cost comparison does not hold. The representativeness of the six patterns is a secondary concern that follows from this.

This is for people working on RL for resource management or autoscaling. A reader who wants a reproducible benchmark and evidence on when DRL helps versus when a tuned controller suffices will find it useful. It deserves a serious referee to check the calibration procedure and full statistical reporting.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RLScale-Bench, a reproducible benchmark for evaluating deep RL on adaptive resource control under cost and SLA constraints. It reports results from 240 runs (six DRL algorithms—PPO, DQN, A2C, SAC, TD3, DDPG—plus a rule-based baseline, six workloads, five seeds) with matched architectures and reward functions, instantiated on Kubernetes HPA. The central claims are that a properly calibrated rule-based autoscaler achieves the lowest cost on every workload, that discrete-action algorithms produce far fewer constraint violations than continuous-action ones, and that algorithm rankings vary substantially across workloads, implying that baseline calibration rather than algorithm choice is the key bottleneck.

Significance. If the calibration fairness and workload representativeness hold, the work supplies concrete evidence that strong rule-based controllers can outperform mainstream DRL on cost in this domain, together with a public benchmark and evaluation protocol that future studies can adopt. The matched training budgets, multiple seeds, and distribution-shift probes are concrete strengths that increase the reliability of the comparative claims.

major comments (2)

[Methods (baseline calibration)] The description of the rule-based baseline calibration procedure (Methods section) supplies no information on whether thresholds or parameters were derived from the identical workload traces later used for RL evaluation, from separate held-out data, or from expert knowledge of the six patterns. Because the headline result—that the baseline beats all six DRL algorithms on cost—rests on the claim of fair calibration, this omission is load-bearing.
[Results (cost table)] Table reporting per-workload costs (Results) states that the baseline is lowest on all six workloads, yet no statistical test or confidence interval across the five seeds is provided to support that the observed differences are reliable rather than within-run variance.

minor comments (2)

[Abstract] The abstract states '240 runs' but the arithmetic (6 algorithms × 6 workloads × 5 seeds) yields 180; clarify whether the baseline contributes additional runs or whether the count includes something else.
[Experimental setup] Workload patterns are labeled 'bursty and flash traffic' without quantitative definitions or pointers to the exact trace files; adding these would improve reproducibility.
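
The run-count question in the first minor comment is easy to check by hand. The sketch below uses the algorithm names from the abstract and the workload names from the Figure 5 axis labels; the variable names are illustrative.

```python
algorithms = ["PPO", "DQN", "A2C", "SAC", "TD3", "DDPG"]
workloads = ["constant", "periodic", "variable", "bursty", "ramp", "flash"]
seeds = 5
reported = 240  # run count stated in the abstract

# Six DRL algorithms only.
drl_runs = len(algorithms) * len(workloads) * seeds             # 180
# Counting the rule-based baseline as a seventh method.
with_baseline = (len(algorithms) + 1) * len(workloads) * seeds  # 210
# Runs still unaccounted for after including the baseline.
gap = reported - with_baseline                                  # 30
```

Neither 180 nor 210 matches the reported 240, so including the baseline alone does not close the gap. The text available here does not say what the remaining 30 runs are.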

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on methodological transparency and statistical support. We address each major comment below and will update the manuscript accordingly.

Referee: [Methods (baseline calibration)] The description of the rule-based baseline calibration procedure (Methods section) supplies no information on whether thresholds or parameters were derived from the identical workload traces later used for RL evaluation, from separate held-out data, or from expert knowledge of the six patterns. Because the headline result—that the baseline beats all six DRL algorithms on cost—rests on the claim of fair calibration, this omission is load-bearing.

Authors: We agree with the referee that the calibration procedure should be explicitly documented. We will revise the Methods section to provide a complete description of how the rule-based baseline parameters were determined, including the data sources used for calibration.
Revision: yes

Referee: [Results (cost table)] Table reporting per-workload costs (Results) states that the baseline is lowest on all six workloads, yet no statistical test or confidence interval across the five seeds is provided to support that the observed differences are reliable rather than within-run variance.

Authors: We concur that providing measures of variability and statistical significance would strengthen the results. In the revised version, we will augment the cost table with 95% confidence intervals across the five seeds and include the results of paired statistical tests (e.g., t-tests) comparing the baseline to each DRL algorithm per workload.
Revision: yes
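
A minimal sketch of the analysis promised in the second response, using only the Python standard library. The per-seed costs are hypothetical placeholders, not results from the paper. Pairing by seed assumes that a given seed produces the same workload trace for both methods, which the source does not confirm. The critical value 2.776 is the two-sided 95% point of Student's t with 4 degrees of freedom, so the helper is restricted to five seeds.

```python
import math
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom


def paired_summary(baseline, agent):
    """Mean paired difference (agent - baseline), its 95% CI, and the t statistic."""
    if len(baseline) != 5 or len(agent) != 5:
        raise ValueError("this sketch hard-codes the critical value for five seeds")
    diffs = [a - b for a, b in zip(agent, baseline)]
    mean_d = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation).
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    half_width = T_CRIT_DF4 * se
    return mean_d, (mean_d - half_width, mean_d + half_width), mean_d / se


# Hypothetical total cost (USD) per seed on one workload; NOT the paper's data.
hpa_cost = [0.0020, 0.0021, 0.0019, 0.0020, 0.0022]
ppo_cost = [0.0026, 0.0025, 0.0027, 0.0024, 0.0028]

mean_d, ci, t_stat = paired_summary(hpa_cost, ppo_cost)
```

If the confidence interval for the difference excludes zero, the cost gap on that workload is unlikely to be seed noise. With six workloads and several algorithms, a multiple-comparison correction would also be worth considering.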

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison stands on independent calibration and experimental runs

The paper reports results from 240 experimental runs comparing six DRL algorithms against a rule-based baseline on six workloads. The central claim (baseline achieves lowest cost) is an empirical outcome from those runs rather than any derivation, equation, or fitted parameter that reduces to its own inputs by construction. No self-citations, ansatzes, or uniqueness theorems are invoked to force the result; the calibration step is described as external to the RL loops, and the evaluation protocol is presented as reproducible and matched across methods.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the representativeness of the six workload patterns and the fairness of the baseline calibration procedure; no free parameters are explicitly fitted in the abstract, and no new entities are postulated.

axioms (2)

domain assumption: The six workload patterns and the Kubernetes HPA environment are representative of production adaptive resource control tasks.
    The abstract invokes these workloads as the test distribution without providing external validation data.
domain assumption: The rule-based baseline calibration is performed without access to the same training or evaluation data used by the RL agents.
    The claim that the baseline beats all RL methods rests on this separation of information.
