arxiv: 2605.21557 · v1 · pith:CU2IPJNYnew · submitted 2026-05-20 · 📊 stat.ML · cs.AI · cs.LG

Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

Pith reviewed 2026-05-22 00:16 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords reinforcement learning · batch size scaling · adaptive training · policy non-stationarity · behavioral divergence · on-policy methods · Atari benchmark

The pith

Adaptive batch scaling lets reinforcement learning use larger batches and networks by tying size to measured policy stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that non-stationarity in RL is not fixed but fades as training progresses, with early rapid policy changes requiring small batches and later near-stationary phases tolerating large ones. It introduces Behavioral Divergence to track action shifts between updates and uses this to shrink or grow the effective batch size dynamically. When paired with bigger networks inside the Parallelised Q-Network algorithm, the method produces stronger results on the ALE benchmark than fixed-batch approaches.

Core claim

Non-stationarity evolves during training; early volatility demands small batches for plasticity while late-stage quasi-stationarity permits large batches for precise updates. Behavioral Divergence quantifies this by measuring action-level differences between consecutive policy updates and serves as the signal to scale batch size inversely with volatility. The resulting Adaptive Batch Scaling procedure, integrated with Parallelised Q-Network, demonstrates that the combination of larger networks and larger batches yields the highest performance on the ALE benchmark.

What carries the argument

Adaptive Batch Scaling (ABS), which computes Behavioral Divergence from action shifts between updates and scales batch size inversely to that volatility measure.
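
The mechanism can be illustrated with a minimal Python sketch, under stated assumptions. This summary does not give the paper's exact formula for Behavioral Divergence, so the action-disagreement rate below is only one plausible action-level measure. The function names, the thresholds `low` and `high`, and the doubling/halving rule are hypothetical; the rollout-length bounds echo the [Lmin → Lmax] rollout range mentioned in the figure captions.

```python
from typing import Sequence


def behavioral_divergence(old_actions: Sequence[int], new_actions: Sequence[int]) -> float:
    """Fraction of probe states whose chosen action changed between two updates.

    Assumption: a simple disagreement rate stands in for the paper's metric.
    """
    if len(old_actions) != len(new_actions) or len(old_actions) == 0:
        raise ValueError("need two non-empty action lists of equal length")
    changed = sum(a != b for a, b in zip(old_actions, new_actions))
    return changed / len(old_actions)


def adapt_rollout_length(current: int, divergence: float,
                         low: float = 0.05, high: float = 0.20,
                         l_min: int = 16, l_max: int = 128) -> int:
    """Scale rollout length (and hence batch size) inversely with volatility.

    Hypothetical rule: double when divergence is below `low`,
    halve when it is above `high`, otherwise keep the current length.
    """
    if divergence < low:
        return min(current * 2, l_max)
    if divergence > high:
        return max(current // 2, l_min)
    return current


# Example: one action out of four changed, so divergence is 0.25 (volatile).
d = behavioral_divergence([0, 1, 2, 3], [0, 1, 2, 0])
next_length = adapt_rollout_length(32, d)
```

With a fixed number of parallel environments, the batch collected per update is proportional to the rollout length, so adjusting the rollout length is one way to change the effective batch size without touching the update rule itself.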

If this is right

  • Late-stage training can safely use batch sizes that would have destabilized early training.
  • Larger networks become advantageous once batch size is allowed to grow with stability.
  • The same adaptive rule can be applied on top of existing on-policy algorithms without changing their core update mechanics.
  • Performance gains appear on the full ALE suite rather than isolated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar divergence-based triggers could adapt other hyperparameters such as learning rate or network depth mid-training.
  • The approach may extend to continuous-control domains where policy shift can be measured by action or state distribution distance.
  • If the divergence signal generalizes, fixed large-batch schedules in RL could be replaced by simple online monitors without extra hyperparameter search.

Load-bearing premise

Behavioral Divergence reliably marks the transition to a quasi-stationary regime in which larger batches improve rather than degrade learning.

What would settle it

An experiment in which batch size is increased precisely when Behavioral Divergence is low yet final performance still drops relative to a fixed small-batch baseline.

Figures

Figures reproduced from arXiv: 2605.21557 by Jongchan Park.

Figure 1: Evolution of Behavioral Divergence, Gradient Noise Scale (GNS) and Batch Size Scaling. (Left) Behavioral divergence and GNS aggregated across 10 Atari environments (3 seeds each). As the policy approaches a near-stationary stage, the divergence diminishes while the GNS increases. (Right) Correspondingly, ABS dynamically scales up the batch size, responding to the increased stability of the learning proc…
Figure 2: Log Scaled Performance Improvement of PQN with ABS relative to vanilla PQN on the Full ALE benchmark. Notably, ABS improves PQN performance in most environments.
  • Q3 (Sensitivity). How sensitive is ABS to its hyperparameters, and which components are most critical to its performance?
  • Q4 (Generalization). Is ABS applicable to continuous control domains (e.g., PPO) and even to a recent off-policy algor…
Figure 3: Learning Curves of Vanilla PQN, with ABS (Ours) and GNS on Atari-10. ABS (Ours) boosts the overall performance of PQN across most tasks, surpassing both vanilla PQN and GNS in efficiency
Figure 4: Mean Improvement Rate of ABS over each Baseline (Left: PQN, Center: PQN-L and Right: PQN-XL) on Atari-10, as a function of Rollout Range. The improvement rate consistently grows with model capacity and with larger Lmax (maximum rollout length; batch size), demonstrating a positive correlation between model scale and the efficacy of adaptive batch
Figure 5: Log Scaled Performance Improvement of PQN with ABS over various Hyperparameter settings (Top: Rollout Range, Center: Policy Change Thresholds and Bottom: Adapt Frequency) on Atari-10.
… over each baseline across the Atari-10 environments, organized into three bar plots corresponding to three model scales: PQN, PQN-L, and PQN-XL. Within each plot, the x-axis denotes the rollout range [Lmin → Lmax] (16→32; 2,…
Figure 6: Learning Curves of Vanilla PPO, with ABS on four MuJoCo locomotion tasks. ABS (Ours) consistently enhances both the final performance and sample efficiency of PPO across all evaluated tasks. These results demonstrate that ABS can be generalized to continuous action spaces, beyond the discrete PQN setting.
5.6. A4: Generalizable to Continuous Control (PPO)
To demonstrate the generality of our approach beyon…
Figure 7: Log Scaled Performance Improvement of PQN with ABS over various Lmax relative to vanilla PQN on the Atari-10 benchmark
Figure 8: Log Scaled Performance Improvement of PQN-L with large fixed batches and ABS over various Lmax relative to vanilla PQN-L on the Atari-10 benchmark
Figure 9: Log Scaled Performance Improvement of PQN-XL with large fixed batches and ABS over various Lmax relative to vanilla PQN-XL on the Atari-10 benchmark.
Original abstract

Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL): beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), which dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS seamlessly reconciles early-stage plasticity with late-stage stable convergence. Strikingly, contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance, a scaling behavior previously thought to be unattainable in RL, now unlocked through adaptive batch control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Adaptive Batch Scaling (ABS) for on-policy RL. It observes that policy non-stationarity evolves over training—rapid early behavioral shifts require small batches for plasticity, while later quasi-stationary regimes benefit from large batches. ABS uses a Behavioral Divergence metric (action-level shifts between consecutive policy updates) to dynamically scale batch size inversely with volatility. Integrated with Parallelised Q-Network (PQN) and evaluated on the ALE benchmark, the paper claims this approach enables the combination of larger networks and larger batch sizes to achieve the best performance, contradicting conventional wisdom that large batches degrade RL due to non-stationarity.

Significance. If the central empirical claims are substantiated with quantitative results, this would represent a meaningful advance in RL scaling practices. It offers a practical mechanism to reconcile plasticity and stable convergence, potentially allowing larger models and batches in on-policy settings where they were previously avoided. The work directly engages with batch-size limitations in RL and provides a falsifiable adaptive rule tied to a measurable stability metric.

major comments (3)
  1. [Abstract] Abstract: The abstract asserts results on the ALE benchmark with PQN showing superiority of large-network + large-batch combinations, yet supplies no quantitative numbers, baselines, error bars, or ablation details. Without these, it is impossible to assess whether the reported scaling behavior is supported or whether it exceeds standard PQN performance.
  2. [Motivation paragraph / Behavioral Divergence section] Motivation and Behavioral Divergence definition: Behavioral Divergence is computed directly from action-level shifts between consecutive updates—the same policy changes whose stability the batch-size rule is intended to control. The manuscript does not provide an external validation (e.g., correlation with independent non-stationarity measures or controlled experiments showing that divergence thresholds predict reduced sensitivity to batch size) that would establish the metric as an independent indicator of the quasi-stationary regime.
  3. [Experimental section / Results on ALE] Experimental validation of the adaptive rule: The central claim requires that divergence-based batch scaling causally enables the large-batch regime without degrading early-stage learning. No quantitative link is shown between chosen divergence thresholds and either (a) measured reduction in batch-size sensitivity or (b) the reported performance gains of the large-network/large-batch combination; this link is load-bearing for the adaptive-control argument.
minor comments (2)
  1. [Method] The formal definition of Behavioral Divergence would benefit from an explicit equation (e.g., in terms of action probabilities or KL divergence) rather than a purely verbal description.
  2. [Figures] Figure captions and axis labels for any learning curves or batch-size trajectories should explicitly state the number of seeds and whether shaded regions represent standard error or deviation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where we will revise the manuscript to incorporate the feedback and strengthen the presentation of our results.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts results on the ALE benchmark with PQN showing superiority of large-network + large-batch combinations, yet supplies no quantitative numbers, baselines, error bars, or ablation details. Without these, it is impossible to assess whether the reported scaling behavior is supported or whether it exceeds standard PQN performance.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific mean scores and standard deviations on the ALE benchmark for the large-network + large-batch configuration under ABS, together with direct comparisons to standard PQN and other baselines. This will make the claimed scaling behavior immediately verifiable from the abstract. revision: yes

  2. Referee: [Motivation paragraph / Behavioral Divergence section] Motivation and Behavioral Divergence definition: Behavioral Divergence is computed directly from action-level shifts between consecutive updates—the same policy changes whose stability the batch-size rule is intended to control. The manuscript does not provide an external validation (e.g., correlation with independent non-stationarity measures or controlled experiments showing that divergence thresholds predict reduced sensitivity to batch size) that would establish the metric as an independent indicator of the quasi-stationary regime.

    Authors: Behavioral Divergence is deliberately defined on action-level policy shifts because these are the precise changes that alter the on-policy data distribution and therefore determine appropriate batch size. While the manuscript demonstrates its practical utility through end-to-end performance, we acknowledge the value of additional validation. In the revision we will add a supplementary analysis correlating Behavioral Divergence with independent signals such as policy entropy and value-function variance across training runs, thereby providing external support for its use as a stability indicator. revision: partial

  3. Referee: [Experimental section / Results on ALE] Experimental validation of the adaptive rule: The central claim requires that divergence-based batch scaling causally enables the large-batch regime without degrading early-stage learning. No quantitative link is shown between chosen divergence thresholds and either (a) measured reduction in batch-size sensitivity or (b) the reported performance gains of the large-network/large-batch combination; this link is load-bearing for the adaptive-control argument.

    Authors: The current results show that ABS yields the best performance precisely when large batches are used in later stages, but we accept that a more explicit causal link between the chosen divergence thresholds and reduced batch-size sensitivity would strengthen the argument. In the revised version we will add targeted ablations that fix batch size at different divergence levels and quantify the resulting performance degradation (or lack thereof), directly illustrating how the adaptive thresholds enable stable large-batch training without harming early plasticity. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines Behavioral Divergence explicitly as a measurement of action-level shifts between consecutive policy updates and uses it as a control signal to adjust batch size. This is a direct computation from observed policy behavior rather than a fitted parameter, self-referential definition, or ansatz that forces the reported performance outcome by construction. The central claim—that adaptive batch scaling unlocks superior large-network/large-batch scaling on the ALE benchmark—is presented as an empirical result from experiments, not a mathematical reduction to the input assumptions. No self-citations, uniqueness theorems, or renamings of known results appear as load-bearing steps in the abstract or described method. The derivation chain remains self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that policy non-stationarity evolves monotonically from high to low volatility and that a simple action-shift metric is sufficient to detect the transition point for batch scaling. No free parameters are named in the abstract, but the inverse scaling rule itself functions as an implicit design choice. No new physical entities are introduced.

axioms (1)
  • domain assumption: Non-stationarity in on-policy RL decreases over the course of training and eventually reaches a quasi-stationary regime.
    Stated in the opening motivation of the abstract as the basis for adaptive batch sizing.
invented entities (1)
  • Behavioral Divergence metric (no independent evidence)
    purpose: Quantifies policy non-stationarity to decide batch size
    Newly defined quantity used to drive the adaptive rule; no independent evidence supplied in the abstract.

pith-pipeline@v0.9.0 · 5742 in / 1351 out tokens · 41468 ms · 2026-05-22T00:16:51.190593+00:00 · methodology

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 10 internal anchors

  1. [1]

    Adaptive Rollout Length for Model-Based RL Using Model-Free Deep RL

    Bhatia, A., Thomas, P. S., and Zilberstein, S. Adaptive rollout length for model-based RL using model-free deep RL. arXiv preprint arXiv:2206.02380,

  2. [2]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817,

  3. [3]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    Stable Gradients for Stable Learning at Scale in Deep Reinforcement Learning

    Castanyer, R. C., Obando-Ceron, J., Li, L., Bacon, P.-L., Berseth, G., Courville, A., and Castro, P. S. Stable gradients for stable learning at scale in deep reinforcement learning. arXiv preprint arXiv:2506.15544,

  5. [5]

    Beyond the Rainbow: High Performance Deep Reinforcement Learning on a Desktop PC

    Clark, T., Towers, M., Evers, C., and Hare, J. Beyond the rainbow: High performance deep reinforcement learning on a desktop PC. arXiv preprint arXiv:2411.03820,

  6. [6]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

  7. [7]

    Simplifying Deep Temporal Difference Learning

    Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., and Martin, M. Simplifying deep temporal difference learning. arXiv preprint arXiv:2407.04811,

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948,

  9. [9]

    Measuring the Algorithmic Efficiency of Neural Networks

    Hernandez, D. and Brown, T. B. Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305,

  10. [10]

    The Impact of Non-Stationarity on Generalisation in Deep Reinforcement Learning

    Igl, M., Farquhar, G., Luketina, J., Boehmer, W., and Whiteson, S. The impact of non-stationarity on generalisation in deep reinforcement learning. arXiv preprint arXiv:2006.05826, 8,

  11. [11]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

  12. [12]

    On the Variance of the Adaptive Learning Rate and Beyond

    Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265,

  13. [13]

    An Empirical Model of Large-Batch Training

    McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162,

  14. [14]

    Mixtures of Experts Unlock Parameter Scaling for Deep RL

    Obando-Ceron, J., Sokar, G., Willi, T., Lyle, C., Farebrother, J., Foerster, J., Dziugaite, G. K., Precup, D., and Castro, P. S. Mixtures of experts unlock parameter scaling for deep RL. arXiv preprint arXiv:2402.08609,

  15. [15]

    Proximal Policy Optimization Algorithms

    Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  16. [16]

    Don't Decay the Learning Rate, Increase the Batch Size

    Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489,

  17. [17]

    1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities, February 2026

    Wang, K., Javali, I., Bortkiewicz, M., Eysenbach, B., et al. 1000 layer networks for self-supervised RL: Scaling depth can enable new goal-reaching capabilities. arXiv preprint arXiv:2503.14858,

  18. [18]

    Large Batch Training of Convolutional Networks

    You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888,

  19. [19]

    Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962,

  20. [20]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476,

  21. [21]

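    ... m
    Compute mean gradient: $\bar{g} \leftarrow \frac{1}{m} \sum_i g_i$
    Estimate Noise: $N \leftarrow b \cdot \frac{1}{m-1} \sum_i \lVert g_i - \bar{g} \rVert^2$
    Estimate Signal: $S \leftarrow \lVert \bar{g} \rVert^2 - \frac{1}{B} N$
    Calculate GNS: $\hat{B}_{\text{simple}} \leftarrow N / S$
    Update Batch Size: $B_{\text{new}} \leftarrow \mathrm{clip}(\lfloor \hat{B}_{\text{simple}} \rfloor, B_{\min}, B_{\max})$

    We employ a dynamic batch size adjustment strategy based on the Gradient Noise Scale (GNS), denoted as $B_{\text{simple}}$, following the empirical model proposed by McCandlish et...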
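
    The GNS update steps in the fragment above can be sketched in plain Python. This is an illustration under assumptions, not the authors' implementation: gradients are passed as flat lists of floats, `b` is taken to be the size of the small batches on which each g_i was computed, and `big_b` stands for the B in the signal correction (the fragment starts mid-algorithm and does not define either). The fallback to `b_max` when the signal estimate is not positive is a hypothetical guard that the fragment does not mention.

```python
import math
from typing import Sequence


def gns_batch_size(grads: Sequence[Sequence[float]], b: int, big_b: int,
                   b_min: int, b_max: int) -> int:
    """Return a clipped batch size from a simple Gradient Noise Scale estimate."""
    m = len(grads)
    if m < 2:
        raise ValueError("need at least two gradient samples")
    dim = len(grads[0])

    # Mean gradient: g_bar = (1/m) * sum_i g_i
    g_bar = [sum(g[k] for g in grads) / m for k in range(dim)]

    # Noise: N = b * 1/(m-1) * sum_i ||g_i - g_bar||^2
    sq_dev = sum(sum((g[k] - g_bar[k]) ** 2 for k in range(dim)) for g in grads)
    noise = b * sq_dev / (m - 1)

    # Signal: S = ||g_bar||^2 - N / B
    signal = sum(x * x for x in g_bar) - noise / big_b

    # Hypothetical guard: a non-positive signal means noise dominates.
    if signal <= 0:
        return b_max

    # GNS estimate and clipped update: B_new = clip(floor(N / S), B_min, B_max)
    b_simple = noise / signal
    return max(b_min, min(math.floor(b_simple), b_max))


# Example: two gradient samples in two dimensions.
new_b = gns_batch_size([[1.0, 0.0], [3.0, 0.0]], b=4, big_b=8, b_min=1, b_max=64)
```

    In the example, the noise estimate is 8 and the signal estimate is 3, so the unclipped value 8/3 is floored to 2.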

  22. [22]

    while $t < T_{\text{total}}$ do
      If $t \bmod K = 0$, set $\theta_{\text{old}} \leftarrow \theta$.
      Step 1: Data Collection
        Collect trajectories $D$ for $L_{\text{adapt}}$ steps using $E$ environments and policy $\pi_\theta$.
        $t \leftarrow t + (E \times L_{\text{adapt}})$
      Step 2: Policy Update (PPO)
        for epoch = 1 to $N_{\text{epochs}}$ do
          Sample mini-batches from $D$.
          Update $\theta$ by maximizing PPO objective $L^{\text{CLIP}}$.
        end for
      Step 3: Adaptive Scaling
        if $t \bmod K = 0$ then
          Calculate KL Divergen...
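
    The fragment above cuts off at the KL step, so how the KL value maps to a new L_adapt cannot be recovered here. The Python sketch below only illustrates two pieces that are visible: the step accounting t ← t + (E × L_adapt) and a KL divergence between the old and new policies. It assumes diagonal Gaussian policies, which is common for PPO on continuous control but is not confirmed by the fragment; both function names are hypothetical.

```python
import math
from typing import Sequence


def diag_gaussian_kl(mu_old: Sequence[float], sigma_old: Sequence[float],
                     mu_new: Sequence[float], sigma_new: Sequence[float]) -> float:
    """KL(old || new) for diagonal Gaussians, summed over action dimensions."""
    kl = 0.0
    for m0, s0, m1, s1 in zip(mu_old, sigma_old, mu_new, sigma_new):
        # Closed-form KL between two univariate normal distributions.
        kl += math.log(s1 / s0) + (s0 ** 2 + (m0 - m1) ** 2) / (2.0 * s1 ** 2) - 0.5
    return kl


def rollouts_until(total_steps: int, num_envs: int, l_adapt: int) -> int:
    """Count rollouts needed when each one adds E * L_adapt environment steps."""
    t, rollouts = 0, 0
    while t < total_steps:
        t += num_envs * l_adapt  # t <- t + (E x L_adapt)
        rollouts += 1
    return rollouts


# Example: a unit shift in the mean with unit variance gives a KL of 0.5.
kl_value = diag_gaussian_kl([0.0], [1.0], [1.0], [1.0])
```

    Because each rollout consumes E × L_adapt environment steps, a longer L_adapt means fewer but larger policy updates within the same step budget.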