Scalable On-Policy Reinforcement Learning via Adaptive Batch Scaling

Jongchan Park

arxiv: 2605.21557 · v1 · pith:CU2IPJNYnew · submitted 2026-05-20 · 📊 stat.ML · cs.AI · cs.LG

Pith reviewed 2026-05-22 00:16 UTC · model grok-4.3

classification: 📊 stat.ML · cs.AI · cs.LG

keywords: reinforcement learning · batch size scaling · adaptive training · policy non-stationarity · behavioral divergence · on-policy methods · Atari benchmark

The pith

Adaptive batch scaling lets reinforcement learning use larger batches and networks by tying size to measured policy stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that non-stationarity in RL is not fixed but fades as training progresses, with early rapid policy changes requiring small batches and later near-stationary phases tolerating large ones. It introduces Behavioral Divergence to track action shifts between updates and uses this to shrink or grow the effective batch size dynamically. When paired with bigger networks inside the Parallelised Q-Network algorithm, the method produces stronger results on the ALE benchmark than fixed-batch approaches.

Core claim

Non-stationarity evolves during training; early volatility demands small batches for plasticity while late-stage quasi-stationarity permits large batches for precise updates. Behavioral Divergence quantifies this by measuring action-level differences between consecutive policy updates and serves as the signal to scale batch size inversely with volatility. The resulting Adaptive Batch Scaling procedure, integrated with Parallelised Q-Network, demonstrates that the combination of larger networks and larger batches yields the highest performance on the ALE benchmark.

What carries the argument

Adaptive Batch Scaling (ABS), which computes Behavioral Divergence from action shifts between updates and scales batch size inversely to that volatility measure.
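
The mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: divergence is taken to be the fraction of sampled states whose greedy action changes between two consecutive Q-networks, and the rollout length (which sets the effective batch size as environments × rollout length) is doubled or halved against two thresholds. The function names, the threshold values `low` and `high`, and the bounds `l_min` and `l_max` are all hypothetical.

```python
import numpy as np


def behavioral_divergence(q_old, q_new):
    """Fraction of states whose greedy action changed between two updates.

    q_old, q_new: arrays of shape (num_states, num_actions) holding Q-values
    from the previous and the current network on the same batch of states.
    (Assumed definition; the source describes the metric only verbally.)
    """
    q_old = np.asarray(q_old)
    q_new = np.asarray(q_new)
    changed = np.argmax(q_old, axis=1) != np.argmax(q_new, axis=1)
    return float(np.mean(changed))


def next_rollout_length(length, divergence, low=0.05, high=0.20,
                        l_min=16, l_max=128):
    """Grow the rollout when the policy is stable, shrink it when volatile.

    The thresholds and bounds are placeholder values, not the paper's.
    """
    if divergence < low:
        length *= 2        # stable policy: allow a larger effective batch
    elif divergence > high:
        length //= 2       # volatile policy: fall back to smaller batches
    return int(min(max(length, l_min), l_max))


# Usage: one of the four greedy actions flips, so the divergence is 0.25,
# which exceeds `high` and halves the rollout length from 32 to 16.
q_prev = np.array([[1.0, 0.0], [0.2, 0.8], [0.5, 0.4], [0.1, 0.9]])
q_curr = np.array([[0.9, 0.1], [0.3, 0.7], [0.4, 0.6], [0.2, 0.8]])
divergence = behavioral_divergence(q_prev, q_curr)
rollout = next_rollout_length(32, divergence)
```

Doubling and halving is only one way to make batch size move inversely to volatility; a continuous mapping from divergence to rollout length would illustrate the same idea.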

If this is right

Late-stage training can safely use batch sizes that would have destabilized early training.
Larger networks become advantageous once batch size is allowed to grow with stability.
The same adaptive rule can be applied on top of existing on-policy algorithms without changing their core update mechanics.
Performance gains appear on the full ALE suite rather than isolated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar divergence-based triggers could adapt other hyperparameters such as learning rate or network depth mid-training.
The approach may extend to continuous-control domains where policy shift can be measured by action or state distribution distance.
If the divergence signal generalizes, fixed large-batch schedules in RL could be replaced by simple online monitors without extra hyperparameter search.

Load-bearing premise

Behavioral Divergence reliably marks the transition to a quasi-stationary regime in which larger batches improve rather than degrade learning.

What would settle it

An experiment in which batch size is increased precisely when Behavioral Divergence is low yet final performance still drops relative to a fixed small-batch baseline.

Figures

Figures reproduced from arXiv: 2605.21557 by Jongchan Park.

**Figure 1.** Evolution of Behavioral Divergence, Gradient Noise Scale (GNS) and Batch Size Scaling. (Left) Behavioral divergence and GNS aggregated across 10 Atari environments (3 seeds each). As the policy approaches a near-stationary stage, the divergence diminishes while the GNS increases. (Right) Correspondingly, ABS dynamically scales up the batch size, responding to the increased stability of the learning proc…

**Figure 2.** Log Scaled Performance Improvement of PQN with ABS relative to vanilla PQN on the Full ALE benchmark. Notably, ABS improves PQN performance in most environments.

**Figure 3.** Learning Curves of Vanilla PQN, with ABS (Ours) and GNS on Atari-10. ABS (ours) boosts the overall performance of PQN across most tasks, surpassing both vanilla PQN and GNS in efficiency.

**Figure 4.** Mean Improvement Rate of ABS over each Baseline (Left: PQN, Center: PQN-L and Right: PQN-XL) on Atari-10, as a function of Rollout Range. The improvement rate consistently grows with model capacity and with larger Lmax (maximum rollout length; batch size), demonstrating a positive correlation between model scale and the efficacy of adaptive batch…

**Figure 5.** Log Scaled Performance Improvement of PQN with ABS over various Hyperparameter settings (Top: Rollout Range, Center: Policy Change Thresholds and Bottom: Adapt Frequency) on Atari-10.

**Figure 6.** Learning Curves of Vanilla PPO, with ABS on four MuJoCo locomotion tasks. ABS (Ours) consistently enhances both the final performance and sample efficiency of PPO across all evaluated tasks. These results demonstrate that ABS can be generalized to continuous action spaces, beyond the discrete PQN setting.

**Figure 7.** Log Scaled Performance Improvement of PQN with ABS over various Lmax relative to vanilla PQN on the Atari-10 benchmark.

**Figure 8.** Log Scaled Performance Improvement of PQN-L with large fixed batches and ABS over various Lmax relative to vanilla PQN-L on the Atari-10 benchmark.

**Figure 9.** Log Scaled Performance Improvement of PQN-XL with large fixed batches and ABS over various Lmax relative to vanilla PQN-XL on the Atari-10 benchmark.

Original abstract

Conventional wisdom holds that large-batch training is fundamentally incompatible with Reinforcement Learning (RL) - beyond a modest threshold, increasing batch sizes typically yields diminishing returns or performance degradation due to the inherent non-stationarity of the data distribution. We challenge this view by observing that non-stationarity is not a fixed property of RL, but evolves throughout training: early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence. Motivated by this observation, we propose Adaptive Batch Scaling (ABS), that dynamically adjusts the effective batch size according to the stability of the learning policy. Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility. Integrated with the Parallelised Q-Network (PQN) algorithm and evaluated on the ALE benchmark, ABS seamlessly reconciles early-stage plasticity with late-stage stable convergence. Strikingly, contrary to conventional wisdom, our results reveal that the combination of larger networks and larger batch sizes achieves the best performance - a scaling behavior previously thought to be unattainable in RL, now unlocked through adaptive batch control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a simple adaptive rule to grow batch sizes in on-policy RL once policy updates stabilize, but the abstract supplies no numbers or controls to back the large-network-plus-large-batch claim.

The central idea is to track policy stability with Behavioral Divergence, which counts action-level differences between consecutive updates, and to increase batch size as that metric drops. Early high volatility gets small batches to preserve learning speed; later low volatility gets large batches for better gradient estimates. This is paired with PQN on the ALE suite and framed as unlocking scaling that conventional RL wisdom says is off-limits.

The observation that non-stationarity is not constant across training is the clearest contribution and feels like a useful reframing for people who have hit batch-size walls when trying bigger models. The rule itself is lightweight and does not require extra networks or heavy computation, which is a practical plus if it works.

The experiments are described only at the level of the abstract, so there are no reported scores, baselines, ablations, or variance numbers to evaluate. That makes it impossible to tell whether the reported superiority of large networks with large batches is driven by the adaptive schedule or would appear under any schedule that simply enlarges batches late in training. The circularity concern also lands: because the divergence metric is built directly from the same policy updates whose stability it is supposed to signal, we need evidence that the threshold actually predicts reduced sensitivity to batch size rather than just tracking training progress.

A reader working on scaling on-policy methods or on large RL agents would find the adaptive heuristic worth trying, but only after seeing the full experimental section with proper controls. The work shows clear thinking about the time-varying nature of the problem and honest engagement with the usual objections to large batches. It is worth sending to peer review so the experiments can be checked in detail; the idea is concrete enough that referees could give useful feedback on validation of the metric and on whether the scaling result survives tighter comparisons.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes Adaptive Batch Scaling (ABS) for on-policy RL. It observes that policy non-stationarity evolves over training—rapid early behavioral shifts require small batches for plasticity, while later quasi-stationary regimes benefit from large batches. ABS uses a Behavioral Divergence metric (action-level shifts between consecutive policy updates) to dynamically scale batch size inversely with volatility. Integrated with Parallelised Q-Network (PQN) and evaluated on the ALE benchmark, the paper claims this approach enables the combination of larger networks and larger batch sizes to achieve the best performance, contradicting conventional wisdom that large batches degrade RL due to non-stationarity.

Significance. If the central empirical claims are substantiated with quantitative results, this would represent a meaningful advance in RL scaling practices. It offers a practical mechanism to reconcile plasticity and stable convergence, potentially allowing larger models and batches in on-policy settings where they were previously avoided. The work directly engages with batch-size limitations in RL and provides a falsifiable adaptive rule tied to a measurable stability metric.

major comments (3)

[Abstract] Abstract: The abstract asserts results on the ALE benchmark with PQN showing superiority of large-network + large-batch combinations, yet supplies no quantitative numbers, baselines, error bars, or ablation details. Without these, it is impossible to assess whether the reported scaling behavior is supported or whether it exceeds standard PQN performance.
[Motivation paragraph / Behavioral Divergence section] Motivation and Behavioral Divergence definition: Behavioral Divergence is computed directly from action-level shifts between consecutive updates—the same policy changes whose stability the batch-size rule is intended to control. The manuscript does not provide an external validation (e.g., correlation with independent non-stationarity measures or controlled experiments showing that divergence thresholds predict reduced sensitivity to batch size) that would establish the metric as an independent indicator of the quasi-stationary regime.
[Experimental section / Results on ALE] Experimental validation of the adaptive rule: The central claim requires that divergence-based batch scaling causally enables the large-batch regime without degrading early-stage learning. No quantitative link is shown between chosen divergence thresholds and either (a) measured reduction in batch-size sensitivity or (b) the reported performance gains of the large-network/large-batch combination; this link is load-bearing for the adaptive-control argument.

minor comments (2)

[Method] The formal definition of Behavioral Divergence would benefit from an explicit equation (e.g., in terms of action probabilities or KL divergence) rather than a purely verbal description.
[Figures] Figure captions and axis labels for any learning curves or batch-size trajectories should explicitly state the number of seeds and whether shaded regions represent standard error or deviation.
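
To illustrate the first minor comment, two candidate forms for such an equation are given here. Both are assumptions for discussion, not the manuscript's definition: $Q_k$ and $\pi_k$ denote the action-value function and policy after update $k$, and $\mathcal{B}$ is a batch of sampled states. None of these symbols is taken from the paper.

```latex
% Discrete actions: fraction of states whose greedy action changes.
D_k^{\text{greedy}} = \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}}
  \mathbb{1}\!\left[ \arg\max_a Q_k(s, a) \neq \arg\max_a Q_{k-1}(s, a) \right]

% Stochastic policies: mean KL divergence between consecutive policies.
D_k^{\text{KL}} = \frac{1}{|\mathcal{B}|} \sum_{s \in \mathcal{B}}
  D_{\mathrm{KL}}\!\left( \pi_{k-1}(\cdot \mid s) \,\middle\|\, \pi_k(\cdot \mid s) \right)
```

The first form suits value-based agents with discrete actions, the second suits policy-gradient methods; either would make the thresholds in the adaptive rule reproducible.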

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, indicating where we will revise the manuscript to incorporate the feedback and strengthen the presentation of our results.

Referee: [Abstract] Abstract: The abstract asserts results on the ALE benchmark with PQN showing superiority of large-network + large-batch combinations, yet supplies no quantitative numbers, baselines, error bars, or ablation details. Without these, it is impossible to assess whether the reported scaling behavior is supported or whether it exceeds standard PQN performance.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific mean scores and standard deviations on the ALE benchmark for the large-network + large-batch configuration under ABS, together with direct comparisons to standard PQN and other baselines. This will make the claimed scaling behavior immediately verifiable from the abstract.
revision: yes
Referee: [Motivation paragraph / Behavioral Divergence section] Motivation and Behavioral Divergence definition: Behavioral Divergence is computed directly from action-level shifts between consecutive updates—the same policy changes whose stability the batch-size rule is intended to control. The manuscript does not provide an external validation (e.g., correlation with independent non-stationarity measures or controlled experiments showing that divergence thresholds predict reduced sensitivity to batch size) that would establish the metric as an independent indicator of the quasi-stationary regime.

Authors: Behavioral Divergence is deliberately defined on action-level policy shifts because these are the precise changes that alter the on-policy data distribution and therefore determine appropriate batch size. While the manuscript demonstrates its practical utility through end-to-end performance, we acknowledge the value of additional validation. In the revision we will add a supplementary analysis correlating Behavioral Divergence with independent signals such as policy entropy and value-function variance across training runs, thereby providing external support for its use as a stability indicator.
revision: partial
Referee: [Experimental section / Results on ALE] Experimental validation of the adaptive rule: The central claim requires that divergence-based batch scaling causally enables the large-batch regime without degrading early-stage learning. No quantitative link is shown between chosen divergence thresholds and either (a) measured reduction in batch-size sensitivity or (b) the reported performance gains of the large-network/large-batch combination; this link is load-bearing for the adaptive-control argument.

Authors: The current results show that ABS yields the best performance precisely when large batches are used in later stages, but we accept that a more explicit causal link between the chosen divergence thresholds and reduced batch-size sensitivity would strengthen the argument. In the revised version we will add targeted ablations that fix batch size at different divergence levels and quantify the resulting performance degradation (or lack thereof), directly illustrating how the adaptive thresholds enable stable large-batch training without harming early plasticity.
revision: yes
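
The correlation analysis mentioned in the second response could be as simple as a rank correlation between the logged divergence trace and an independent signal such as policy entropy. The sketch below is a hypothetical illustration, not the authors' analysis: the function name and the example traces are made up, and ranks are assigned by sort order, so tied values are not averaged.

```python
import numpy as np


def spearman_rank_correlation(x, y):
    """Spearman correlation via Pearson correlation of ordinal ranks.

    Assumes both traces are 1-D, equally long, and free of ties
    (ties would need average ranks, which this sketch does not compute).
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.ndim != 1 or x.shape != y.shape or x.size < 2:
        raise ValueError("expected two 1-D traces of equal length >= 2")
    rank_x = np.argsort(np.argsort(x)).astype(np.float64)
    rank_y = np.argsort(np.argsort(y)).astype(np.float64)
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Made-up traces sampled at the same training checkpoints.
divergence_trace = [0.40, 0.31, 0.22, 0.15, 0.09]
entropy_trace = [1.20, 0.95, 0.70, 0.52, 0.30]
rho = spearman_rank_correlation(divergence_trace, entropy_trace)
```

A coefficient near +1 on real logs would only show that the two signals fall together over training; it would not by itself separate a stability indicator from a proxy for elapsed training time, which is the referee's underlying concern.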

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines Behavioral Divergence explicitly as a measurement of action-level shifts between consecutive policy updates and uses it as a control signal to adjust batch size. This is a direct computation from observed policy behavior rather than a fitted parameter, self-referential definition, or ansatz that forces the reported performance outcome by construction. The central claim—that adaptive batch scaling unlocks superior large-network/large-batch scaling on the ALE benchmark—is presented as an empirical result from experiments, not a mathematical reduction to the input assumptions. No self-citations, uniqueness theorems, or renamings of known results appear as load-bearing steps in the abstract or described method. The derivation chain remains self-contained against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested premise that policy non-stationarity evolves monotonically from high to low volatility and that a simple action-shift metric is sufficient to detect the transition point for batch scaling. No free parameters are named in the abstract, but the inverse scaling rule itself functions as an implicit design choice. No new physical entities are introduced.

axioms (1)

Domain assumption: Non-stationarity in on-policy RL decreases over the course of training and eventually reaches a quasi-stationary regime.
Stated in the opening motivation of the abstract as the basis for adaptive batch sizing.

invented entities (1)

Behavioral Divergence metric (no independent evidence)
purpose: Quantifies policy non-stationarity to decide batch size
Newly defined quantity used to drive the adaptive rule; no independent evidence supplied in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Central to ABS is Behavioral Divergence, a novel metric that quantifies policy non-stationarity by measuring action-level shifts between consecutive updates, which we use to scale batch size inversely to policy volatility."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "early stages exhibit rapid behavioral shifts that demand small batches for plasticity, whereas late stages approach a quasi-stationary regime where large batches enable precise convergence"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 10 internal anchors

[1] Bhatia, A., Thomas, P. S., and Zilberstein, S. Adaptive rollout length for model-based RL using model-free deep RL. arXiv preprint arXiv:2206.02380.
[2] Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817.
[3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
[4] Castanyer, R. C., Obando-Ceron, J., Li, L., Bacon, P.-L., Berseth, G., Courville, A., and Castro, P. S. Stable gradients for stable learning at scale in deep reinforcement learning. arXiv preprint arXiv:2506.15544.
[5] Clark, T., Towers, M., Evers, C., and Hare, J. Beyond the rainbow: High performance deep reinforcement learning on a desktop PC. arXiv preprint arXiv:2411.03820.
[6] Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
[7] Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., and Martin, M. Simplifying deep temporal difference learning. arXiv preprint arXiv:2407.04811.
[8] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
[9] Hernandez, D. and Brown, T. B. Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305.
[10] Igl, M., Farquhar, G., Luketina, J., Boehmer, W., and Whiteson, S. The impact of non-stationarity on generalisation in deep reinforcement learning. arXiv preprint arXiv:2006.05826, 8.
[11] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
[12] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265.
[13] McCandlish, S., Kaplan, J., Amodei, D., and Team, O. D. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162.
[14] Obando-Ceron, J., Sokar, G., Willi, T., Lyle, C., Farebrother, J., Foerster, J., Dziugaite, G. K., Precup, D., and Castro, P. S. Mixtures of experts unlock parameter scaling for deep RL. arXiv preprint arXiv:2402.08609.
[15] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[16] Smith, S. L., Kindermans, P.-J., Ying, C., and Le, Q. V. Don't decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489.
[17] Wang, K., Javali, I., Bortkiewicz, M., Eysenbach, B., et al. 1000 layer networks for self-supervised RL: Scaling depth can enable new goal-reaching capabilities. arXiv preprint arXiv:2503.14858, February 2026.
[18] You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
[19] You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.
[20] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Fan, T., Liu, G., Liu, L., Liu, X., et al. DAPO: An open-source LLM reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476.
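[21] Compute mean gradient: $\bar{g} \leftarrow \frac{1}{m} \sum_{i=1}^{m} g_i$
Estimate Noise: $N \leftarrow b \cdot \frac{1}{m-1} \sum_{i=1}^{m} \lVert g_i - \bar{g} \rVert^2$
Estimate Signal: $S \leftarrow \lVert \bar{g} \rVert^2 - \frac{1}{B} N$
Calculate GNS: $\hat{B}_{\text{simple}} \leftarrow N / S$
Update Batch Size: $B_{\text{new}} \leftarrow \operatorname{clip}(\lfloor \hat{B}_{\text{simple}} \rfloor, B_{\min}, B_{\max})$
We employ a dynamic batch size adjustment strategy based on the Gradient Noise Scale (GNS), denoted as $B_{\text{simple}}$, following the empirical model proposed by McCandlish et...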
[22] while $t < T_{\text{total}}$ do
If $t \bmod K = 0$, set $\theta_{\text{old}} \leftarrow \theta$.
Step 1: Data Collection. Collect trajectories $D$ for $L_{\text{adapt}}$ steps using $E$ environments and policy $\pi_\theta$. $t \leftarrow t + (E \times L_{\text{adapt}})$
Step 2: Policy Update (PPO). for epoch $= 1$ to $N_{\text{epochs}}$ do: sample mini-batches from $D$; update $\theta$ by maximizing the PPO objective $L^{\text{CLIP}}$. end for
Step 3: Adaptive Scaling. if $t \bmod K = 0$ then Calculate KL Divergen...
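
The gradient-noise-scale steps quoted in entry [21] can be sketched in code as follows. This is an illustrative reading of that fragment, not the paper's implementation: it assumes the m gradients each come from a mini-batch of size b, that the large batch is B = m·b, and that a non-positive signal estimate should fall back to the maximum batch size. The function name and the fallback are hypothetical.

```python
import numpy as np


def gns_batch_size(grads, b, b_min, b_max, eps=1e-12):
    """Pick a batch size from the simple gradient noise scale N / S.

    grads: array of shape (m, num_params), one flattened gradient per
           mini-batch of size b (assumption: B = m * b).
    """
    grads = np.asarray(grads, dtype=np.float64)
    if grads.ndim != 2 or grads.shape[0] < 2:
        raise ValueError("need at least two flattened mini-batch gradients")
    m = grads.shape[0]
    g_bar = grads.mean(axis=0)                              # mean gradient
    noise = b * np.sum((grads - g_bar) ** 2) / (m - 1)      # N
    signal = float(g_bar @ g_bar) - noise / (m * b)         # S
    if signal <= eps:
        # Assumed fallback: the estimate is noise-dominated, so use b_max.
        return int(b_max)
    b_simple = noise / signal                               # GNS estimate
    return int(np.clip(np.floor(b_simple), b_min, b_max))


# Usage with two toy gradients: N = 8, S = 5 - 8/8 = 4, so N / S = 2.
toy_grads = np.array([[2.0, 0.0], [2.0, 2.0]])
new_batch = gns_batch_size(toy_grads, b=4, b_min=1, b_max=64)
```

With only a handful of mini-batch gradients the estimate of S is noisy and can turn negative, which is why some fallback or smoothing is needed in practice; the choice made here is arbitrary.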
