AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

Aviral Kumar; Chenlu Ye; Hao Bai; Rui Yang; Spencer Whitehead; Tong Zhang

arxiv: 2606.05597 · v2 · pith:EPZPTRJDnew · submitted 2026-06-04 · 💻 cs.LG

Hao Bai, Rui Yang, Chenlu Ye, Spencer Whitehead, Aviral Kumar, Tong Zhang

Pith reviewed 2026-06-28 02:41 UTC · model grok-4.3

classification 💻 cs.LG

keywords: async RL · web agents · multi-step RL · GRPO · vision-language agents · trajectory normalization · training efficiency · WebGym

The pith

An asynchronous design that overlaps rollouts with updates speeds web RL training by up to 2.9x, and a constant normalizer in GRPO contracts verbose trajectories while preserving success.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that synchronous multi-step RL for vision-language web agents wastes GPU time on idle periods and produces overly long trajectories because the per-trajectory normalizer in GRPO down-weights the negative gradient on failed tokens. An asynchronous pipeline overlaps rollout, gradient computation, and policy refresh, using an everlasting rollout pool and lightweight screenshot handling to raise throughput. Replacing the length-based normalizer 1/|τ_i| with the constant 1/k removes the coupling between failure length and gradient magnitude, shortening paths without lowering aggregate success. These fixes matter because web tasks require repeated visual observations and actions, so both compute and token costs grow quickly with task hardness.

Core claim

AsyncWebRL overlaps rollout, gradient update, and policy refresh across iterations using an asynchronous design with an everlasting rollout pool and lightweight screenshot handling, delivering up to 2.9 times higher end-to-end training throughput than the prior fastest open synchronous pipeline. It also identifies the per-trajectory normalizer 1/|τ_i| in multi-step GRPO as the source of token-level inefficiency, because failures are longer than successes and therefore receive weaker negative gradients; replacing it with a constant 1/k contracts trajectories while keeping aggregate success intact. The combined changes produce a new open-source state of the art on the WebGym out-of-distribution test split.
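The overlap described above can be pictured with a small producer/consumer sketch. This is only an illustration under stated assumptions, not the paper's implementation: `asyncio` tasks stand in for distributed rollout workers, a shared dict stands in for broadcast policy weights, and every name here (`rollout_worker`, `trainer`, `policy`) is hypothetical.

```python
import asyncio

async def rollout_worker(worker_id, queue, policy, stop):
    """Keeps producing trajectories; the pool is never rebuilt between updates."""
    while not stop.is_set():
        await asyncio.sleep(0.001)  # stand-in for browser interaction
        await queue.put({"worker": worker_id, "policy_version": policy["version"]})

async def trainer(queue, policy, stop, num_updates, batch_size):
    """Consumes batches as they arrive and refreshes the shared policy."""
    staleness = []
    for _ in range(num_updates):
        batch = [await queue.get() for _ in range(batch_size)]
        # How many versions behind the oldest trajectory in this batch is.
        staleness.append(max(policy["version"] - t["policy_version"] for t in batch))
        policy["version"] += 1  # stand-in for gradient update + weight refresh
    stop.set()
    return staleness

async def main(num_workers=4, num_updates=5, batch_size=8):
    queue = asyncio.Queue(maxsize=32)
    policy = {"version": 0}
    stop = asyncio.Event()
    workers = [asyncio.create_task(rollout_worker(i, queue, policy, stop))
               for i in range(num_workers)]
    staleness = await trainer(queue, policy, stop, num_updates, batch_size)
    for w in workers:
        w.cancel()  # workers blocked on a full queue are released here
    await asyncio.gather(*workers, return_exceptions=True)
    return policy["version"], staleness

final_version, staleness = asyncio.run(main())
print(final_version, staleness)
```

Because workers keep running while the trainer updates, some trajectories in a batch may come from an older policy version; the `staleness` list records that gap per update, which is the kind of off-policyness an asynchronous design has to tolerate.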

What carries the argument

The constant normalizer 1/k in multi-step GRPO that decouples trajectory length from per-token gradient weight, together with asynchronous overlap of rollout and update steps.
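A minimal sketch of why the normalizer matters, assuming a simplified policy-gradient form in which every token of trajectory i is weighted by its advantage times the normalizer. The `Trajectory` class, the token counts, the ±1 advantages, and k = 1000 are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    num_tokens: int    # |tau_i|: total response tokens in the trajectory
    advantage: float   # group-relative advantage (positive = success)

def per_token_weight(traj, k=None):
    """Weight multiplying each token's log-prob gradient.

    k=None uses the per-trajectory normalizer 1/|tau_i|;
    otherwise the constant normalizer 1/k is used.
    """
    norm = 1.0 / traj.num_tokens if k is None else 1.0 / k
    return traj.advantage * norm

short_success = Trajectory(num_tokens=500, advantage=+1.0)
long_failure = Trajectory(num_tokens=2000, advantage=-1.0)

length_norm = [per_token_weight(t) for t in (short_success, long_failure)]
const_norm = [per_token_weight(t, k=1000) for t in (short_success, long_failure)]
print(length_norm)  # [0.002, -0.0005]
print(const_norm)   # [0.001, -0.001]
```

Under 1/|τ_i| the per-token penalty on the long failure is a quarter of the per-token reward on the short success; under 1/k the two have equal magnitude, so length no longer dilutes the negative signal.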

If this is right

End-to-end training throughput rises by up to 2.9 times over prior synchronous web RL pipelines.
Agent trajectories shorten in length while aggregate task success is preserved.
Relative performance gains reach +42 percent on medium tasks and +48 percent on hard tasks.
New state-of-the-art open-source results on the WebGym out-of-distribution test split.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same length-bias problem in GRPO normalizers is likely to appear in other multi-step RL domains where failure trajectories differ in length from successes.
Asynchronous overlapping of rollout and update could transfer to other vision-language RL settings that currently use synchronous pipelines.
Lower per-trajectory token counts may reduce inference-time compute when the trained agents are deployed on real web tasks.
The approach could be combined with experience replay or other memory mechanisms to further stabilize training on long-horizon web navigation.

Load-bearing premise

Failures are systematically longer than successes, so that swapping the per-trajectory normalizer for a constant will shorten trajectories without lowering success rates or harming training stability.

What would settle it

Train the same web-agent policy with the constant 1/k normalizer and measure whether average trajectory length falls while success rate on the WebGym test split stays the same or rises.
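The check above could be scripted roughly as follows. This is a sketch under assumptions: each run is represented as a list of hypothetical `(num_steps, succeeded)` pairs from one evaluation pass, the toy numbers are invented for illustration, and `tolerance` is an assumed knob for how much success-rate noise to accept.

```python
from statistics import mean

def summarize(run):
    """run: list of (num_steps, succeeded) pairs from one evaluation pass."""
    return {
        "mean_steps": mean(steps for steps, _ in run),
        "success_rate": mean(1.0 if ok else 0.0 for _, ok in run),
    }

def supports_claim(baseline, candidate, tolerance=0.0):
    """True if the candidate is shorter on average and no less successful."""
    b, c = summarize(baseline), summarize(candidate)
    shorter = c["mean_steps"] < b["mean_steps"]
    not_worse = c["success_rate"] >= b["success_rate"] - tolerance
    return shorter and not_worse

# Toy data only: (steps, succeeded) for the 1/|tau_i| run and the 1/k run.
length_norm_run = [(12, True), (25, False), (30, False), (10, True)]
constant_norm_run = [(9, True), (15, False), (18, False), (8, True)]
print(supports_claim(length_norm_run, constant_norm_run))  # True
```

A real comparison would also need multiple seeds and per-difficulty splits before the result could be called settled.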

Figures

Figures reproduced from arXiv: 2606.05597 by Aviral Kumar, Chenlu Ye, Hao Bai, Rui Yang, Spencer Whitehead, Tong Zhang.

**Figure 1.** Multi-step Asynchronous Management. Compared to WebGym, AsyncWebRL eliminates the inter-iteration bubble time caused by reconstructing the rollout pool at every iteration boundary and waiting for the policy refresh of π_t to complete. Colored blocks denote concurrent rollout workers producing trajectories, gradient updates on π_t, and policy refreshes that broadcast new weights to the rollout workers. Whi…

**Figure 2.** Test success rate vs. training trajectories collected on the WebGym OOD test split. Solid colored curves are our runs under AsyncWebRL: AsyncWebRL (full) and AsyncWebRL-RAFT++. The gray dashed curve is the prior WebGym sync REINFORCE baseline (values from Bai et al. (2026)). Top: Instruct. Bottom: Thinking. No WebGym curve is shown on Thinking because its baseline was not trained under our (10, 20, 30) per…

**Figure 3.** Left: End-to-end training trajectory throughput. Right: Off-policyness during GRPO training for the Instruct (blue) and Thinking (red) Qwen3-VL-8B runs: mean and max of the per-token off-policy gap g across a training batch. … family, async framework substituted for sync). AsyncWebRL-RAFT++ on Instruct reaches 39.3%, against 42.9% for the prior WebGym pipeline (Table 1). The 3.6% gap is consistent with the …

**Figure 4.** Effect of the 1/|τ_i| normalizer on GRPO training dynamics. Rows are the two Qwen3-VL-8B variants (top: Instruct, bottom: Thinking); columns are, from left to right, test reward, #steps per trajectory, per-token entropy, and #tokens per step. Test reward is essentially tied between the two losses, but the 1/|τ_i| run produces longer trajectories, longer per-step responses, and lower per-token entropy.

**Figure 5.** Coupled vs. decoupled importance sampling under the async RL GRPO loss. From left to right: per-update mean of training reward, fraction of tokens hit by the ϵ-clip, and fraction of tokens hit by the dual clip. Adjacent table values (GRPO vs. No LN): keys_0 = keys_last 7% vs. 65%; no-edit step pairs 36% vs. 76%; generic-slot keys 34% vs. 11%.

**Figure 6.** Memory at step 4 of one representative 10/20/30-horizon rollout per checkpoint. 1/|τ_i| accumulates verbose generic keys; the constant 1/k fix re-keys Memory to task sub-questions. Recovered plot labels: panels Instruct, Success / Easy, Success / Medium, Failure / Easy, …; x-axis: Step; y-axis: # Memory keys.

**Figure 7.** Trajectory-mean number of Memory JSON keys per agent step, split by outcome (Success, Failure) and difficulty (Easy, Medium). 1/|τ_i| tracks the one-new-key-per-step diagonal; the constant 1/k fix stays close to Base. … 20/40/60 while keeping 1/|τ_i| does exactly that, and we see the predicted amplification in Figure 7: the GRPO (length norm, 20/40/60) curve tracks the one-new-key-per-step diagonal across bo…
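As a rough illustration of the quantity plotted in Figure 7, the sketch below counts top-level keys in a Memory JSON object at each agent step. The snapshot format and the key names are assumptions made for the example; the source does not show the agent's actual memory schema.

```python
import json

def memory_key_counts(memory_snapshots):
    """Number of top-level JSON keys in the agent's Memory at each step."""
    return [len(json.loads(snapshot)) for snapshot in memory_snapshots]

# Hypothetical rollout that adds one new key per step (the "diagonal").
verbose_rollout = [
    '{"goal": "find price"}',
    '{"goal": "find price", "page_1": "home"}',
    '{"goal": "find price", "page_1": "home", "page_2": "search"}',
]
# Hypothetical rollout that keeps a fixed set of task-specific keys.
compact_rollout = [
    '{"price": null}',
    '{"price": null}',
    '{"price": "19.99"}',
]
print(memory_key_counts(verbose_rollout))  # [1, 2, 3]
print(memory_key_counts(compact_rollout))  # [1, 1, 1]
```

Averaging these per-step counts over many trajectories, grouped by outcome and difficulty, would give curves of the kind the caption describes.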

**Figure 8.** Learning-rate ablation on RAFT++ (Qwen3-VL-8B-Instruct, b=120, off-policy = 2, no KL). From left to right: held-out test success rate (the larger-LR run wins by a wide margin at the peak), training reward, where the trend reverses, pre-clipping gradient L2 norm (consistently bounded for the larger LR), and per-token policy entropy, which decays faster under the larger LR. The train/test flip suggests low-L…

**Figure 9.** GRPO vs. RAFT++ on Qwen3-VL-8B-Instruct. Left: training reward per optimizer step. Middle: per-token policy entropy averaged over the loss mask; the step-zero offset is induced by the conditional vs. unconditional denominator (see text). Right: fraction of tokens hitting the ϵ-clip; both runs sit near 0.3%, ruling out clip mechanics as the driver of the entropy gap.

**Figure 10.** Effect of consumer batch size on GRPO Instruct training, plotted against wall-clock hours: batch=128 (the canonical setting used throughout the paper) versus batch=32, all other hyperparameters held fixed. From left: test reward, training reward, policy entropy, average steps per trajectory, and average response tokens per step. … time only (forward, backward, optimizer), which the trainer records directly …

**Figure 11.** Lightweight Screenshot Handling. Compared to WebGym, which serializes every screenshot through the shared RPC object store and spills to disk under concurrent rollouts, AsyncWebRL keeps all image tensors in a dedicated in-memory actor and routes only lightweight references through RPC, eliminating the per-step object-store bottleneck. … only the slices it needs at gradient-update time and immediately releas…
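The reference-passing idea in Figure 11 can be sketched in a single process. This is an assumption-laden stand-in: the paper describes a dedicated in-memory actor reached over RPC, whereas the hypothetical `ScreenshotStore` below is a plain Python class, and the 1 KiB byte strings stand in for image tensors.

```python
import itertools

class ScreenshotStore:
    """In-memory holder for image payloads; callers pass around small handles."""

    def __init__(self):
        self._images = {}
        self._ids = itertools.count()

    def put(self, image_bytes):
        """Store a payload and return a lightweight integer handle."""
        handle = next(self._ids)
        self._images[handle] = image_bytes
        return handle

    def take(self, handles):
        """Fetch the requested payloads and release them immediately."""
        return [self._images.pop(h) for h in handles]

    def __len__(self):
        return len(self._images)

store = ScreenshotStore()
# Trajectory records carry only handles, never the image payloads themselves.
trajectory = [{"step": i, "screenshot": store.put(bytes(1024))} for i in range(3)]
# At update time, only the needed slice is materialized and then freed.
batch = store.take(step["screenshot"] for step in trajectory[:2])
print(len(batch), len(store))  # 2 1
```

The point of the design is that whatever moves between workers and trainer stays small; the heavy payload is touched once, when the gradient step actually needs it.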

Original abstract

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a $2.9\times$ end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer $1/|\tau_i|$ in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing $1/|\tau_i|$ with a constant $1/k$ breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
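For readers converting the headline figure, a small worked calculation follows. It assumes "relative" means a multiplicative gain on the 42.9% baseline success rate; the resulting absolute number is derived here, not quoted from the paper.

```python
def relative_to_absolute(baseline_pct, relative_gain_pct):
    """Apply a relative gain (in percent) to a baseline percentage."""
    return baseline_pct * (1.0 + relative_gain_pct / 100.0)

new_best = relative_to_absolute(42.9, 5.8)
print(round(new_best, 1))          # 45.4
print(round(new_best - 42.9, 1))   # 2.5
```

Under that reading, +5.8% relative corresponds to roughly 2.5 percentage points of absolute improvement, which is much smaller than the +42% and +48% relative gains reported on the Medium and Hard slices.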

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AsyncWebRL gets real throughput from the async pipeline and a simple GRPO normalizer swap that shortens trajectories, but the claim that the swap preserves success without side effects still needs direct evidence.

AsyncWebRL's async pipeline and the GRPO normalizer swap are the two concrete advances here. The async setup, with its everlasting rollout pool and lightweight screenshot handling, raises end-to-end training throughput by up to 2.9x over WebGym. The normalizer change from 1/|τ_i| to 1/k is meant to stop the policy from over-producing long failed trajectories.

The system contributions are straightforward and address a real bottleneck in synchronous multi-step RL. Overlapping the phases makes sense for web agents where rollouts can vary in length.

On the algorithm, the observation that failures are longer and thus get down-weighted under the old normalizer is a fair diagnosis of why policies stay verbose. Switching to constant 1/k is presented as breaking that link without losing success rate. The +5.8% relative overall gain and the larger gains on harder slices are the headline numbers.

The weak point is the lack of direct evidence shown for the "no side effects" part of the normalizer claim. If the paper has length stats or ablations confirming success holds steady, that would strengthen it; otherwise the SOTA result could partly reflect other factors.

This work is for labs running RL on visual web tasks. It is worth a referee's time because the throughput improvement is measurable and the normalizer idea is testable on the same benchmark.

I recommend sending it out for review.

Referee Report

3 major / 2 minor

Summary. The paper introduces AsyncWebRL, an asynchronous multi-step RL framework for vision-language web agents. It claims two main contributions: (1) a system design with asynchronous rollouts, an everlasting rollout pool, and lightweight screenshot handling that yields up to 2.9× end-to-end training throughput over prior synchronous pipelines; (2) replacement of the per-trajectory normalizer 1/|τ_i| in multi-step GRPO with a constant 1/k, motivated by the observation that failures are longer than successes, which is asserted to contract trajectory lengths while preserving aggregate success rate. These changes are reported to establish a new open-source SOTA on the WebGym out-of-distribution test split (+5.8% relative over 42.9%), with larger relative gains on Medium (+42%) and Hard (+48%) slices.

Significance. If the central claims are substantiated with full experimental details, the work would offer a practical advance in scaling RL for web agents by jointly tackling compute inefficiency and trajectory verbosity. The combination of async system optimizations and the targeted normalizer change could influence training pipelines for long-horizon agents; the reported gains on harder task slices are particularly noteworthy if reproducible.

major comments (3)

[Abstract and §3] (algorithmic contribution): the assertion that replacing 1/|τ_i| with constant 1/k 'contracts trajectories while preserving aggregate success' without side effects on other metrics or training stability is load-bearing for the SOTA attribution, yet the provided text supplies no length histograms, per-slice success deltas, stability curves, or ablations confirming that success is unchanged rather than traded for length on some tasks.
[§4] (experiments): the headline +5.8% overall and larger gains on Medium/Hard slices are presented without error bars, ablation tables isolating the normalizer change from the async system, or verification that the constant 1/k does not introduce instability or degrade other metrics; this absence prevents assessment of whether the reported improvements are robust.
[§2.2] (GRPO formulation): the claim that 1/|τ_i| down-weights negative gradients on failed tokens because failures are systematically longer requires explicit quantitative support (e.g., mean length statistics for success vs. failure trajectories) to justify the normalizer swap as the root cause rather than a correlated symptom.
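The quantitative support requested in the last comment amounts to a simple split of trajectory lengths by outcome. The sketch below shows one way to compute it; the `(num_tokens, succeeded)` input format and the toy values are assumptions for illustration, not data from the paper.

```python
from statistics import mean, median

def length_stats_by_outcome(trajectories):
    """trajectories: iterable of (num_tokens, succeeded) pairs."""
    groups = {"success": [], "failure": []}
    for num_tokens, succeeded in trajectories:
        groups["success" if succeeded else "failure"].append(num_tokens)
    # Skip empty groups so mean/median never see an empty list.
    return {
        name: {"n": len(lengths), "mean": mean(lengths), "median": median(lengths)}
        for name, lengths in groups.items()
        if lengths
    }

# Toy lengths only, chosen so that failures are longer than successes.
toy = [(400, True), (600, True), (500, True), (1800, False), (2200, False)]
stats = length_stats_by_outcome(toy)
print(stats["success"]["mean"], stats["failure"]["mean"])  # 500 2000
```

Reporting these statistics per difficulty slice, alongside the gradient-weight ratio they imply under 1/|τ_i|, would speak directly to the root-cause question.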

minor comments (2)

The definition and value of the constant k (free parameter) should be stated explicitly with sensitivity analysis, as it appears in the free_parameters list but is not detailed in the abstract.
Notation for trajectory length |τ_i| and the new normalizer should be introduced with a clear equation reference in the main text for readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below with clarifications and commitments to revisions that strengthen the presentation of our claims without altering the core contributions.

Referee: [Abstract and §3] (algorithmic contribution): the assertion that replacing 1/|τ_i| with constant 1/k 'contracts trajectories while preserving aggregate success' without side effects on other metrics or training stability is load-bearing for the SOTA attribution, yet the provided text supplies no length histograms, per-slice success deltas, stability curves, or ablations confirming that success is unchanged rather than traded for length on some tasks.

Authors: We agree that additional supporting analyses would strengthen the manuscript. In the revised version we will include length histograms for success versus failure trajectories, per-slice success rate deltas, training stability curves, and an ablation isolating the normalizer change. Internal results confirm that the constant 1/k contracts average trajectory length while preserving aggregate success rate with no degradation in stability or other metrics; the per-trajectory normalizer was observed to systematically reduce gradient magnitude on longer failure trajectories.

Revision: yes

Referee: [§4] (experiments): the headline +5.8% overall and larger gains on Medium/Hard slices are presented without error bars, ablation tables isolating the normalizer change from the async system, or verification that the constant 1/k does not introduce instability or degrade other metrics; this absence prevents assessment of whether the reported improvements are robust.

Authors: We acknowledge that error bars and isolated ablations improve assessment of robustness. The revised §4 will report error bars from multiple random seeds, add an ablation table separating the async system from the normalizer change, and include additional stability metrics such as loss curves. The reported gains were obtained under fixed experimental conditions; we will document the full setup to support reproducibility.

Revision: yes

Referee: [§2.2] (GRPO formulation): the claim that 1/|τ_i| down-weights negative gradients on failed tokens because failures are systematically longer requires explicit quantitative support (e.g., mean length statistics for success vs. failure trajectories) to justify the normalizer swap as the root cause rather than a correlated symptom.

Authors: We will add explicit quantitative support in the revised §2.2, including mean and median trajectory lengths for success and failure trajectories. These statistics show failures are systematically longer, directly motivating the normalizer change to equalize gradient weighting; this evidence distinguishes the root cause from a mere correlation.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical design choice

The paper identifies an observed empirical pattern (failures longer than successes) and proposes replacing the per-trajectory normalizer 1/|τ_i| with constant 1/k as a direct algorithmic change. This is presented as a motivated design decision rather than a derived prediction or result that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems are invoked in a load-bearing way. The reported gains are empirical outcomes of the system and algorithmic changes, with no reduction of claims to self-referential fitting or renaming.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review based on abstract only; full paper would be needed to list all free parameters, axioms, and invented entities. From abstract, the constant k appears as a design choice and the length-difference observation is treated as a domain fact.

free parameters (1)

k
Constant used in place of 1/|τ_i|; value not stated in abstract.

axioms (1)

Domain assumption: Failures are systematically longer than successes in the web-agent trajectories.
Invoked as the root cause of the normalizer problem.

pith-pipeline@v0.9.1-grok · 5776 in / 1315 out tokens · 54370 ms · 2026-06-28T02:41:43.922686+00:00 · methodology

Reference graph

Works this paper leans on

35 extracted references · 22 canonical work pages · 10 internal anchors

[1] Thinking vs. doing: Agents that reason by scaling test-time interaction. arXiv preprint arXiv:2506.07976.
[2] HybridFlow: A Flexible and Efficient RLHF Framework. 2024.
[3] Mastering complex control in MOBA games with deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence.
[4] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585.
[5] Understanding R1-Zero-Like Training: A Critical Perspective. 2025.
[6] LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training. arXiv preprint arXiv:2505.24034. https://arxiv.org/abs/2505.24034
[7] DAPO: An Open-Source LLM Reinforcement Learning System at Scale. arXiv preprint arXiv:2503.14476.
[8] Reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment. ICML 2025 Workshop on Computer Use Agents, 2025.
[9] A minimalist approach to LLM reasoning: from rejection sampling to reinforce. arXiv preprint arXiv:2504.11343.
[10] GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL. arXiv preprint arXiv:2602.22190.
[11] Endless Terminals: Scaling RL Environments for Terminal Agents. arXiv preprint arXiv:2601.16443.
[12] Training Software Engineering Agents and Verifiers with SWE-Gym. arXiv preprint arXiv:2412.21139.
[13] Modeling context in referring expressions. European Conference on Computer Vision, 2016.
[14] VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. arXiv preprint arXiv:2504.07615.
[15] UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning. arXiv preprint arXiv:2509.02544.
[16] WebAgent-R1: Training web agents via end-to-end multi-turn reinforcement learning. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[17] Digi-Q: Learning Q-value functions for training device-control agents. arXiv preprint arXiv:2502.15760.
[18] DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning. Advances in Neural Information Processing Systems.
[19] What can RL bring to VLA generalization? An empirical study. arXiv preprint arXiv:2505.19789.
[20] Robot-R1: Reinforcement learning for enhanced embodied reasoning in robotics. arXiv preprint arXiv:2506.00070.
[21] Fine-tuning large vision-language models as decision-making agents via reinforcement learning. Advances in Neural Information Processing Systems.
[22] Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, 2016.
[23] IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. International Conference on Machine Learning, 2018.
[24] Asynchronous RLHF: Faster and more efficient off-policy RL for language models. arXiv preprint arXiv:2410.18252.
[25] AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning. 2025.
[26] StreamRL: Scalable, heterogeneous, and elastic RL for LLMs with disaggregated stream generation. arXiv preprint arXiv:2504.15930.
[27] AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training. arXiv preprint arXiv:2507.01663, 2025.
[28] Part II: ROLL Flash--Accelerating RLVR and Agentic Training with Asynchrony. arXiv preprint arXiv:2510.11345.
[29] Laminar: A scalable asynchronous RL post-training framework. arXiv preprint arXiv:2510.12633.
[30] Zhou, Yuzhen and Li, Jiajun and Su, Yusheng and Ramesh, Gowtham and Zhu, Zilin and Long, Xiang and Zhao, Chenyang and Pan, Jin and Yu, Xiaodong and Wang, Ze and Du, Kangrui and Wu, Jialian and Sun, Ximeng and Liu, Jiang and Yu, Qiaolin and Chen, Hao and Liu, Zicheng and Barsoum, Emad.
[31] Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
[32] Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.
[33] WebGym: Scaling Training Environments for Visual Web Agents with Realistic Tasks. arXiv preprint arXiv:2601.02439.
[34] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300.
[35] Batch size-invariance for policy optimization. 2021.