Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Caiming Xiong; Ding Zhao; Haolin Chen; Huan Wang; Jielin Qiu; Shiyu Wang; Silvio Savarese; Weiran Yao; Zhepeng Cen; Zhiwei Liu; Zuxin Liu

arxiv: 2510.06499 · v2 · submitted 2025-10-07 · 💻 cs.CL · cs.AI

Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Jielin Qiu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao

Pith reviewed 2026-05-18 08:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords reinforcement learning · data pipeline · large language models · question answer pairs · pretraining · efficient RL · scalable data

The pith

Reinforcement learning on a new 1.2-million-example dataset matches continual pretraining performance using up to 100 times fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models learn mainly by imitating patterns in massive text collections, but this leaves a training-generation gap that weakens their reasoning abilities. Reinforcement learning promises a more data-efficient way to close that gap, yet it has been limited by RL datasets that are orders of magnitude smaller and less diverse than pretraining corpora. The Webscale-RL pipeline addresses this by automatically turning web-scale pretraining documents into millions of diverse, verifiable question-answer pairs. A dataset of 1.2 million such pairs across more than nine domains was built this way. When models train with RL on these pairs, they outperform continual pretraining and data refinement baselines on a suite of benchmarks while needing far fewer tokens.

Core claim

The paper claims that the Webscale-RL pipeline can convert large-scale pre-training documents into a dataset of 1.2 million diverse, verifiable question-answer pairs. Reinforcement learning trained on this dataset significantly outperforms continual pretraining and data refinement methods. It reaches equivalent performance to continual pretraining but with up to 100 times fewer tokens.
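The arithmetic behind the efficiency claim can be made concrete with a short sketch. The function name and the one-billion-token budget below are hypothetical and chosen only for illustration; the paper reports "up to 100 times", so the factor of 100 is an upper bound on the saving, not a guaranteed ratio.

```python
def equivalent_rl_tokens(pretraining_tokens: float, efficiency_factor: float = 100.0) -> float:
    """Tokens RL would need to match a continual-pretraining run,
    assuming RL is `efficiency_factor` times more token-efficient."""
    if efficiency_factor <= 0:
        raise ValueError("efficiency_factor must be positive")
    return pretraining_tokens / efficiency_factor


# Hypothetical budget, not a figure from the paper.
cpt_tokens = 1_000_000_000
best_case = equivalent_rl_tokens(cpt_tokens)         # at the claimed upper bound of 100x
weaker_case = equivalent_rl_tokens(cpt_tokens, 10)   # if the saving were only 10x
print(f"continual pretraining: {cpt_tokens:,} tokens")
print(f"RL at 100x: {best_case:,.0f} tokens; RL at 10x: {weaker_case:,.0f} tokens")
```

Reading the claim this way makes clear that the comparison is at matched benchmark performance, so the token budgets of both runs have to be reported for the ratio to be checkable.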

What carries the argument

The Webscale-RL pipeline, which systematically converts pre-training documents into millions of diverse, verifiable question-answer pairs for use in reinforcement learning.
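A minimal sketch of how such a conversion loop could be organised, using only the stages named in the Figure 2 caption (a domain-specific demonstration library for few-shot examples, multiple personas per document, and a verification step). Every name here is hypothetical, and the generator and verifier are toy stand-ins for what would presumably be LLM calls; this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    domain: str
    persona: str


# Hypothetical interfaces: (document, persona, few-shot demos) -> (question, answer) or None
Generator = Callable[[str, str, List[str]], Optional[Tuple[str, str]]]
# (document, question, answer) -> True if the answer is supported by the document
Verifier = Callable[[str, str, str], bool]


def convert_document(
    document: str,
    domain: str,
    personas: List[str],
    demo_library: Dict[str, List[str]],
    generate: Generator,
    verify: Verifier,
) -> List[QAPair]:
    """Generate one candidate QA pair per persona and keep only verified ones."""
    demos = demo_library.get(domain, [])  # few-shot examples for this domain
    kept: List[QAPair] = []
    for persona in personas:
        candidate = generate(document, persona, demos)
        if candidate is None:
            continue
        question, answer = candidate
        if verify(document, question, answer):
            kept.append(QAPair(question, answer, domain, persona))
    return kept


# Toy stand-ins so the sketch runs without any model.
def toy_generate(document: str, persona: str, demos: List[str]) -> Optional[Tuple[str, str]]:
    words = document.split()
    if not words:
        return None
    return (f"[{persona}] What is the first word of the passage?", words[0])


def toy_verify(document: str, question: str, answer: str) -> bool:
    return answer in document  # crude check: the answer appears in the source text


pairs = convert_document(
    "Water boils at 100 degrees Celsius at sea level.",
    "science",
    ["student", "teacher"],
    {"science": []},
    toy_generate,
    toy_verify,
)
print(len(pairs), [p.persona for p in pairs])
```

Passing the generator and verifier in as callables keeps the loop independent of any particular model, which is also where the quality questions raised later in this review would have to be measured.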

If this is right

RL training becomes viable at pretraining scales of data volume
Language models gain better reasoning with less total training compute
The training-generation gap in LLMs can be bridged more efficiently
Performance improves across multiple domains and benchmarks compared to baselines

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be extended to generate RL data for specific tasks like mathematics or coding
Lower token usage might make frequent model retraining more practical
The pipeline might help reduce biases if the source documents are carefully selected
Similar methods could apply to non-text modalities in the future

Load-bearing premise

The automated pipeline generates high-quality, diverse, and truly verifiable question-answer pairs from pretraining documents without adding noise or unverifiable content that would hurt RL results.

What would settle it

The claim would be undercut if human reviewers find that a large portion of the generated question-answer pairs cannot be verified from the source documents, or if RL performance fails to improve when training only on verified pairs rather than on the full set.
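One way to turn such a human audit into a number is to sample pairs, count how many reviewers could verify, and report a confidence interval on the verified fraction. The sketch below uses the Wilson score interval, which is a standard choice for proportions and is my assumption here, not something the paper specifies; the audit counts are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical audit: 200 sampled pairs, 184 judged verifiable from the source document.
low, high = wilson_interval(184, 200)
print(f"verified fraction: {184 / 200:.2f}, 95% interval: [{low:.3f}, {high:.3f}]")
```

The lower end of the interval is the useful figure: it bounds how much label noise the RL reward signal could plausibly be carrying.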

Figures

Figures reproduced from arXiv: 2510.06499 by Caiming Xiong, Ding Zhao, Haolin Chen, Huan Wang, Jielin Qiu, Shiyu Wang, Silvio Savarese, Weiran Yao, Zhepeng Cen, Zhiwei Liu, Zuxin Liu.

**Figure 2.** Overview of the Webscale-RL data pipeline that systematically converts large-scale pretraining data into RL data while preserving the scale and diversity of web data. The pipeline maintains a domain-specific demonstration library for few-shot examples for high quality generation and assigns multiple personas to each document to encourage reflecting different viewpoints. The generated QA pairs are verified …

**Figure 3.** Left: The domain distribution of Webscale-RL dataset. Right: The comparison on question embedding of Webscale-RL and Nemotron data. We randomly sample 5K questions from each dataset and visualize the embedding (by Qwen3-Embedding) reduced to 2D using UMAP. We list the domain distribution of our dataset in …

**Figure 4.** Scaling comparison between Webscale-RL training and continual pretraining with …

Original abstract

Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete automated pipeline to turn pretraining documents into 1.2M RL QA pairs and claims this yields 100x token efficiency over continual pretraining, but the result stands or falls on whether the generated pairs are actually clean and verifiable.

The main takeaway is that the authors built Webscale-RL, a pipeline that systematically converts large pretraining corpora into 1.2 million question-answer pairs across more than nine domains, then show RL training on this data reaches the performance of continual pretraining with up to 100 times fewer tokens while beating some refinement baselines on benchmarks. That directly targets the data bottleneck that has kept RL from scaling like pretraining does for LLMs. The scale and the automation are the parts that feel new; prior synthetic data work has been smaller or less tied to raw web documents. The experiments provide a practical demonstration that larger, more diverse RL data can close the training-generation gap more efficiently than just adding more pretraining tokens. Credit for shipping a sizable new dataset and running the head-to-head comparisons.

The soft spot is data quality. The efficiency claim requires that the automated conversion produces questions whose answers are uniquely determined by the source text and free of noise or external-knowledge leakage. If the pipeline sometimes outputs ambiguous or multi-answer cases, the RL reward signal gets diluted and the 100x number becomes harder to interpret. The abstract asserts verifiable pairs but the real test is whether the full paper reports verification rates, inter-annotator checks, or failure-mode analysis; without those numbers the central result rests on an unproven premise.

This is aimed at researchers scaling RL for reasoning or building synthetic data engines. People working on data pipelines or efficiency comparisons will find the dataset and the token-efficiency angle useful even if they end up re-running the verification themselves.

It deserves a serious referee because it supplies a new large-scale resource and empirical results on a timely problem; the claims are testable once the methods and controls are examined in detail. I would send it out for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Webscale-RL automated data pipeline that converts large-scale pre-training documents into 1.2 million diverse, verifiable question-answer pairs spanning more than 9 domains. Through experiments, it claims that RL training on this dataset significantly outperforms continual pretraining and strong baselines on benchmarks, achieving equivalent performance with up to 100× fewer tokens.

Significance. If the central efficiency result holds, this would represent a substantial advance in scaling reinforcement learning for language models to match the data scale of pretraining. The work provides a concrete path to address the data bottleneck in RL for LLMs, potentially leading to more robust reasoning capabilities. The scale of the constructed dataset (1.2M examples) is a notable practical contribution.

major comments (2)

§3 (Pipeline): The description asserts that the automated conversion produces 'verifiable question-answer pairs' from arbitrary pretraining documents, yet no quantitative verification success rate, error analysis, or inter-annotator agreement is reported. This directly undermines the 100× token-efficiency claim, as unverifiable or noisy QA pairs would inject label noise into the RL reward signal and inflate apparent gains relative to the continual pretraining baseline.
§5 (Experiments): The headline result that RL matches continual pretraining performance at up to 100× fewer tokens lacks sufficient controls on the baseline implementation, including exact token budgets, whether the pretraining used the identical source documents, and details on how the 1.2M QA pairs were formatted for RL (e.g., reward model or verifier). Without these, the efficiency comparison cannot be isolated from data-quality artifacts.

minor comments (2)

Abstract: The phrase 'a suite of benchmarks' should list the specific evaluation tasks and metrics to allow immediate assessment of the claimed outperformance.
Throughout: Ensure the term 'verifiable' is defined operationally (e.g., via an explicit verification procedure or success threshold) on first use rather than left as a qualitative descriptor.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of the pipeline and experiments. We address each major point below and have revised the manuscript to incorporate additional details and analyses where needed.

Referee, §3 (Pipeline): The description asserts that the automated conversion produces 'verifiable question-answer pairs' from arbitrary pretraining documents, yet no quantitative verification success rate, error analysis, or inter-annotator agreement is reported. This directly undermines the 100× token-efficiency claim, as unverifiable or noisy QAs would inject label noise into the RL reward signal and inflate apparent gains relative to the continual pretraining baseline.

Authors: We agree that the original manuscript would benefit from explicit quantitative metrics on verification quality. The pipeline generates questions from source documents and verifies answers directly against those documents using an automated process, which ensures verifiability by construction rather than post-hoc human labeling. However, we acknowledge the absence of a reported success rate or error breakdown. In the revised manuscript, we have added a new subsection to §3 that includes a verification success rate computed over a held-out sample of documents, an error analysis of failure cases (e.g., ambiguous questions or partial answers), and a comparison against a small human-verified subset. These additions directly address concerns about label noise and better support the reported efficiency gains.

Revision: yes

Referee, §5 (Experiments): The headline result that RL matches continual pretraining performance at up to 100× fewer tokens lacks sufficient controls on the baseline implementation, including exact token budgets, whether the pretraining used the identical source documents, and details on how the 1.2M QA pairs were formatted for RL (e.g., reward model or verifier). Without these, the efficiency comparison cannot be isolated from data-quality artifacts.

Authors: This is a fair critique of the experimental reporting. The continual pretraining baseline was run on the same underlying pretraining corpus from which the 1.2M QA pairs were derived. To make this explicit, the revised §5 now includes a table with precise token budgets for both the RL runs and the continual pretraining runs, confirmation that identical source documents were used, and a detailed description of the RL formatting: each QA pair is presented as a prompt-completion pair with a binary reward signal produced by a verifier model that checks answer correctness against the original document. These clarifications allow the efficiency comparison to be more cleanly isolated from data artifacts.

Revision: yes
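
To make the "binary reward signal" in the simulated rebuttal concrete, here is a minimal sketch of a 0/1 reward. It is an assumption-laden stand-in: the rebuttal describes a verifier model, whereas this sketch substitutes normalised exact match against a reference answer so that it runs without any model. The function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def binary_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 if the normalised answers match exactly, else 0.0.

    Stand-in for a verifier model: exact match is stricter than a learned
    verifier and will reject correct paraphrases.
    """
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0


print(binary_reward("The Eiffel Tower.", "eiffel tower"))
print(binary_reward("Paris", "London"))
```

Whatever verifier is used in practice, a reward of this shape is only as reliable as the reference answer, which is why the verification rate of the generated pairs matters for the efficiency comparison.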

Circularity Check

0 steps flagged

No circularity: empirical efficiency claims rest on benchmark comparisons, not derivations or self-referential fits

The paper introduces an automated pipeline for generating QA pairs from pretraining documents and reports experimental results showing RL training achieves comparable performance to continual pretraining with up to 100x fewer tokens. No equations, first-principles derivations, or parameter-fitting steps are described that would reduce predictions to inputs by construction. The central claim is supported by direct empirical comparisons across benchmarks rather than any self-citation chain or definitional equivalence. This is a standard empirical systems paper with external falsifiability via reported training runs and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unverified assumption that the automated pipeline can reliably extract verifiable QA pairs at web scale.

axioms (1)

Domain assumption: Pre-training documents can be systematically converted into diverse, verifiable question-answer pairs suitable for RL. This is the foundational premise of the Webscale-RL pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5762 in / 1205 out tokens · 29865 ms · 2026-05-18T08:36:11.654697+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry gives the Lean source file, the theorem name, the relation between the paper passage and the cited Recognition theorem, and the passage itself.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 6 internal anchors

[1] Yubo Wang et al. "MMLU-Pro: A more robust and challenging multi-task language understanding benchmark". In: Advances in Neural Information Processing Systems 37 (2024), pp. 95266–95290.
[2] Maurice Weber et al. "RedPajama: an open dataset for training large language models". In: Advances in Neural Information Processing Systems 37 (2024), pp. 116462–116492.
[3] Yuxiang Wei et al. "SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution". In: arXiv preprint arXiv:2502.18449 (2025).
[4] Alexander Wettig et al. "QuRating: Selecting high-quality data for training language models". In: arXiv preprint arXiv:2402.09739 (2024).
[5] xAI. Grok 4. https://x.ai/news/grok-4. [Online]. 2025.
[6] Tian Xie et al. "Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning". In: arXiv preprint arXiv:2502.14768 (2025).
[7] An Yang et al. "Qwen2.5 Technical Report". In: arXiv preprint arXiv:2412.15115 (2024).
[8] An Yang et al. "Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement". In: arXiv preprint arXiv:2409.12122 (2024).
[9] Weizhe Yuan et al. "NaturalReasoning: Reasoning in the wild with 2.8M challenging questions". In: arXiv preprint arXiv:2502.13124 (2025).
[10] Eric Zelikman et al. "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking". In: arXiv preprint arXiv:2403.09629 (2024).
[11] Shaokun Zhang et al. "Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning". In: arXiv preprint arXiv:2505.00024 (2025).
[12] Yanzhao Zhang et al. "Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models". In: arXiv preprint arXiv:2506.05176 (2025).
[13] Fan Zhou et al. "Programming every example: Lifting pre-training data quality like experts at scale". In: arXiv preprint arXiv:2409.17115 (2024).
[14] Fan Zhou et al. "MegaMath: Pushing the limits of open math corpora". In: arXiv preprint arXiv:2504.02807 (2025).
