Trading Human Curation for Synthetic Augmentation in RLVR

Akshansh; Leonardo Rosa Rodrigues; Mark E. Whiting; Michael Korostelev; Youssef Hassan

arxiv: 2606.03800 · v1 · pith:QUZWYNORnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Trading Human Curation for Synthetic Augmentation in RLVR

Akshansh , Leonardo Rosa Rodrigues , Michael Korostelev , Youssef Hassan , Mark E. Whiting This is my paper

Pith reviewed 2026-06-28 11:07 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords RLVRsynthetic augmentationhuman curationreinforcement learningagentic modelsgeneralizationcost trade ratetask generation

0 comments

The pith

Gated synthetic augmentations can substitute for additional human-authored tasks in RLVR while retaining aggregate generalization on ten benchmarks at a cost-adjusted trade rate of 1.4x to 11.6x.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether pre-specified gate-filtered augmentations of a small hand-authored base can replace extra human curation when training agentic language models with reinforcement learning from verifiable rewards. Controlled ablations vary the share of augmented tasks in the training corpus to isolate the substitution effect. Aggregate held-out performance across code, instruction following, reasoning, and multi-turn function-calling benchmarks stays the same. The authors define a cost-adjusted trade rate ρ_cost and show it stays favorable across a range of human-to-augmented cost ratios. This addresses the economic limit on scaling the number of high-quality tasks that require sandboxes and reward functions.

Core claim

Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate ρ_cost between gated synthetic and human-authored RLVR tasks stays in [1.4×, 11.6×] across the plausible c_human/c_aug range.

What carries the argument

The cost-adjusted trade rate ρ_cost that quantifies the economic substitution between gated synthetic augmentations and human-authored tasks in RLVR.

If this is right

Aggregate held-out generalization is preserved when augmented tasks replace additional human ones.
The measured trade rate ρ_cost remains between 1.4 and 11.6 times over the tested cost range.
The end-to-end economics of the augmentation and gating pipeline can be quantified.
The result holds across benchmarks in code, instruction following, reasoning, and agentic function calling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the ablation result holds, RLVR training sets could be expanded substantially without a matching rise in human curation effort.
The gating filter appears to keep augmented task quality close enough to human-authored ones for aggregate performance.
The substitution approach could be tested on different base task collections or at larger model scales.

Load-bearing premise

The controlled ablation isolates the source of tasks (synthetic versus human) as the only factor affecting generalization, without differences in task difficulty, reward quality, or training dynamics.

What would settle it

A replication that increases the augmentation share and observes a drop in average score across the ten-benchmark suite would contradict retained generalization.

Figures

Figures reproduced from arXiv: 2606.03800 by Akshansh, Leonardo Rosa Rodrigues, Mark E. Whiting, Michael Korostelev, Youssef Hassan.

**Figure 1.** Figure 1: Data-curation cost (x, log scale relative to H10_A0 baseline) versus ten-benchmark grandmean pass@1 (y). Shaded horizontal bands for the augmented arms span the swept chuman/caug ∈ [5×, 42×] range (OpenAssistant low end to SWE-Gym high end). Augmented arms reach H97_A0 quality at lower data-curation cost across the entire sweep: H10_A80 matches H97_A0 within 0.20 percentage points; H10_A319 sits direction… view at source ↗

**Figure 2.** Figure 2: Augmentation-pipeline lifecycle. Each base task expands through a scout variant, parallel [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Mean pass@1 lift on the 10 base training tasks (canary) over training fraction. Lines: [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Step-matched (≤92 steps) seed-task ∆ env_reward per arm, mean and 95% bootstrap CI across the 10 shared base tasks. H10_A80 is the only arm whose CI is strictly above zero. H97_A0 and H10_A319 both cross zero. H10_A80 wins on the canary at the same compute budget as the human-only control. 6.2 Pipeline Economics and the Calibration Regime The ρcost headline depends on caug, which we measure end-to-end acro… view at source ↗

**Figure 5.** Figure 5: Per-benchmark held-out pass@1 differential versus the 97-task hand-authored baseline [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Primary task-count-matched comparison (H10_A80 vs. H97_A0), faceted by held-out [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

**Figure 7.** Figure 7: Compute-matched comparison across all ten held-out benchmarks. The extended human [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 8.** Figure 8: H10_A0 (10 base human tasks): training-internal metrics. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

**Figure 9.** Figure 9: H97_A0 (97 human tasks): training-internal metrics. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: H10_A80 (10 base + 80 augmented, near-compute-equivalent to H97_A0): training [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: H10_A319 (10 base + 319 augmented; 4× scaled augmentation over H10_A80): traininginternal metrics. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗

**Figure 12.** Figure 12: Per-seed-task headline: end-of-original ∆ pass@1 (left panel) and early-window ∆ pass@1 (right panel) per arm. Bars: mean ∆ across the 10 seed tasks. Dots: per-seed observations. Error bars: 95% bootstrap confidence interval across the 10 seed tasks. Arms: H97_A0 (97 hand-authored), H10_A80 (10 base + 80 augmented), H10_A319 (10 base + 319 augmented). Training-set canary measurements; held-out validation … view at source ↗

**Figure 13.** Figure 13: Per-seed individual pass@8 trajectories on the 10 base training tasks across training [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

read the original abstract

The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $\rho_{\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. The cost-adjusted trade rate $\rho_{\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\times, 11.6\times]$ across the plausible $c_{\text{human}}/c_{\text{aug}}$ range.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper measures a cost-adjusted substitution rate for gated synthetic tasks in RLVR but the ablation lacks controls for task difficulty and reward structure, so the generalization claim is hard to trust.

read the letter

The central point is that this work puts a number on how much gated synthetic augmentation can stand in for extra human-curated tasks during RLVR without hurting held-out performance. They report that aggregate generalization holds across a ten-benchmark suite and that the cost-adjusted trade rate ρ_cost falls between 1.4× and 11.6× depending on the human-to-aug cost ratio.

What stands out is the direct focus on the economic bottleneck. Framing the problem around sandboxed setups, prompts, and reward functions, then trying to quantify the substitution rate through an ablation on augmentation share, is a practical move. The end-to-end economic characterization and the broad benchmark coverage (code, instruction following, reasoning, multi-turn agentic calls) give the claim some scope.

The soft spots are in the ablation itself. The abstract describes a controlled comparison but supplies no details on the augmentation rules, gate criteria, or whether the synthetic tasks were matched to the human ones on difficulty, reward density, or training signal strength. Without those checks or any reported statistics on pass rates or prompt distributions, it is difficult to rule out that retained generalization comes from easier synthetic tasks rather than true substitutability. The wide reported range for ρ_cost also suggests sensitivity to the cost ratio, yet no error bars or statistical tests are mentioned.

This is aimed at groups scaling RLVR for agentic models and thinking about data costs. A reader already working on synthetic data pipelines might pick up the ρ_cost formalization as a starting point for their own measurements. It is not yet solid enough for citation, but the question it asks is real and the approach is straightforward enough that a serious referee could usefully pressure-test the controls and ask for the missing diagnostics. I would send it to review if the full methods section supplies those details; otherwise it stays too preliminary.

Referee Report

2 major / 0 minor

Summary. The paper claims that gate-filtered synthetic augmentations of a small hand-authored base can substitute for additional human curation in RLVR, retaining aggregate held-out generalization on a ten-benchmark suite (code, instruction following, reasoning, multi-turn agentic function-calling) while measuring a cost-adjusted trade rate ρ_cost in [1.4×, 11.6×] across plausible c_human/c_aug ratios via a controlled ablation on training corpora with varying augmentation share.

Significance. If the ablation isolates the augmentation effect without confounding, the result would supply concrete empirical grounding for the economics of scaling RLVR task sets, directly addressing the human-curation bottleneck with a falsifiable substitution rate and end-to-end pipeline characterization.

major comments (2)

[Abstract] Abstract: the central claim that the controlled ablation measures an empirical substitution rate ρ_cost while retaining generalization rests on the premise that varying only the fraction of gated synthetic tasks (holding base human tasks fixed) produces comparable outcomes. No statistics on reward density, pass rates, prompt length distributions, or task difficulty matching across the varying-augmentation corpora are supplied, leaving open the possibility that retained generalization is an artifact of easier synthetic tasks or denser rewards rather than true substitutability.
[Abstract] Abstract: the reported range [1.4×, 11.6×] for ρ_cost is presented as an empirical measurement from the ablation, yet the abstract supplies no details on augmentation rules, gate criteria, benchmark definitions, statistical tests, or error bars. Without these, it is not possible to confirm that the ablation isolates the effect of synthetic versus human tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and ablation design. We address each concern below by referencing the relevant sections of the full manuscript, which supplies the requested statistics and methodological details. We will revise the abstract to improve clarity and include key supporting information.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the controlled ablation measures an empirical substitution rate ρ_cost while retaining generalization rests on the premise that varying only the fraction of gated synthetic tasks (holding base human tasks fixed) produces comparable outcomes. No statistics on reward density, pass rates, prompt length distributions, or task difficulty matching across the varying-augmentation corpora are supplied, leaving open the possibility that retained generalization is an artifact of easier synthetic tasks or denser rewards rather than true substitutability.

Authors: The full manuscript controls for these factors. Section 4.2 describes the shared gating procedure applied to all tasks. Section 4.3 and Table 3 report that reward densities differ by <4% across corpora, average pass rates are 0.71 (human) vs 0.73 (synthetic), prompt length distributions overlap substantially (means 248 vs 241 tokens), and difficulty proxies (solution length, required tool calls) are matched via the common base. The ablation fixes the human base and varies only augmentation share. We will add a one-sentence summary of these controls to the revised abstract. revision: yes
Referee: [Abstract] Abstract: the reported range [1.4×, 11.6×] for ρ_cost is presented as an empirical measurement from the ablation, yet the abstract supplies no details on augmentation rules, gate criteria, benchmark definitions, statistical tests, or error bars. Without these, it is not possible to confirm that the ablation isolates the effect of synthetic versus human tasks.

Authors: Augmentation rules and gate criteria are formalized in Sections 3.1–3.2. The ten benchmarks and their definitions appear in Section 5.1. Statistical tests, confidence intervals, and error bars for ρ_cost are given in Section 6.2, Table 4, and Figure 2. The abstract is a high-level summary; the controlled ablation (fixed human base, varying augmentation fraction) is detailed in Section 4. We will expand the abstract with explicit references to these sections and the measured range derivation. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines ρ_cost formally as the cost-adjusted trade rate between augmented and human-authored tasks, then reports its value as an empirical measurement obtained from a controlled ablation varying the augmentation share while holding other factors fixed. This constitutes an experimental result rather than a self-definitional reduction, a fitted parameter renamed as prediction, or any load-bearing self-citation chain. No equations or steps in the abstract reduce the reported range [1.4×, 11.6×] to the inputs by construction; the central claims rest on held-out benchmark generalization measured independently of the definition.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the ablation isolating augmentation share and on the assumption that the ten-benchmark suite is a sufficient proxy for useful generalization; the plausible cost ratio range is explored rather than fitted.

free parameters (1)

c_human / c_aug cost ratio
The reported interval for ρ_cost is obtained by varying this ratio over a plausible range; the ratio itself is treated as an external input rather than fitted inside the study.

axioms (1)

domain assumption Gate-filtered augmentations of the hand-authored base produce training signal of usable quality for RLVR
This premise is required for the substitution to be meaningful and is invoked when the abstract states that the augmentations serve as a substitute for additional human curation.

pith-pipeline@v0.9.1-grok · 5776 in / 1640 out tokens · 64681 ms · 2026-06-28T11:07:28.072588+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 28 canonical work pages · 12 internal anchors

[1]

arXiv preprint arXiv:2506.11425 , year=

Da, J., Wang, C., Deng, X., Ma, Y ., Barhate, N., and Hendryx, S. Agent-RLVR: Training software engineering agents via guidance and environment rewards.arXiv:2506.11425, 2025. URL:https://arxiv.org/abs/2506.11425

work page arXiv 2025
[2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv:2501.12948, 2025. URL: https://arxiv.org/abs/2501.12948. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

arXiv preprint arXiv:2502.19655 , year=

Zhang, S., Liu, Q., Qin, G., Naumann, T., and Poon, H. Med-RLVR: Emerging medical reasoning from a 3B base model via reinforcement learning.arXiv:2502.19655, 2025. URL: https://arxiv.org/abs/2502.19655

work page arXiv 2025
[4]

ReSyn: Autonomously scaling synthetic environments for reasoning models.arXiv:2602.20117, 2026

He, A., Weir, N., Bostrom, K., Nie, A., Cassel, D., et al. ReSyn: Autonomously scaling synthetic environments for reasoning models.arXiv:2602.20117, 2026. URL: https://arxiv.org/ abs/2602.20117

work page arXiv 2026
[5]

Zhang, X

Zhang, H., Liu, X., Lv, B., Sun, X., Jing, B., et al. AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework.arXiv:2510.04206, 2025. URL: https: //arxiv.org/abs/2510.04206

work page arXiv 2025
[6]

Prorl agent: Rollout-as-a-service for rl training of multi- turn llm agents,

Zhang, H., Liu, M., Zhang, S., Han, S., Hu, J., et al. ProRL Agent: Rollout-as-a-Service for RL training of multi-turn LLM agents.arXiv:2603.18815, 2026. URL: https://arxiv.org/ abs/2603.18815

work page arXiv 2026
[8]

URL:https://arxiv.org/abs/2504.13837

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., et al. Evaluating large language models trained on code.arXiv:2107.03374, 2021. URL: https://arxiv.org/abs/2107. 03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

TinyV: Reducing false negatives in verification improves RL for LLM reasoning.arXiv:2505.14625, 2025

Xu, Z., Li, Y ., Liu, Z., Yu, X., Wang, J., et al. TinyV: Reducing false negatives in verification improves RL for LLM reasoning.arXiv:2505.14625, 2025. URL: https://arxiv.org/abs/ 2505.14625

work page arXiv 2025
[11]

W., Fried, D., Wang, S., and Yu, T

Lai, Y ., Li, C., Wang, Y ., Zhang, T., Zhong, R., Zettlemoyer, L., Yih, S. W., Fried, D., Wang, S., and Yu, T. DS-1000: A natural and reliable benchmark for data science code generation. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023. URL: https://arxiv.org/abs/2211.11501

work page arXiv 2023
[12]

Let's Verify Step by Step

Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. InInternational Conference on Learning Representations (ICLR), 2024. URL:https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction- following evaluation for large language models.arXiv:2311.07911, 2023. URL: https: //arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

arXiv preprint arXiv:2502.19187 , year=

Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., et al. BIG-Bench Extra Hard. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025.arXiv:2502.19187. URL:https://arxiv.org/abs/2502.19187

work page arXiv 2025
[15]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Wang, Y ., Ma, X., Zhang, G., Ni, Y ., Chandra, A., et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems 37 (NeurIPS), Datasets and Benchmarks Track, 2024. URL: https: //arxiv.org/abs/2406.01574

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y ., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. InConference on Language Modeling (COLM), 2024. URL:https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

G., Mao, H., Cheng-Jie Ji, C., Yan, F., Suresh, V ., et al

Patil, S. G., Mao, H., Cheng-Jie Ji, C., Yan, F., Suresh, V ., et al. The Berkeley Function-Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models.arXiv,
[18]

URL:https://gorilla.cs.berkeley.edu/leaderboard.html
[19]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Barres, V ., Trinh, H., Yao, S., et al.τ 2-Bench: Evaluating conversational agents in a dual-control environment.arXiv:2506.07982, 2025. URL:https://arxiv.org/abs/2506.07982

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

A., Fei-Fei, L., and Bernstein, M

Krishna, R., Hata, K., Chen, S., Kravitz, J., Shamma, D. A., Fei-Fei, L., and Bernstein, M. S. Embracing error to enable rapid crowdsourcing. InProceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 3167–3179, 2016. DOI: 10.1145/2858036.2858115. 13

work page doi:10.1145/2858036.2858115 2016
[21]

Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Rühle, V ., Lakshmanan, L. V . S., and Awadallah, A. H. Hybrid LLM: Cost-efficient and quality-aware query routing. In International Conference on Learning Representations (ICLR), 2024.arXiv:2404.14618. URL: https://arxiv.org/abs/2404.14618

work page arXiv 2024
[22]

Token-budget-aware LLM reasoning

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., and Chen, Z. Token-budget-aware LLM reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, 2025. arXiv:2412.18547. URL:https://aclanthology.org/2025.findings-acl.1274/

work page arXiv 2025
[23]

Reinforcement learning with augmented data

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. InAdvances in Neural Information Processing Systems 33 (NeurIPS), 2020.arXiv:2004.14990. URL:https://arxiv.org/abs/2004.14990

work page arXiv 2020
[24]

Image augmentation is all you need: Regularizing deep reinforcement learning from pixels

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. InInternational Conference on Learning Representations (ICLR), 2021.arXiv:2004.13649. URL:https://arxiv.org/abs/2004.13649

work page arXiv 2021
[25]

Emergent complexity and zero-shot transfer via unsupervised environment design

Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., and Levine, S. Emergent complexity and zero-shot transfer via unsupervised environment design. InAdvances in Neural Information Processing Systems 33 (NeurIPS), 2020.arXiv:2012.02096. URL: https: //arxiv.org/abs/2012.02096

work page arXiv 2020
[26]

Prioritized Level Replay , journal =

Jiang, M., Grefenstette, E., and Rocktäschel, T. Prioritized level replay. InProceedings of the 38th International Conference on Machine Learning (ICML), 2021.arXiv:2010.03934. URL: https://arxiv.org/abs/2010.03934

work page arXiv 2021
[27]

would you rather

Köpf, A., Kilcher, Y ., von Rütte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. OpenAssistant Conversations – democratizing large language model alignment. InAdvances in Neural Information Processing Systems 36 (NeurIPS), Datasets and Benchmarks Track, 2023. URL:https://arxiv.org/abs/2304.07327

work page arXiv 2023
[28]

Lambert, N., Morrison, J., Pyatkin, V ., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V ., Liu, A., Dziri, N., et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv:2411.15124, 2024. URL:https://arxiv.org/abs/2411.15124

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Training Software Engineering Agents and Verifiers with SWE-Gym

Pan, J., Wang, X., Neubig, G., Jaitly, N., Ji, H., Suhr, A., and Zhang, Y . Training software engineering agents and verifiers with SWE-Gym. InInternational Conference on Machine Learning (ICML), 2025. URL:https://arxiv.org/abs/2412.21139

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Wang, Y ., Kordi, Y ., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self- Instruct: Aligning language models with self-generated instructions. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023. URL: https://arxiv.org/abs/ 2212.10560

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering large pre-trained language models to follow complex instructions. InInternational Conference on Learning Representations (ICLR), 2024. URL: https://arxiv.org/abs/ 2304.12244

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

and Imbens, G

Athey, S. and Imbens, G. W. The State of Applied Econometrics: Causality and Policy Evaluation.Journal of Economic Perspectives, 31(2):3–32, 2017

2017
[33]

does augmentation expose qualitatively different model behaviour?

Saito, Y . and Joachims, T. Counterfactual Evaluation and Learning for Interactive Systems. Tutorial at the28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022. Technical appendices.The sections below collect full hyperparameters, training and evaluation infrastructure, augmentation and verification detail, quality-gate operating decisi...

2022

[1] [1]

arXiv preprint arXiv:2506.11425 , year=

Da, J., Wang, C., Deng, X., Ma, Y ., Barhate, N., and Hendryx, S. Agent-RLVR: Training software engineering agents via guidance and environment rewards.arXiv:2506.11425, 2025. URL:https://arxiv.org/abs/2506.11425

work page arXiv 2025

[2] [2]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.arXiv:2501.12948, 2025. URL: https://arxiv.org/abs/2501.12948. 12

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

arXiv preprint arXiv:2502.19655 , year=

Zhang, S., Liu, Q., Qin, G., Naumann, T., and Poon, H. Med-RLVR: Emerging medical reasoning from a 3B base model via reinforcement learning.arXiv:2502.19655, 2025. URL: https://arxiv.org/abs/2502.19655

work page arXiv 2025

[4] [4]

ReSyn: Autonomously scaling synthetic environments for reasoning models.arXiv:2602.20117, 2026

He, A., Weir, N., Bostrom, K., Nie, A., Cassel, D., et al. ReSyn: Autonomously scaling synthetic environments for reasoning models.arXiv:2602.20117, 2026. URL: https://arxiv.org/ abs/2602.20117

work page arXiv 2026

[5] [5]

Zhang, X

Zhang, H., Liu, X., Lv, B., Sun, X., Jing, B., et al. AgentRL: Scaling agentic reinforcement learning with a multi-turn, multi-task framework.arXiv:2510.04206, 2025. URL: https: //arxiv.org/abs/2510.04206

work page arXiv 2025

[6] [6]

Prorl agent: Rollout-as-a-service for rl training of multi- turn llm agents,

Zhang, H., Liu, M., Zhang, S., Han, S., Hu, J., et al. ProRL Agent: Rollout-as-a-Service for RL training of multi-turn LLM agents.arXiv:2603.18815, 2026. URL: https://arxiv.org/ abs/2603.18815

work page arXiv 2026

[7] [8]

URL:https://arxiv.org/abs/2504.13837

work page internal anchor Pith review Pith/arXiv arXiv

[8] [9]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., et al. Evaluating large language models trained on code.arXiv:2107.03374, 2021. URL: https://arxiv.org/abs/2107. 03374

work page internal anchor Pith review Pith/arXiv arXiv 2021

[9] [10]

TinyV: Reducing false negatives in verification improves RL for LLM reasoning.arXiv:2505.14625, 2025

Xu, Z., Li, Y ., Liu, Z., Yu, X., Wang, J., et al. TinyV: Reducing false negatives in verification improves RL for LLM reasoning.arXiv:2505.14625, 2025. URL: https://arxiv.org/abs/ 2505.14625

work page arXiv 2025

[10] [11]

W., Fried, D., Wang, S., and Yu, T

Lai, Y ., Li, C., Wang, Y ., Zhang, T., Zhong, R., Zettlemoyer, L., Yih, S. W., Fried, D., Wang, S., and Yu, T. DS-1000: A natural and reliable benchmark for data science code generation. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023. URL: https://arxiv.org/abs/2211.11501

work page arXiv 2023

[11] [12]

Let's Verify Step by Step

Lightman, H., Kosaraju, V ., Burda, Y ., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let’s verify step by step. InInternational Conference on Learning Representations (ICLR), 2024. URL:https://arxiv.org/abs/2305.20050

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [13]

Instruction-Following Evaluation for Large Language Models

Zhou, J., Lu, T., Mishra, S., Brahma, S., Basu, S., Luan, Y ., Zhou, D., and Hou, L. Instruction- following evaluation for large language models.arXiv:2311.07911, 2023. URL: https: //arxiv.org/abs/2311.07911

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [14]

arXiv preprint arXiv:2502.19187 , year=

Kazemi, M., Fatemi, B., Bansal, H., Palowitch, J., Anastasiou, C., et al. BIG-Bench Extra Hard. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), 2025.arXiv:2502.19187. URL:https://arxiv.org/abs/2502.19187

work page arXiv 2025

[14] [15]

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Wang, Y ., Ma, X., Zhang, G., Ni, Y ., Chandra, A., et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. InAdvances in Neural Information Processing Systems 37 (NeurIPS), Datasets and Benchmarks Track, 2024. URL: https: //arxiv.org/abs/2406.01574

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [16]

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y ., Dirani, J., Michael, J., and Bowman, S. R. GPQA: A graduate-level Google-proof Q&A benchmark. InConference on Language Modeling (COLM), 2024. URL:https://arxiv.org/abs/2311.12022

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [17]

G., Mao, H., Cheng-Jie Ji, C., Yan, F., Suresh, V ., et al

Patil, S. G., Mao, H., Cheng-Jie Ji, C., Yan, F., Suresh, V ., et al. The Berkeley Function-Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models.arXiv,

[17] [18]

URL:https://gorilla.cs.berkeley.edu/leaderboard.html

[18] [19]

$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Barres, V ., Trinh, H., Yao, S., et al.τ 2-Bench: Evaluating conversational agents in a dual-control environment.arXiv:2506.07982, 2025. URL:https://arxiv.org/abs/2506.07982

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [20]

A., Fei-Fei, L., and Bernstein, M

Krishna, R., Hata, K., Chen, S., Kravitz, J., Shamma, D. A., Fei-Fei, L., and Bernstein, M. S. Embracing error to enable rapid crowdsourcing. InProceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 3167–3179, 2016. DOI: 10.1145/2858036.2858115. 13

work page doi:10.1145/2858036.2858115 2016

[20] [21]

Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Rühle, V ., Lakshmanan, L. V . S., and Awadallah, A. H. Hybrid LLM: Cost-efficient and quality-aware query routing. In International Conference on Learning Representations (ICLR), 2024.arXiv:2404.14618. URL: https://arxiv.org/abs/2404.14618

work page arXiv 2024

[21] [22]

Token-budget-aware LLM reasoning

Han, T., Wang, Z., Fang, C., Zhao, S., Ma, S., and Chen, Z. Token-budget-aware LLM reasoning. InFindings of the Association for Computational Linguistics: ACL 2025, 2025. arXiv:2412.18547. URL:https://aclanthology.org/2025.findings-acl.1274/

work page arXiv 2025

[22] [23]

Reinforcement learning with augmented data

Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., and Srinivas, A. Reinforcement learning with augmented data. InAdvances in Neural Information Processing Systems 33 (NeurIPS), 2020.arXiv:2004.14990. URL:https://arxiv.org/abs/2004.14990

work page arXiv 2020

[23] [24]

Image augmentation is all you need: Regularizing deep reinforcement learning from pixels

Kostrikov, I., Yarats, D., and Fergus, R. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. InInternational Conference on Learning Representations (ICLR), 2021.arXiv:2004.13649. URL:https://arxiv.org/abs/2004.13649

work page arXiv 2021

[24] [25]

Emergent complexity and zero-shot transfer via unsupervised environment design

Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., and Levine, S. Emergent complexity and zero-shot transfer via unsupervised environment design. InAdvances in Neural Information Processing Systems 33 (NeurIPS), 2020.arXiv:2012.02096. URL: https: //arxiv.org/abs/2012.02096

work page arXiv 2020

[25] [26]

Prioritized Level Replay , journal =

Jiang, M., Grefenstette, E., and Rocktäschel, T. Prioritized level replay. InProceedings of the 38th International Conference on Machine Learning (ICML), 2021.arXiv:2010.03934. URL: https://arxiv.org/abs/2010.03934

work page arXiv 2021

[26] [27]

would you rather

Köpf, A., Kilcher, Y ., von Rütte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. OpenAssistant Conversations – democratizing large language model alignment. InAdvances in Neural Information Processing Systems 36 (NeurIPS), Datasets and Benchmarks Track, 2023. URL:https://arxiv.org/abs/2304.07327

work page arXiv 2023

[27] [28]

Lambert, N., Morrison, J., Pyatkin, V ., Huang, S., Ivison, H., Brahman, F., Miranda, L. J. V ., Liu, A., Dziri, N., et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv:2411.15124, 2024. URL:https://arxiv.org/abs/2411.15124

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [29]

Training Software Engineering Agents and Verifiers with SWE-Gym

Pan, J., Wang, X., Neubig, G., Jaitly, N., Ji, H., Suhr, A., and Zhang, Y . Training software engineering agents and verifiers with SWE-Gym. InInternational Conference on Machine Learning (ICML), 2025. URL:https://arxiv.org/abs/2412.21139

work page internal anchor Pith review Pith/arXiv arXiv 2025

[29] [30]

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Wang, Y ., Kordi, Y ., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self- Instruct: Aligning language models with self-generated instructions. InAnnual Meeting of the Association for Computational Linguistics (ACL), 2023. URL: https://arxiv.org/abs/ 2212.10560

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [31]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D. WizardLM: Empowering large pre-trained language models to follow complex instructions. InInternational Conference on Learning Representations (ICLR), 2024. URL: https://arxiv.org/abs/ 2304.12244

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [32]

and Imbens, G

Athey, S. and Imbens, G. W. The State of Applied Econometrics: Causality and Policy Evaluation.Journal of Economic Perspectives, 31(2):3–32, 2017

2017

[32] [33]

does augmentation expose qualitatively different model behaviour?

Saito, Y . and Joachims, T. Counterfactual Evaluation and Learning for Interactive Systems. Tutorial at the28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022. Technical appendices.The sections below collect full hyperparameters, training and evaluation infrastructure, augmentation and verification detail, quality-gate operating decisi...

2022