arxiv: 2412.21139 · v2 · submitted 2024-12-30 · 💻 cs.SE · cs.CL

Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

Pith reviewed 2026-05-18 05:15 UTC · model grok-4.3

classification 💻 cs.SE cs.CL

keywords software engineering agents · SWE-Bench · language model fine-tuning · agent verifiers · GitHub task instances · executable environments · agent trajectories · inference-time scaling

The pith

Training agents and verifiers on 2,438 real GitHub tasks produces new state-of-the-art resolve rates of 32 percent and 26 percent on SWE-Bench Verified and Lite for open-weight models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-Gym, an environment built from 2,438 executable Python task instances drawn from actual GitHub repositories, each including a full codebase, runtime, unit tests, and a natural-language description of the required change. Language-model agents fine-tuned inside this environment improve their ability to resolve issues on the SWE-Bench Verified and Lite test sets by up to 19 absolute percentage points. The authors further train verifiers on trajectories generated by the agents and show that combining the two at inference time yields the reported 32 percent and 26 percent scores. A reader would care because the work supplies both data and a concrete training recipe that demonstrably lifts practical coding performance without requiring proprietary model weights.
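As a rough illustration of what one task instance bundles together, here is a minimal Python sketch. The class name, field names, and example values are hypothetical and chosen only for illustration; they are not taken from the SWE-Gym release. The sketch also assumes that a task counts as resolved when all of its designated unit tests pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One training task; every field name here is an illustrative assumption."""
    repo: str                        # e.g. "owner/project" on GitHub
    base_commit: str                 # commit the codebase is checked out at
    problem_statement: str           # natural-language description of the change
    runtime_image: str               # identifier of the executable environment
    test_ids: tuple[str, ...] = ()   # unit tests that decide success


def is_resolved(instance: TaskInstance, passed: set[str]) -> bool:
    """Assumed criterion: resolved only if every designated test passed."""
    return bool(instance.test_ids) and all(t in passed for t in instance.test_ids)


# Placeholder values, not a real SWE-Gym task.
example = TaskInstance(
    repo="example-org/example-repo",
    base_commit="0000000",
    problem_statement="Fix the off-by-one error in pagination.",
    runtime_image="example-image:latest",
    test_ids=("tests/test_pagination.py::test_last_page",),
)
print(is_resolved(example, {"tests/test_pagination.py::test_last_page"}))  # True
```

Keeping the tests inside the instance is what makes the environment usable for training: success can be checked by execution rather than by comparing text against a reference patch.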

Core claim

We present SWE-Gym, the first environment for training real-world software engineering agents, containing 2,438 real-world Python task instances each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19 percent absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0 percent and 26.0 percent on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents.

What carries the argument

SWE-Gym environment of executable real-world tasks, used both to fine-tune agents and to generate trajectories for training separate verifiers that rerank or filter agent outputs at inference time.
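The rerank step can be pictured as best-of-n selection: sample several candidate trajectories for a task, score each with the verifier, and keep the highest-scoring one. The sketch below assumes the verifier is exposed as a function returning a score; `Candidate`, `pick_best`, and `toy_verifier` are hypothetical names for illustration, not the paper's code.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Candidate:
    patch: str        # proposed code change
    trajectory: str   # serialized agent interaction log


def pick_best(candidates: Sequence[Candidate],
              verifier: Callable[[Candidate], float]) -> Candidate:
    """Return the candidate the verifier scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


def toy_verifier(c: Candidate) -> float:
    # Stand-in scorer for illustration only: favors trajectories that ran tests.
    return 1.0 if "ran tests" in c.trajectory else 0.0


cands = [
    Candidate(patch="diff A", trajectory="edited file"),
    Candidate(patch="diff B", trajectory="edited file; ran tests"),
]
print(pick_best(cands, toy_verifier).patch)  # diff B
```

In this setup the agent's weights stay fixed; extra inference compute is spent on sampling more candidates and scoring them.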

If this is right

Fine-tuned agents alone already improve resolve rates on SWE-Bench by as much as 19 percentage points.
Inference-time verifiers trained on the same trajectories supply an additional performance boost without further model updates.
The public release of the environment, trained models, and collected trajectories directly enables other groups to reproduce and extend the results.
The same training-plus-verification pattern can be applied to larger language models or to additional task distributions.
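Because the gains are quoted in absolute percentage points, it helps to keep absolute and relative improvement apart. The short sketch below shows the arithmetic; the function names are hypothetical and the counts are invented for illustration, not results from the paper.

```python
def resolve_rate(resolved: int, total: int) -> float:
    """Percentage of task instances resolved."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * resolved / total


def absolute_gain(before: float, after: float) -> float:
    """Difference between two rates, in percentage points."""
    return after - before


def relative_gain(before: float, after: float) -> float:
    """Improvement as a share of the starting rate, in percent."""
    if before == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (after - before) / before


# Invented counts for illustration only.
base = resolve_rate(50, 500)     # 10.0
tuned = resolve_rate(100, 500)   # 20.0
print(absolute_gain(base, tuned))  # 10.0 percentage points
print(relative_gain(base, tuned))  # 100.0 percent
```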

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gym-style environments could be assembled for other programming languages or for non-bug-fixing software tasks such as feature addition or refactoring.
If the current gains scale with more tasks or larger models, open-weight agents may close a larger fraction of the gap to closed-source systems on real repositories.
Verifier training on agent trajectories offers a general way to add test-time compute to any multi-step agent workflow, not limited to software engineering.

Load-bearing premise

The 2,438 tasks collected from real GitHub repositories are sufficiently representative of the distribution of software engineering problems that agents will encounter in practice.

What would settle it

Measure whether the reported gains hold on a new collection of coding tasks drawn from repositories and issue types that were never seen during SWE-Gym construction or training.

Original abstract

We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

SWE-Gym supplies a new training resource for SWE agents with reported gains on SWE-Bench, but the results need checks for overlap with the evaluation sets.

Hey colleague,

The main thing here is SWE-Gym: they've put together 2,438 real-world Python tasks with full executable environments for training software engineering agents, and report solid gains on SWE-Bench when using it for fine-tuning. They also get further lifts from verifiers trained on the agent trajectories.

They've done something useful by shifting focus to training data rather than just evaluation benchmarks. SWE-Bench and similar are for testing, but this provides a gym-like setup with codebases, runtimes, unit tests, and natural language tasks drawn from GitHub. The fine-tuned agents show up to 19% absolute improvement in resolve rate, and pairing them with the verifiers reaches 32.0% on Verified and 26.0% on Lite, which they position as new SOTA for open-weight models. Making the environment, models, and trajectories public is a real contribution that lets others experiment directly.

Where it could be stronger is on the separation between training and test data. Both SWE-Gym and SWE-Bench pull from real repositories with similar structures, so any shared repos or patterns could lead to unintended memorization instead of genuine progress. The paper needs to address this head-on, perhaps with explicit checks for disjoint repositories or issues. The experimental details on training procedures and ablations are not fully visible in the abstract, which leaves the strength of the performance claims a bit open until the full methods are reviewed. The representativeness of the 2,438 tasks for broader software engineering problems is an assumption that holds for now but would benefit from more justification.

This is aimed at researchers developing AI agents for coding and maintenance work. It gives them a practical dataset and some evidence that training on it helps, which could speed up progress in the area. The work shows honest engagement with prior benchmarks and supplies reproducible artifacts. It has enough substance to go to peer review.

Recommendation: Send it through, but make sure reviewers examine the data overlap and reproducibility of the results.

Referee Report

2 major / 2 minor

Summary. The paper introduces SWE-Gym, an environment with 2,438 real-world Python task instances drawn from GitHub repositories, each including a codebase, runtime, unit tests, and natural-language task description. The authors fine-tune language-model-based SWE agents on SWE-Gym and report up to 19% absolute gains in resolve rate on SWE-Bench Verified and Lite. They further train verifiers on agent trajectories sampled from SWE-Gym; when combined with the fine-tuned agents, the system reaches 32.0% on Verified and 26.0% on Lite, which the authors present as a new state-of-the-art for open-weight SWE agents. The environment, models, and trajectories are released publicly.

Significance. If the reported gains are free of data contamination, the work supplies a much-needed large-scale training resource for realistic software-engineering agents and shows that trajectory-based fine-tuning plus inference-time verification can produce substantial improvements over base models. The public release of the full environment, models, and trajectories is a clear strength that enables direct replication and extension by the community.

major comments (2)

[Data collection / experimental setup] The manuscript does not appear to report an explicit check for repository, commit, or issue overlap between the 2,438 SWE-Gym training instances and the SWE-Bench Verified/Lite test sets. Because both resources draw from real GitHub Python repositories with natural-language issues and executable tests, any shared instances would allow partial memorization rather than genuine generalization, directly undermining the 19% absolute gains and the open-weight SOTA claim.
[Results section (performance tables)] The headline numbers (32.0% Verified, 26.0% Lite) are presented without accompanying ablation tables that isolate the contribution of SWE-Gym fine-tuning versus the verifier, or that report variance across multiple random seeds. Without these controls it is difficult to assess whether the combined system truly establishes a new frontier or whether the gains are sensitive to particular hyper-parameter choices.
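The overlap check raised in the first major comment comes down to set intersections over instance identifiers. The following is a minimal sketch under assumptions: the `InstanceKey` schema and `find_overlap` helper are hypothetical, and the identifiers are placeholders rather than real dataset entries.

```python
from typing import Iterable, NamedTuple


class InstanceKey(NamedTuple):
    """Identifiers for one task instance (hypothetical schema)."""
    repo: str      # e.g. "owner/project"
    commit: str    # base commit hash
    issue_id: str  # issue or pull-request identifier


def find_overlap(train: Iterable[InstanceKey],
                 test: Iterable[InstanceKey]) -> dict[str, set]:
    """Report identifiers shared between a training set and a test set."""
    train, test = list(train), list(test)
    return {
        # Same repository appears on both sides.
        "repos": {k.repo for k in train} & {k.repo for k in test},
        # Same commit of the same repository appears on both sides.
        "commits": {(k.repo, k.commit) for k in train}
                   & {(k.repo, k.commit) for k in test},
        # Same issue of the same repository appears on both sides.
        "issues": {(k.repo, k.issue_id) for k in train}
                  & {(k.repo, k.issue_id) for k in test},
    }


# Placeholder identifiers for illustration; not real dataset entries.
train_keys = [InstanceKey("org-a/lib", "abc123", "42"),
              InstanceKey("org-b/tool", "def456", "7")]
test_keys = [InstanceKey("org-b/tool", "999fff", "8"),
             InstanceKey("org-c/app", "123abc", "1")]

report = find_overlap(train_keys, test_keys)
print(report["repos"])    # {'org-b/tool'}
print(report["commits"])  # set()
print(report["issues"])   # set()
```

Reporting the three levels separately matters: a shared repository with disjoint commits and issues is a weaker form of leakage than a shared instance, and a reader may want to see both.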

minor comments (2)

[Abstract / Introduction] The abstract and introduction use “up to 19% absolute gains” without specifying the exact base model and baseline resolve rates for each benchmark; adding a short table or parenthetical values would improve clarity.
[Figures] Figure captions for the agent-trajectory and verifier diagrams should explicitly state the number of trajectories sampled per task and the filtering criteria applied before verifier training.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the value of SWE-Gym as a training resource. We address each major comment below with specific responses and indicate where revisions will be made.

Referee: [Data collection / experimental setup] The manuscript does not appear to report an explicit check for repository, commit, or issue overlap between the 2,438 SWE-Gym training instances and the SWE-Bench Verified/Lite test sets. Because both resources draw from real GitHub Python repositories with natural-language issues and executable tests, any shared instances would allow partial memorization rather than genuine generalization, directly undermining the 19% absolute gains and the open-weight SOTA claim.

Authors: We agree that an explicit overlap analysis is essential to substantiate the generalization claims. We have now performed a systematic check across repositories, commit hashes, and issue identifiers between the SWE-Gym training set and both SWE-Bench Verified and Lite. No overlapping instances were found. We will add a dedicated subsection describing the overlap detection methodology, the results (zero overlaps), and the implications for the reported gains in the revised manuscript.
revision: yes

Referee: [Results section (performance tables)] The headline numbers (32.0% Verified, 26.0% Lite) are presented without accompanying ablation tables that isolate the contribution of SWE-Gym fine-tuning versus the verifier, or that report variance across multiple random seeds. Without these controls it is difficult to assess whether the combined system truly establishes a new frontier or whether the gains are sensitive to particular hyper-parameter choices.

Authors: We acknowledge that the current results section would benefit from clearer isolation of contributions and robustness checks. In the revision we will add ablation tables that separately report (1) the base model, (2) the SWE-Gym fine-tuned agent without verifier, (3) the verifier applied to the base model, and (4) the full combined system. We will also include results from at least three independent random seeds for the fine-tuning runs, reporting mean and standard deviation to demonstrate stability of the observed improvements.
revision: yes
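The seed-level reporting described in this response amounts to a mean and sample standard deviation per configuration. A minimal sketch follows, using Python's standard `statistics` module; the `summarize` helper is hypothetical and all rates are invented placeholders, not numbers from the paper.

```python
from statistics import mean, stdev


def summarize(rates: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of resolve rates across seeds."""
    if len(rates) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(rates), stdev(rates)


# Invented resolve rates (percent) for three seeds per configuration.
runs = {
    "base model": [10.0, 11.0, 12.0],
    "fine-tuned agent": [20.0, 22.0, 24.0],
    "base model + verifier": [13.0, 14.0, 15.0],
    "fine-tuned agent + verifier": [30.0, 31.0, 32.0],
}

for name, rates in runs.items():
    m, s = summarize(rates)
    print(f"{name}: {m:.1f} +/- {s:.1f}")
```

With only three seeds the standard deviation is a coarse estimate, so a table built this way is best read as a stability check rather than a significance test.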

Circularity Check

0 steps flagged

No circularity: empirical training and external-benchmark evaluation

The paper reports results from collecting 2,438 task instances, fine-tuning agents on SWE-Gym trajectories, training verifiers on those trajectories, and measuring resolve rates on the separate SWE-Bench Verified and Lite test sets. No equations, derivations, or self-referential definitions appear; performance numbers are direct empirical measurements on held-out benchmarks rather than quantities forced by internal fits or self-citations. The evaluation chain is therefore self-contained against external data and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions of supervised fine-tuning and the representativeness of the collected tasks; no new mathematical entities or ad-hoc constants are introduced.

axioms (1)

domain assumption: Fine-tuning on agent trajectories collected inside executable environments improves downstream resolve rate on SWE-Bench.
Invoked when the authors report gains from training on SWE-Gym trajectories.

pith-pipeline@v0.9.0 · 5697 in / 1246 out tokens · 37995 ms · 2026-05-18T05:15:47.465728+00:00 · methodology

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SWE-smith: Scaling Data for Software Engineering Agents
cs.SE 2025-04 conditional novelty 8.0

SWE-smith scales software engineering training data to 50k instances across 128 repositories, enabling SWE-agent-LM-32B to achieve 40.2% Pass@1 on SWE-bench Verified, state of the art among open-source models.

AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
cs.SE 2026-05 conditional novelty 7.0

10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
cs.SE 2026-04 unverdicted novelty 7.0

REAP automatically curates production-derived benchmarks for AI coding agents via LLM classification and stability checks, producing the Harvest benchmark with model solve rates of 42.9-58.2%.

Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
cs.LG 2026-03 unverdicted novelty 7.0

A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
cs.SE 2025-12 unverdicted novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
cs.CL 2025-11 unverdicted novelty 7.0

Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
cs.SE 2025-11 unverdicted novelty 7.0

Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
cs.SE 2025-02 unverdicted novelty 7.0

SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

Revisiting DAgger in the Era of LLM-Agents
cs.LG 2026-05 conditional novelty 6.0

DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

LLM-Guided Issue Generation from Uncovered Code Segments
cs.SE 2026-04 unverdicted novelty 6.0

IssueSpecter combines coverage analysis with LLM defect detection to generate prioritized, actionable issue reports, achieving 84.6% validity on manually reviewed issues from 13 Python projects and outperforming a cov...

CoT-Guard: Small Models for Strong Monitoring
cs.CR 2026-05 unverdicted novelty 5.0

CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.

LLM-Guided Issue Generation from Uncovered Code Segments
cs.SE 2026-04 unverdicted novelty 5.0

IssueSpecter generates prioritized actionable bug reports from uncovered code using LLMs and coverage analysis, with 84.6% validity in manual checks on top issues from 13 Python projects and outperforming a baseline tool.

Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
cs.SE 2026-04 unverdicted novelty 5.0

LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.

Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
cs.SE 2026-04 unverdicted novelty 5.0

STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
cs.AI 2025-09 conditional novelty 5.0

UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
cs.AI 2025-04 accept novelty 4.0

A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
