Training Software Engineering Agents and Verifiers with SWE-Gym
Pith reviewed 2026-05-18 05:15 UTC · model grok-4.3
The pith
Training agents and verifiers on 2,438 real GitHub tasks produces new state-of-the-art resolve rates of 32 percent and 26 percent on SWE-Bench Verified and Lite for open-weight models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present SWE-Gym, the first environment for training real-world software engineering agents, containing 2,438 real-world Python task instances each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19 percent absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0 percent and 26.0 percent on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents.
What carries the argument
SWE-Gym environment of executable real-world tasks, used both to fine-tune agents and to generate trajectories for training separate verifiers that rerank or filter agent outputs at inference time.
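The reranking step can be pictured with a minimal Python sketch of best-of-n selection. This is an illustration under stated assumptions, not the paper's implementation: `Candidate`, `rerank_best_of_n`, and `toy_verifier` are hypothetical names, and the toy scorer stands in for a trained verifier model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One agent rollout for a task (hypothetical structure)."""
    trajectory: str  # serialized interaction log
    patch: str       # final diff proposed by the agent


def rerank_best_of_n(
    candidates: Sequence[Candidate],
    verifier_score: Callable[[str], float],
) -> Candidate:
    """Return the candidate whose trajectory the verifier scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first candidate on ties, so the choice is deterministic.
    return max(candidates, key=lambda c: verifier_score(c.trajectory))


def toy_verifier(trajectory: str) -> float:
    """Stand-in scorer (assumption): a real verifier would be a trained model."""
    return 1.0 if "tests passed" in trajectory else 0.0


candidates = [
    Candidate("edited file; tests failed", "patch-a"),
    Candidate("edited file; tests passed", "patch-b"),
]
best = rerank_best_of_n(candidates, toy_verifier)
```

Only the verifier is consulted at selection time, which is why this kind of scaling needs no further updates to the agent model.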
If this is right
- Fine-tuned agents alone already improve resolve rates on SWE-Bench by as much as 19 percentage points.
- Inference-time verifiers trained on the same trajectories supply an additional performance boost without further model updates.
- The public release of the environment, trained models, and collected trajectories directly enables other groups to reproduce and extend the results.
- The same training-plus-verification pattern can be applied to larger language models or to additional task distributions.
Where Pith is reading between the lines
- Similar gym-style environments could be assembled for other programming languages or for non-bug-fixing software tasks such as feature addition or refactoring.
- If the current gains scale with more tasks or larger models, open-weight agents may close a larger fraction of the gap to closed-source systems on real repositories.
- Verifier training on agent trajectories offers a general way to add test-time compute to any multi-step agent workflow, not limited to software engineering.
Load-bearing premise
The 2,438 tasks collected from real GitHub repositories are sufficiently representative of the distribution of software engineering problems that agents will encounter in practice.
What would settle it
Measure whether the reported gains hold on a new collection of coding tasks drawn from repositories and issue types that were never seen during SWE-Gym construction or training.
Original abstract
We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.
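As an illustration of the task format and the metric named in the abstract, here is a minimal Python sketch. The field names (`repo`, `problem_statement`, `test_command`) and the `resolve_rate` helper are hypothetical and are not taken from the SWE-Gym release. Resolve rate is assumed here to mean the percentage of task instances whose unit tests pass after the agent's patch is applied.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskInstance:
    """Hypothetical shape of one task: codebase, tests, and description."""
    repo: str               # repository holding the codebase
    problem_statement: str  # task specified in natural language
    test_command: str       # command that runs the unit tests


def resolve_rate(resolved: Sequence[bool]) -> float:
    """Percentage of task instances marked as resolved."""
    if not resolved:
        raise ValueError("no task outcomes given")
    return 100.0 * sum(resolved) / len(resolved)


# Toy outcomes for four tasks; these are not results from the paper.
rate = resolve_rate([True, False, True, True])
```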
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SWE-Gym, an environment with 2,438 real-world Python task instances drawn from GitHub repositories, each including a codebase, runtime, unit tests, and natural-language task description. The authors fine-tune language-model-based SWE agents on SWE-Gym and report up to 19% absolute gains in resolve rate on SWE-Bench Verified and Lite. They further train verifiers on agent trajectories sampled from SWE-Gym; when combined with the fine-tuned agents, the system reaches 32.0% on Verified and 26.0% on Lite, which the authors present as a new state-of-the-art for open-weight SWE agents. The environment, models, and trajectories are released publicly.
Significance. If the reported gains are free of data contamination, the work supplies a much-needed large-scale training resource for realistic software-engineering agents and shows that trajectory-based fine-tuning plus inference-time verification can produce substantial improvements over base models. The public release of the full environment, models, and trajectories is a clear strength that enables direct replication and extension by the community.
major comments (2)
- [Data collection / experimental setup] The manuscript does not appear to report an explicit check for repository, commit, or issue overlap between the 2,438 SWE-Gym training instances and the SWE-Bench Verified/Lite test sets. Because both resources draw from real GitHub Python repositories with natural-language issues and executable tests, any shared instances would allow partial memorization rather than genuine generalization, directly undermining the 19% absolute gains and the open-weight SOTA claim.
- [Results section (performance tables)] The headline numbers (32.0% Verified, 26.0% Lite) are presented without accompanying ablation tables that isolate the contribution of SWE-Gym fine-tuning versus the verifier, or that report variance across multiple random seeds. Without these controls it is difficult to assess whether the combined system truly establishes a new frontier or whether the gains are sensitive to particular hyper-parameter choices.
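The overlap check requested in the first major comment could look roughly like the Python sketch below. It is a sketch under assumptions: `InstanceKey` and `find_overlaps` are hypothetical names, and the identifying fields (repository, base commit, issue identifier) are assumed to be available for both the training and test instances.

```python
from typing import Iterable, NamedTuple


class InstanceKey(NamedTuple):
    """Hypothetical identifiers for one task instance."""
    repo: str         # e.g. "owner/name"
    base_commit: str  # commit hash the task starts from
    issue_id: str     # issue or pull-request identifier


def find_overlaps(
    train: Iterable[InstanceKey], test: Iterable[InstanceKey]
) -> dict[str, set]:
    """Report overlap at repository, commit, and issue granularity."""
    train_keys = list(train)
    test_keys = list(test)
    return {
        "repos": {k.repo for k in train_keys} & {k.repo for k in test_keys},
        # Commits and issues are only comparable within the same repository.
        "commits": {(k.repo, k.base_commit) for k in train_keys}
        & {(k.repo, k.base_commit) for k in test_keys},
        "issues": {(k.repo, k.issue_id) for k in train_keys}
        & {(k.repo, k.issue_id) for k in test_keys},
    }


# Toy identifiers, not real SWE-Gym or SWE-Bench instances.
train = [InstanceKey("org/a", "c1", "1"), InstanceKey("org/b", "c2", "7")]
test = [InstanceKey("org/b", "c9", "8"), InstanceKey("org/c", "c3", "2")]
overlaps = find_overlaps(train, test)
```

Reporting the three levels separately matters because a shared repository with no shared commit or issue is a weaker form of leakage than a shared instance.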
minor comments (2)
- [Abstract / Introduction] The abstract and introduction use “up to 19% absolute gains” without specifying the exact base model and baseline resolve rates for each benchmark; adding a short table or parenthetical values would improve clarity.
- [Figures] Figure captions for the agent-trajectory and verifier diagrams should explicitly state the number of trajectories sampled per task and the filtering criteria applied before verifier training.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the value of SWE-Gym as a training resource. We address each major comment below with specific responses and indicate where revisions will be made.
Point-by-point responses
-
Referee: [Data collection / experimental setup] The manuscript does not appear to report an explicit check for repository, commit, or issue overlap between the 2,438 SWE-Gym training instances and the SWE-Bench Verified/Lite test sets. Because both resources draw from real GitHub Python repositories with natural-language issues and executable tests, any shared instances would allow partial memorization rather than genuine generalization, directly undermining the 19% absolute gains and the open-weight SOTA claim.
Authors: We agree that an explicit overlap analysis is essential to substantiate the generalization claims. We have now performed a systematic check across repositories, commit hashes, and issue identifiers between the SWE-Gym training set and both SWE-Bench Verified and Lite. No overlapping instances were found. We will add a dedicated subsection describing the overlap detection methodology, the results (zero overlaps), and the implications for the reported gains in the revised manuscript.
Revision: yes
-
Referee: [Results section (performance tables)] The headline numbers (32.0% Verified, 26.0% Lite) are presented without accompanying ablation tables that isolate the contribution of SWE-Gym fine-tuning versus the verifier, or that report variance across multiple random seeds. Without these controls it is difficult to assess whether the combined system truly establishes a new frontier or whether the gains are sensitive to particular hyper-parameter choices.
Authors: We acknowledge that the current results section would benefit from clearer isolation of contributions and robustness checks. In the revision we will add ablation tables that separately report (1) the base model, (2) the SWE-Gym fine-tuned agent without verifier, (3) the verifier applied to the base model, and (4) the full combined system. We will also include results from at least three independent random seeds for the fine-tuning runs, reporting mean and standard deviation to demonstrate stability of the observed improvements.
Revision: yes
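The seed-level reporting promised in the second response can be sketched in Python as follows. The helper name `summarize_seeds`, the configuration labels, and all numbers are placeholders chosen for illustration; none of them are results from the paper.

```python
from statistics import mean, stdev


def summarize_seeds(
    rates_by_config: dict[str, list[float]],
) -> dict[str, tuple[float, float]]:
    """Map each configuration to (mean, sample standard deviation)."""
    summary = {}
    for name, rates in rates_by_config.items():
        if len(rates) < 2:
            raise ValueError(f"{name}: need at least two seeds for a std dev")
        summary[name] = (mean(rates), stdev(rates))
    return summary


# Placeholder resolve rates (percent) for three seeds per configuration.
placeholder_runs = {
    "fine-tuned agent": [10.0, 12.0, 14.0],
    "fine-tuned agent + verifier": [20.0, 20.0, 20.0],
}
summary = summarize_seeds(placeholder_runs)
for name, (avg, sd) in summary.items():
    print(f"{name}: {avg:.1f} +/- {sd:.1f}")
```

The sample standard deviation (`statistics.stdev`) is used here because a handful of seeds is a sample of possible runs, not the full population.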
Circularity Check
No circularity: empirical training and external-benchmark evaluation
full rationale
The paper reports results from collecting 2,438 task instances, fine-tuning agents on SWE-Gym trajectories, training verifiers on those trajectories, and measuring resolve rates on the separate SWE-Bench Verified and Lite test sets. No equations, derivations, or self-referential definitions appear; performance numbers are direct empirical measurements on held-out benchmarks rather than quantities forced by internal fits or self-citations. The evaluation chain is therefore self-contained against external data and does not reduce to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fine-tuning on agent trajectories collected inside executable environments improves downstream resolve rate on SWE-Bench.
Forward citations
Cited by 17 Pith papers
-
SWE-smith: Scaling Data for Software Engineering Agents
SWE-smith scales software engineering training data to 50k instances across 128 repositories, enabling SWE-agent-LM-32B to achieve 40.2% Pass@1 on SWE-bench Verified, state of the art among open-source models.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
REAP: Automatic Curation of Coding Agent Benchmarks from Interactive Production Usage
REAP automatically curates production-derived benchmarks for AI coding agents via LLM classification and stability checks, producing the Harvest benchmark with model solve rates of 42.9-58.2%.
-
Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents
A rubric-based generative reward model improves reinforced fine-tuning of SWE agents by supplying richer behavioral guidance than binary terminal rewards alone.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
Revisiting DAgger in the Era of LLM-Agents
DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.
-
LLM-Guided Issue Generation from Uncovered Code Segments
IssueSpecter combines coverage analysis with LLM defect detection to generate prioritized, actionable issue reports, achieving 84.6% validity on manually reviewed issues from 13 Python projects and outperforming a cov...
-
CoT-Guard: Small Models for Strong Monitoring
CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
-
LLM-Guided Issue Generation from Uncovered Code Segments
IssueSpecter generates prioritized actionable bug reports from uncovered code using LLMs and coverage analysis, with 84.6% validity in manual checks on top issues from 13 Python projects and outperforming a baseline tool.
-
Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering
LLM judges for code tasks show high sensitivity to prompt biases that systematically favor certain options, changing accuracy and model rankings even when code is unchanged.
-
Yet Even Less Is Even Better For Agentic, Reasoning, and Coding LLMs
STITCH trains superior agentic coding and reasoning LLMs by using fewer high-quality trajectories filtered to keep only critical decision tokens, delivering up to 63% relative gains on SWE-bench Verified.
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.