pith. machine review for the scientific record.

arxiv: 2502.18449 · v2 · submitted 2025-02-25 · 💻 cs.SE · cs.AI · cs.CL

Recognition: 3 Lean theorem links

SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 10:23 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords reinforcement learning · large language models · software engineering · reasoning · code generation · software evolution · SWE-bench

The pith

Reinforcement learning on open software evolution data enables LLMs to recover developer reasoning and solve 41% of real GitHub issues on SWE-bench Verified.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SWE-RL, which applies reinforcement learning directly to the full record of software projects: code snapshots, code changes, issues, and pull requests. It uses a simple rule-based similarity score between model outputs and the actual developer solutions as the reward signal. This setup lets the model learn from massive real-world data rather than curated math or coding problems. The resulting 70B model reaches a 41.0% solve rate on the human-verified SWE-bench Verified benchmark, matching some larger proprietary systems, and shows gains on out-of-domain tasks where supervised fine-tuning hurts performance. The central argument is that software evolution supplies a scalable training signal for genuine reasoning improvements.

Core claim

SWE-RL applies reinforcement learning with a lightweight rule-based similarity reward to open-source software evolution data, allowing a Llama 3 base model to autonomously reconstruct developer reasoning processes; the trained Llama3-SWE-RL-70B model solves 41.0% of tasks on SWE-bench Verified, the highest reported result for models under 100B parameters and comparable to leading closed models, while also improving performance on five out-of-domain benchmarks: function coding, library use, code reasoning, mathematics, and general language understanding.

What carries the argument

SWE-RL, the reinforcement learning procedure that treats the similarity score between LLM-generated patches and ground-truth developer solutions as the sole reward, applied at scale to software evolution records of code changes and events.
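The page later quotes the paper as computing this reward with Python's difflib.SequenceMatcher over the predicted and oracle patch. A minimal sketch under that reading, with the function name chosen here for illustration:

```python
import difflib

def patch_reward(predicted_patch: str, oracle_patch: str) -> float:
    """Rule-based reward: similarity of the generated patch to the
    developer's ground-truth patch, a value in [0, 1]."""
    return difflib.SequenceMatcher(None, predicted_patch, oracle_patch).ratio()
```

Identical patches score 1.0 and unrelated text tends toward 0.0; no tests are executed, which is what makes the reward cheap to compute at the scale of software evolution records.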

If this is right

  • LLMs can acquire transferable reasoning skills by learning from the natural evolution of real codebases rather than synthetic problems alone.
  • Software engineering data at scale forms an effective domain for reinforcement learning that avoids the performance degradation seen in supervised fine-tuning on out-of-domain tasks.
  • A rule-based reward derived from existing project records is sufficient to drive measurable gains on practical issue-resolution benchmarks.
  • Medium-sized open models can reach performance levels previously associated only with much larger or proprietary systems when trained via RL on evolution data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same RL-on-evolution approach could transfer to other domains that maintain long records of incremental changes, such as scientific code or legal document histories.
  • Observed generalization across coding, math, and language tasks implies that software evolution data encodes broadly applicable reasoning structures.
  • Future scaling of this method would benefit from testing whether richer reward signals, such as test-suite execution outcomes, produce further gains beyond the similarity metric.
  • Open-source project histories offer a low-cost, continually growing data source that could reduce dependence on curated reasoning datasets.

Load-bearing premise

A simple similarity score between generated and ground-truth code solutions supplies a reward that teaches real reasoning instead of surface-level pattern matching.

What would settle it

Demonstration that high benchmark scores arise from memorizing training-project patterns rather than solving previously unseen issues, or that removing the RL stage and using only supervised fine-tuning on the same data yields equal or better results on SWE-bench Verified.

read the original abstract

The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces SWE-RL, the first RL-based approach to scale LLM reasoning on real-world software engineering by training on open-source software evolution data (code snapshots, changes, issues, PRs) using a lightweight rule-based similarity reward between ground-truth and generated patches. It reports that Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified, claimed as the best result for models under 100B parameters and comparable to GPT-4o, while also exhibiting emergent generalization to five out-of-domain tasks (function coding, library use, code reasoning, mathematics, general language understanding) where a supervised fine-tuning baseline degrades performance.

Significance. If the results are robust, this would be a significant contribution to the field by demonstrating that RL on massive software evolution corpora can recover developer-like reasoning and produce generalized capabilities beyond the training distribution, extending RL successes from math/coding to practical SE and opening a scalable data source for future work.

major comments (3)
  1. [Abstract] Abstract: The central performance claim (41.0% on SWE-bench Verified) and the generalization result both rest on the unverified assumption that a lightweight rule-based similarity score between generated and ground-truth patches constitutes an effective RL reward for genuine reasoning; without the exact definition of this score, its scaling, or any execution/test-passing component, it is impossible to rule out that gains arise from lexical/template matching rather than causal understanding of the issues.
  2. [Method and Experiments] Method and Experiments sections: No ablation studies, reward variants, or comparisons to execution-based rewards are reported, nor are details provided on data curation, training dynamics, or statistical significance of the 41.0% result (e.g., variance across seeds or tests against baselines); these omissions make the out-of-domain generalization claim load-bearing yet unsupported.
  3. [Abstract and Results] Abstract and Results: The claim that SWE-RL yields 'generalized reasoning skills' while SFT degrades performance requires explicit controls showing that the RL objective (rather than data volume or model scale) drives the difference; the current presentation leaves open the possibility that the observed pattern is an artifact of the particular training setup.
minor comments (2)
  1. [Abstract] Abstract: The statement 'to our knowledge, this is the best performance reported for medium-sized (<100B) LLMs' should cite the specific competing models and their reported scores for direct comparison.
  2. [Method] Throughout: Clarify whether the similarity reward operates on raw patches, normalized diffs, or AST-level representations, as this affects reproducibility.
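The second minor comment can be made concrete: under a plain difflib-style similarity, patches that differ only in surface form receive different rewards, so whether the score sees raw patches, normalized diffs, or AST-level representations changes the training signal. A small illustration, not the paper's pipeline:

```python
import difflib

def sim(a: str, b: str) -> float:
    """Text similarity in [0, 1] via difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

raw = "if x == None:\n    return 0\n"
equivalent = "if x is None:\n    return 0\n"   # an equivalent fix in idiomatic form
reformatted = "if x==None:\n\treturn 0\n"      # same code, different whitespace

# Neither variant scores a perfect 1.0 against the raw patch, even though
# all three behave the same for typical inputs; the representation the
# reward operates on therefore matters for reproducibility.
print(sim(raw, equivalent), sim(raw, reformatted))
```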

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below and will revise the manuscript accordingly to strengthen clarity and support for the claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central performance claim (41.0% on SWE-bench Verified) and the generalization result both rest on the unverified assumption that a lightweight rule-based similarity score between generated and ground-truth patches constitutes an effective RL reward for genuine reasoning; without the exact definition of this score, its scaling, or any execution/test-passing component, it is impossible to rule out that gains arise from lexical/template matching rather than causal understanding of the issues.

    Authors: We appreciate the referee pointing this out. The current manuscript describes the reward only at a high level as a lightweight rule-based similarity score (e.g., between ground-truth and generated patches) without providing the exact formulation, scaling details, or explicit comparison to execution-based alternatives. We will revise the abstract and add a precise definition plus computation details in the Methods section, along with a discussion of its design choices and limitations (including the lack of test-passing verification). This will allow readers to better assess whether the gains reflect reasoning or other factors, while retaining the scalable nature of the approach. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: No ablation studies, reward variants, or comparisons to execution-based rewards are reported, nor are details provided on data curation, training dynamics, or statistical significance of the 41.0% result (e.g., variance across seeds or tests against baselines); these omissions make the out-of-domain generalization claim load-bearing yet unsupported.

    Authors: We agree that the current version lacks these elements, which weakens the support for the generalization results. In the revised manuscript we will add ablation studies on reward components and variants, include comparisons to execution-based rewards where feasible, expand the data curation and training dynamics descriptions, and report statistical significance including variance across seeds for the main SWE-bench result and baseline comparisons. revision: yes

  3. Referee: [Abstract and Results] Abstract and Results: The claim that SWE-RL yields 'generalized reasoning skills' while SFT degrades performance requires explicit controls showing that the RL objective (rather than data volume or model scale) drives the difference; the current presentation leaves open the possibility that the observed pattern is an artifact of the particular training setup.

    Authors: The SFT baseline was constructed using the identical data volume, model scale, and initialization as the RL run to isolate the objective. We will revise the abstract and Results section to state this control explicitly, add supporting analysis of training dynamics, and discuss why the observed pattern (RL improvement vs. SFT degradation) is attributable to the RL objective rather than other setup factors. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper trains an RL model using a rule-based similarity reward computed directly against ground-truth patches drawn from open-source evolution data, then evaluates the resulting model on the independent, human-verified SWE-bench Verified benchmark and on separate out-of-domain tasks. No equations, fitted parameters, or self-citations are presented that would reduce the claimed performance gains or generalization to the training reward by construction. The chain is therefore non-circular: standard RL is applied to external data, with results measured on held-out benchmarks rather than through self-referential definitions or renamed fits.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that software evolution data encodes transferable reasoning patterns and that a simple similarity-based reward is sufficient to elicit them via RL; no invented entities are introduced.

free parameters (2)
  • RL training hyperparameters
    Standard parameters such as learning rate and batch size are required for the RL process but not detailed in the abstract.
  • Reward scaling coefficient
    Scaling applied to the similarity score reward to stabilize training.
axioms (2)
  • domain assumption Similarity score between generated and ground-truth solutions is a valid proxy for reasoning quality
    Directly used as the lightweight rule-based reward signal.
  • domain assumption Open-source software evolution data contains generalizable reasoning signals
    Basis for training on lifecycle records including code changes and events.
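The ledger's reward scaling coefficient is not pinned down anywhere on this page; one common pattern in policy-gradient training is to mean-center rewards within a batch and apply a scale factor before the update. The helper below is a hypothetical sketch of that pattern, not the paper's implementation:

```python
def scaled_advantages(rewards, scale=1.0):
    """Mean-center a batch of similarity rewards, then apply a scale factor.

    `scale` stands in for the ledger's reward scaling coefficient; the
    centering scheme is an assumption, not a detail from the paper.
    """
    mean = sum(rewards) / len(rewards)
    return [scale * (r - mean) for r in rewards]
```

Centering keeps above-average patches pushed up and below-average ones pushed down regardless of the raw similarity level, which is one reason a scaling knob can affect training stability.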

pith-pipeline@v0.9.0 · 5642 in / 1557 out tokens · 56161 ms · 2026-05-15T10:23:06.408182+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    "Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer’s reasoning processes..."

  • Foundation.LawOfExistence defect_zero_iff_one · contradicts

    CONTRADICTS: the theorem conflicts with this paper passage, or marks a claim that would need revision before publication.

    "the reward is a similarity score (between 0 and 1) of the predicted and the oracle patch calculated by Python’s difflib.SequenceMatcher"

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    UNCLEAR: relation between the paper passage and the cited Recognition theorem.

    "Llama3-SWE-RL-70B achieves a 41.0% solve rate on SWE-bench Verified"

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  2. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  3. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  4. RewardHarness: Self-Evolving Agentic Post-Training

    cs.AI 2026-05 unverdicted novelty 7.0

    RewardHarness self-evolves a tool-and-skill library from 100 preference examples to reach 47.4% accuracy on image-edit evaluation, beating GPT-5, and yields stronger RL-tuned models.

  5. Agentic Discovery of Exchange-Correlation Density Functionals

    cs.AI 2026-05 conditional novelty 7.0

    An agentic LLM system discovers the XC functional SAFS26-a that improves on the ωB97M-V baseline by roughly 9% on a held-out thermochemistry dataset while warning that such systems can exploit unphysical shortcuts.

  6. Faithful Mobile GUI Agents with Guided Advantage Estimator

    cs.AI 2026-05 unverdicted novelty 7.0

    Faithful-Agent raises Trap SR in GUI agents from 13.88% to 80.21% via faithfulness-oriented SFT and GuAE-enhanced RFT with consistency rewards while retaining general performance.

  7. ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

    cs.SE 2026-04 unverdicted novelty 7.0

    ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

  8. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  9. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  10. Revisiting DAgger in the Era of LLM-Agents

    cs.LG 2026-05 conditional novelty 6.0

    DAgger-style training with turn-level policy interpolation raises 4B and 8B LLM agents to 27.3% and 29.8% on SWE-bench Verified, beating several larger published systems.

  11. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoostAPR uses supervised fine-tuning on verified fixes, dual sequence- and line-level reward models from execution feedback, and PPO to reach 40.7% on SWE-bench Verified with strong cross-language results.

  12. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 6.0

    BoostAPR boosts automated program repair by training a sequence-level assessor and line-level credit allocator from execution outcomes, then applying them in PPO to reach 40.7% on SWE-bench Verified.

  13. Improving LLM Code Generation via Requirement-Aware Curriculum Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 6.0

    REC RL improves LLM code generation by automatically assessing and optimizing requirement difficulty with adaptive curriculum sampling, yielding 1.23-5.62% Pass@1 gains over baselines.

  14. CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

    cs.SE 2026-04 unverdicted novelty 6.0

    CodePivot uses Python as a pivot language plus an Aggressive-Partial-Functional RL reward to train a 7B model that outperforms much larger LLMs on multilingual code transpilation without parallel corpora.

  15. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

  16. TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

    cs.DC 2026-04 unverdicted novelty 6.0

    TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

  17. ARuleCon: Agentic Security Rule Conversion

    cs.CR 2026-04 unverdicted novelty 6.0

    ARuleCon uses AI agents plus execution-based checks to convert SIEM rules across vendors with 15% higher fidelity than standard LLM translation.

  18. Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

    cs.LG 2026-04 unverdicted novelty 6.0

    Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

  19. Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis

    cs.SE 2026-04 unverdicted novelty 5.0

    SpecRL uses the fraction of negative tests rejected by candidate specifications as a reward signal in RL training to produce stronger and more verifiable formal specifications than prior methods.

  20. Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures

    cs.SE 2026-04 unverdicted novelty 4.0

    Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.

  21. Curr-RLCER: Curriculum Reinforcement Learning For Coherence Explainable Recommendation

    cs.IR 2026-04 unverdicted novelty 4.0

    Curr-RLCER applies curriculum reinforcement learning with coherence-driven rewards to align generated explanations with predicted ratings in explainable recommendation systems.

Reference graph

Works this paper leans on

192 extracted references · 192 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    Claude 3.5 sonnet model card addendum

    Anthropic. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card, 2024 a

  2. [2]

    Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet

    Anthropic. Raising the bar on SWE-bench Verified with Claude 3.5 Sonnet . https://www.anthropic.com/research/swe-bench-sonnet, 2024 b

  3. [4]

    Codet: Code generation with generated tests

    Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations, 2023. https://openreview.net/forum?id=ktrw68Cmu9c

  4. [5]

    Evaluating large language models trained on code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  5. [6]

    Meta large language model compiler: Foundation models of compiler optimization, 2024

    Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, and Hugh Leather. Meta large language model compiler: Foundation models of compiler optimization, 2024. https://arxiv.org/abs/2407.02524

  6. [7]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. Deepseek-v3 technical report, 2024. https://arxiv.org/abs/2412.19437

  7. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. https://arxiv.org/abs/2501.12948

  8. [9]

    DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao...

  9. [10]

    Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023

    Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. Large language models are zero-shot fuzzers: Fuzzing deep-learning libraries via large language models, 2023

  10. [11]

    Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion

    Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Crosscodeeval: A diverse and multilingual benchmark for cross-file code completion. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, ...

  11. [13]

    The Llama 3 Herd of Models

    AI@Meta: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, and Angela Fan et al. The llama 3 herd of models, 2024. https://arxiv.org/abs/2407.21783

  12. [14]

    Aider is ai pair programming in your terminal

    Paul Gauthier. Aider is ai pair programming in your terminal. https://aider.chat/, 2024

  13. [15]

    Rlef: Grounding code llms in execution feedback with reinforcement learning

    Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Quentin Carbonneaux, Taco Cohen, and Gabriel Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning, 2025. https://arxiv.org/abs/2410.02089

  14. [16]

    Github rest api documentation

    GitHub. Github rest api documentation. https://docs.github.com/en/rest?apiVersion=2022-11-28, 2022. Accessed: 2025-02-24

  15. [17]

    GitHub. Github. https://github.com, 2025

  16. [18]

    Gh archive

    Ilya Grigorik. Gh archive. https://www.gharchive.org/, 2025. Accessed: 2025-02-23

  17. [19]

    CRUXEval: A benchmark for code reasoning, understanding and execution

    Alex Gu, Baptiste Roziere, Hugh James Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. In Proceedings of the 41st ICML, volume 235 of Proceedings of Machine Learning Research, pages 16568--16621. PMLR, 21--27 Jul 2024. https://proceedings.mlr.press/v235/gu24c.html

  18. [20]

    Deepseek-coder: When the large language model meets programming -- the rise of code intelligence

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. Deepseek-coder: When the large language model meets programming -- the rise of code intelligence, 2024

  19. [21]

    Measuring coding challenge competence with apps, 2021 a

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring coding challenge competence with apps, 2021 a

  20. [22]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021 b . https://openreview.net/forum?id=d7KBjmI3GmQ

  21. [23]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021 c . https://openreview.net/forum?id=7Bywt2mQsCe

  22. [24]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-coder technical report, 2024. https://arxiv.org/abs/2...

  23. [25]

    Testgeneval: A real world unit test generation and test completion benchmark, 2024 a

    Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark, 2024 a . https://arxiv.org/abs/2410.00752

  24. [26]

    Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024 b

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024 b

  25. [27]

    Impact of code language models on automated program repair, 2023

    Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair, 2023

  26. [28]

    Swe-bench: Can language models resolve real-world github issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2023

  27. [29]

    Swe-bench leaderboard

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench leaderboard. https://www.swebench.com, 2024. Accessed: 2025-02-04

  28. [30]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017. https://arxiv.org/abs/1412.6980

  29. [31]

    Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models

    Caroline Lemieux, Jeevana Priya Inala, Shuvendu K Lahiri, and Siddhartha Sen. Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 919--931. IEEE, 2023

  30. [32]

    Starcoder: may the source be with you!, 2023

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Loge...

  31. [34]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. https://openreview.net/forum?id=1qvx610Cu7

  32. [35]

    Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, and Lingming Zhang. Evaluating language models for efficient code generation. In First Conference on Language Modeling, 2024a. https://openreview.net/forum?id=IBCBMeAhmC

  33. [36]

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. In The Twelfth International Conference on Learning Representations, 2024b. https://openreview.net/forum?id=pPjZIOuQuF

  34. [37]

    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Z...

  35. [38]

    Yingwei Ma, Rongyu Cao, Yongchang Cao, Yue Zhang, Jue Chen, Yibo Liu, Yuchen Liu, Binhua Li, Fei Huang, and Yongbin Li. Lingma swe-gpt: An open development-process-centric language model for automated software improvement, 2024. https://arxiv.org/abs/2411.00622

  36. [39]

    OpenAI. GPT-4o system card, 2024a. https://arxiv.org/abs/2410.21276

  37. [40]

    OpenAI. OpenAI o1 system card, 2024b. https://arxiv.org/abs/2412.16720

  38. [41]

    OpenAI. simple-evals. https://github.com/openai/simple-evals, 2024. Accessed: 2025-02-23

  39. [42]

    OpenAI. Introducing swe-bench verified. https://openai.com/index/introducing-swe-bench-verified, 2024. Accessed: 2025-02-04

  40. [43]

    Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym, 2024. https://arxiv.org/abs/2412.21139

  41. [44]

    John W. Ratcliff and David E. Metzener. Pattern Matching: The Gestalt Approach . Dr. Dobb's Journal, page 46, July 1988. https://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970
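    The gestalt pattern-matching approach described in this Dr. Dobb's article is what Python's `difflib.SequenceMatcher` implements, and it is the kind of lightweight sequence-similarity measure the paper's rule-based reward is built on. A minimal sketch, using invented patch strings (the variable names and example diff are illustrative, not taken from the paper):

```python
import difflib

# SequenceMatcher implements the gestalt algorithm from the cited
# article: find the longest matching block, recurse on the unmatched
# pieces on either side, and score similarity as 2*M/T, where M is the
# number of matched characters and T is the combined length.
old_patch = "def add(a, b):\n    return a + b\n"
new_patch = "def add(a, b):\n    return a - b\n"

ratio = difflib.SequenceMatcher(None, old_patch, new_patch).ratio()
print(f"{ratio:.5f}")  # 0.96875: the two 32-char patches differ by one character
```

    A continuous score like this rewards near-miss patches instead of the all-or-nothing signal of test execution, which is what makes it cheap enough to apply at scale.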

  42. [45]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thoma...

  43. [46]

    Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 2023

  44. [47]

    John Schulman. Approximating kl divergence. http://joschu.net/blog/kl-approx.html, 2020. Accessed: 2025-02-22

  45. [48]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. https://arxiv.org/abs/2402.03300

  46. [49]

    Sida I. Wang, Alex Gu, Lovish Madaan, Dieuwke Hupkes, Jiawei Liu, Yuxiang Wei, Naman Jain, Yuhang Lai, Sten Sootla, Ofir Press, Baptiste Rozière, and Gabriel Synnaeve. EvalArena: noise and errors on LLM evaluations. https://github.com/crux-eval/eval-arena, 2024a

  47. [50]

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...

  48. [51]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. https://openreview.net/forum?id=_V...

  49. [52]

    Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the copilots: Fusing large language models with completion engines for automated program repair, 2023

  50. [53]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 52632--52657. PMLR, 21--27 Jul 2024. https://proceedings.mlr.press/v235/wei24h.html

  51. [54]

    Chunqiu Steven Xia and Lingming Zhang. Less training, more repairing please: Revisiting automated program repair via zero-shot learning, 2022

  52. [55]

    Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. Universal fuzzing via large language models, 2023a

  53. [56]

    Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. Automated program repair in the era of large pre-trained language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1482--1494, 2023b. doi:10.1109/ICSE48619.2023.00129

  54. [57]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. https://arxiv.org/abs/2407.01489

  55. [58]

    Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, and Kai Chen. Swe-fixer: Training open-source llms for effective and efficient github issue resolution, 2025

  56. [59]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tingyu X...

  57. [60]

    John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik R Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024b. https://openreview.net/forum?id=mXpq6ut8J3

  58. [61]

    John Yang, Carlos E. Jimenez, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik R. Narasimhan, Diyi Yang, Sida I. Wang, and Ofir Press. Swe-bench multimodal: Do AI systems generalize to visual software domains?, 2024c. https://arxiv.org/abs/2410.03859

  59. [62]

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. https://arxiv.org/abs/2502.03373

  60. [64]

    Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, and Wenhu Chen. Acecoder: Acing coder RL via automated test-case synthesis. arXiv preprint arXiv:2502.01718, 2025

  61. [65]

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. Repocoder: Repository-level code completion through iterative retrieval and generation, 2023

  62. [66]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement, 2024. https://arxiv.org/abs/2404.05427

  63. [67]

    Wenting Zhao, Nan Jiang, Celine Lee, Justin T Chiu, Claire Cardie, Matthias Gallé, and Alexander M Rush. Commit0: Library generation from scratch, 2024. https://arxiv.org/abs/2412.01769

  64. [69]

    Albert Örwall. Moatless tools. https://github.com/aorwall/moatless-tools, 2024

  65. [70]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, et al. Code Llama: Open Foundation Models for Code, 2023. https://arxiv.org/abs/2308.12950

  66. [71]

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct, 2023. https://arxiv.org/abs/2306.08568

  67. [72]

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching Large Language Models to Self-Debug, 2023. https://arxiv.org/abs/2304.05128

  68. [73]

    Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program Synthesis. Foundations and Trends® in Programming Languages, 2017. doi:10.1561/2500000010

  69. [74]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  70. [75]

    Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In The Eleventh International Conference on Learning Representations, 2023

  71. [76]

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021. doi:10.18653/v1/2021.emnlp-main.685

  72. [77]

    Amazon Web Services

  73. [78]

    Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. Copiloting the Copilots: Fusing Large Language Models with Completion Engines for Automated Program Repair, 2023

  74. [79]

    Sahil Chaudhary. Code Alpaca: An Instruction-following LLaMA Model for Code Generation. GitHub repository, 2023

  75. [80]

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An Instruction-following LLaMA Model. GitHub repository, 2023

  76. [81]

    Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned Language Models Are Zero-Shot Learners, 2022. https://arxiv.org/abs/2109.01652

  77. [82]

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv preprint arXiv:2304.12244, 2023

  78. [83]

    Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568, 2023

  79. [84]

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. PaLM: Scaling Language Modeling with Pathways, 2022. https://arxiv.org/abs/2204.02311

  80. [85]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, et al. Training Compute-Optimal Large Language Models, 2022. https://arxiv.org/abs/2203.15556

Showing first 80 references.